Quantum Bayesian Networks

July 4, 2018

Why doesn’t the BBVI (Black Box Variational Inference) algorithm use back propagation?

Filed under: Uncategorized — rrtucci @ 1:36 pm Quantum Edward uses the BBVI training algorithm. Back propagation, popularized by Rumelhart, Hinton, and Williams, is a fundamental part of most ANN (Artificial Neural Network) training algorithms, where it is used to compute the gradients that determine how the parameters, and hence the cost function, change at each iteration. Hence, I was very baffled, even skeptical, when I first encountered the BBVI algorithm, because it does not use back prop. The purpose of this blog post is to shed light on how BBVI can get away with this.

Before I start, let me explain what the terms “hidden (or latent) variable” and “hidden parameter” mean to AI researchers. Hidden variables are the opposite of “observed variables”. In Dustin Tran’s tutorials for Edward, he often represents observed variables by $x$ and hidden variables by $z$. I will use $\theta$ instead of $z$, so $z=\theta$ below. The data consists of many samples of the observed variable $x$. The goal is to find a probability distribution for the hidden variables $\theta$. A hidden parameter is a special type of hidden variable. In the language of Bayesian networks, a hidden parameter corresponds to a root node (one without any parents) whose node probability distribution is a Kronecker delta function, so, in effect, the node only ever achieves one of its possible states.

Next, we compare algos that use back prop to the BBVI algo, assuming the simplest case of a single hidden parameter $\theta$ (normally, there is more than one hidden parameter). We will assume $\theta\in [0, 1]$. In quantum neural nets, the hidden parameters are angles by which qubits are rotated. Such angles range over a closed interval, for example, $[0, 2\pi]$. After normalization of the angles, their ranges can be assumed, without loss of generality, to be $[0, 1]$.
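As a trivial sketch of that normalization step (the function name is mine, not Quantum Edward's):

```python
import math

def normalize_angle(angle, period=2 * math.pi):
    """Map a rotation angle in [0, period] to a hidden parameter in [0, 1]."""
    return angle / period

theta = normalize_angle(math.pi)  # an angle of pi maps to 0.5
```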

CASE1: Algorithms that use back prop.

Suppose $\theta \in [0, 1],\;\;\eta > 0.$ Consider a cost function $C$ and a model function $M$ such that $C(\theta) = C(M(\theta)).$

If we define the change $d\theta$ in $\theta$ by $d\theta = -\eta \frac{dC}{d\theta}= -\eta \frac{dC}{dM} \frac{dM}{d\theta},$

then the corresponding change in the cost is $d C = d\theta \frac{dC}{d\theta} = -\eta \left( \frac{dC}{d\theta}\right)^2.$

This change in the cost is negative, which is what one wants if one wants to minimize the cost.
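The update rule above can be checked numerically. The toy model $M$ and cost $C$ below are my own illustrative choices, not anything from Quantum Edward:

```python
import math

eta = 0.1  # learning rate

def M(theta):
    # toy model: a qubit-rotation-like amplitude
    return math.sin(theta)

def C(m):
    # toy cost: squared distance of the model output from a target of 0.5
    return (m - 0.5) ** 2

def dC_dtheta(theta):
    # chain rule: dC/dtheta = (dC/dM) * (dM/dtheta)
    dC_dM = 2.0 * (M(theta) - 0.5)
    dM_dtheta = math.cos(theta)
    return dC_dM * dM_dtheta

theta = 0.9
d_theta = -eta * dC_dtheta(theta)   # the update rule from the text
dC = d_theta * dC_dtheta(theta)     # = -eta * (dC/dtheta)^2, never positive
```

Taking the step `theta + d_theta` indeed lowers the cost, as the sign argument in the text predicts.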

CASE2: BBVI algo

Suppose $\theta \in [0, 1],\;\;\eta > 0,\;\; \lambda > 0.$ Consider a reward function $R$ (for BBVI, $R$ = the ELBO), a model function $M$, and a distance function ${\rm dist}(x, y)\geq 0$ such that $R(\lambda) = R\left[\sum_\theta {\rm dist}[M(\theta), P(\theta|\lambda)]\right].$

In the last expression, $P(\theta|\lambda)$ is a conditional probability distribution. More specifically, let us assume that $P(\theta|\lambda)$ is the Beta distribution. Check out its Wikipedia article

https://en.wikipedia.org/wiki/Beta_distribution

The Beta distribution depends on two positive parameters $\alpha, \beta$ (that is why it is called the Beta distribution). $\alpha, \beta$ are often called concentrations. Below, we will use the notation $c_1 = \alpha > 0,$ $c_2 = \beta > 0,$ $\lambda = (c_1, c_2).$

Using this notation, $P(\theta|\lambda) = {\rm Beta}(\theta; c_1, c_2).$

According to the Wikipedia article for the Beta distribution, the mean value of $\theta$ is given in terms of its 2 concentrations by the simple expression $\langle\theta\rangle = \frac{c_1}{c_1 + c_2}.$

The variance of $\theta$ is given by a fairly simple expression of $c_1$ and $c_2$ too. Look it up in the Wikipedia article for the Beta distribution, if interested.
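For concreteness, here are both closed-form expressions, sanity-checked against samples drawn with Python's standard library (the concentration values are arbitrary examples):

```python
import random

c1, c2 = 3.0, 2.0  # example concentrations (alpha, beta)

mean = c1 / (c1 + c2)                                # <theta> = 0.6
var = (c1 * c2) / ((c1 + c2) ** 2 * (c1 + c2 + 1))   # = 0.04

# sanity check the mean by Monte Carlo, using the stdlib Beta sampler
random.seed(0)
samples = [random.betavariate(c1, c2) for _ in range(200_000)]
sample_mean = sum(samples) / len(samples)
```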

If we define the change $dc_j$ in the two concentrations by $dc_j = \eta \frac{\partial R}{\partial c_j}$

for $j=1,2$, then the change in the reward function $R$ will be $dR = \sum_{j=1,2} dc_j \frac{\partial R}{\partial c_j}= \eta \sum_{j=1,2} \left(\frac{\partial R}{\partial c_j}\right)^2.$

This change in the reward is positive, which is what one wants if one wants to maximize the reward.
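Here is a toy numeric sketch of this ascent step. The reward below is an arbitrary stand-in of my own (the real BBVI reward is the ELBO), and the gradients are taken by finite differences:

```python
eta = 0.05  # learning rate

def reward(c1, c2):
    # toy stand-in for the ELBO: rewards <theta> = c1/(c1+c2) being near 0.7
    mean = c1 / (c1 + c2)
    return -(mean - 0.7) ** 2

def grad(f, c1, c2, h=1e-6):
    # central finite differences for dR/dc1 and dR/dc2
    g1 = (f(c1 + h, c2) - f(c1 - h, c2)) / (2 * h)
    g2 = (f(c1, c2 + h) - f(c1, c2 - h)) / (2 * h)
    return g1, g2

c1, c2 = 1.0, 1.0
g1, g2 = grad(reward, c1, c2)
dR = eta * (g1 ** 2 + g2 ** 2)   # first-order change, never negative
new_reward = reward(c1 + eta * g1, c2 + eta * g2)
```

Stepping the concentrations uphill along the gradient raises the reward, matching the sign argument in the text.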

Comparison of CASE1 and CASE2

In CASE1, we need to calculate the derivative of the model $M$ with respect to the hidden parameter $\theta$: $\frac{d}{d\theta}M(\theta).$

In CASE2, we do not need to calculate any derivatives at all of the model $M$. (That is why it's called a Black Box algo.) We do have to calculate the derivatives of $P(\theta|\lambda)$ with respect to $c_1$ and $c_2$, but those can be computed a priori, since $P(\theta|\lambda)$ is known a priori to be the Beta distribution: $\frac{d}{dc_j}\sum_\theta {\rm dist}[M(\theta), P(\theta|\lambda)]= \sum_\theta \frac{\partial\, {\rm dist}}{\partial P(\theta|\lambda)} \frac{\partial P(\theta|\lambda)}{\partial c_j}.$

So, in conclusion: in CASE1, we try to find the value of $\theta$ directly. In CASE2, we try to find the parameters $c_1$ and $c_2$ which describe the distribution of $\theta$'s. For a point estimate of $\theta$, just use the mean $\langle \theta \rangle$ given above.
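Continuing the Beta example, with concentration values that are purely illustrative (standing in for whatever the optimization returns):

```python
c1, c2 = 7.0, 3.0                 # concentrations found by the optimization (illustrative)
theta_estimate = c1 / (c1 + c2)   # point estimate <theta> = 0.7
```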