Quantum Edward uses the BBVI (Black Box Variational Inference) training algorithm. Back propagation, popularized by Rumelhart, Hinton, and Williams, seems to be a fundamental part of most ANN (Artificial Neural Network) training algorithms, where it is used to compute the gradients of the cost function that determine the parameter update at each iteration. Hence, I was baffled, even skeptical, upon first encountering the BBVI algorithm, because it does not use back prop. The purpose of this blog post is to shed light on how BBVI can get away with this.

Before I start, let me explain what the terms “hidden (or latent) variable” and “hidden parameter” mean to AI researchers. Hidden variables are the opposite of “observed variables”. In Dustin Tran’s tutorials for Edward, he often represents observed variables by $x$ and hidden variables by $z$. I will use $\theta$ instead of $z$, so $z = \theta$ below. The data consists of many samples of the observed variable $x$. The goal is to find a probability distribution for the hidden variables $\theta$. A hidden parameter is a special type of hidden variable. In the language of Bayesian networks, a hidden parameter corresponds to a root node (one without any parents) whose node probability distribution is a Kronecker delta function, so, in effect, the node only ever achieves one of its possible states.

Next, we compare algos that use back prop to the BBVI algo, assuming the simplest case of a single hidden parameter $\theta$ (normally, there is more than one hidden parameter). We will assume $\theta \in [0, 1]$. In quantum neural nets, the hidden parameters are angles by which qubits are rotated. Such angles range over a closed interval, for example, $[0, 2\pi]$. After normalization of the angles, their ranges can be assumed, without loss of generality, to be $[0, 1]$.

CASE1: Algorithms that use back prop.

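Suppose $\theta \in [0, 1]$ and $\eta > 0$. Consider a cost function $C$ and a model function $M$ such that

$$C(\theta) = C(M(\theta)).$$

If we define the change $d\theta$ in $\theta$ by

$$d\theta = -\eta \frac{dC}{d\theta} = -\eta \frac{dC}{dM} \frac{dM}{d\theta},$$

then the corresponding change in the cost is

$$dC = d\theta \frac{dC}{d\theta} = -\eta \left(\frac{dC}{d\theta}\right)^2.$$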

This change in the cost is negative, which is what one wants if one wants to minimize the cost.
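
Here is a minimal sketch of the CASE1 update, assuming a hypothetical toy model $M(\theta) = \theta^2$ and a hypothetical quadratic cost $C(M) = (M - 0.25)^2$ (neither comes from Quantum Edward). It shows that the derivative of the model is needed for the chain rule, and that the first-order change in the cost, $-\eta (dC/d\theta)^2$, is never positive.

```python
# Toy gradient descent on a single parameter theta in [0, 1].
# The model, the cost and the target value are illustrative assumptions.

TARGET = 0.25  # hypothetical value the model output should reach


def model(theta):
    return theta ** 2


def dmodel_dtheta(theta):
    # CASE1 needs this derivative of the model explicitly.
    return 2.0 * theta


def cost_of_model(m):
    return (m - TARGET) ** 2


def dcost_dmodel(m):
    return 2.0 * (m - TARGET)


def gradient_step(theta, eta):
    """Return the updated theta and the first-order predicted change in cost."""
    m = model(theta)
    dcost_dtheta = dcost_dmodel(m) * dmodel_dtheta(theta)  # chain rule
    d_theta = -eta * dcost_dtheta
    predicted_dcost = -eta * dcost_dtheta ** 2
    return theta + d_theta, predicted_dcost


theta, eta = 0.9, 0.1
costs = [cost_of_model(model(theta))]
for _ in range(200):
    theta, predicted_dcost = gradient_step(theta, eta)
    costs.append(cost_of_model(model(theta)))

print("final theta:", theta)
print("final cost:", costs[-1])
```

With these choices the cost is minimized at $\theta = 0.5$, and the loop should settle there from the starting point $\theta = 0.9$.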

CASE2: BBVI algo

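Suppose $\theta \in [0, 1]$ and $\eta > 0$. Consider a reward function $R$ (for BBVI, $R$ = ELBO), a model function $M$, and a distance function $\mathrm{dist}(x, y) \geq 0$ such that

$$R(\lambda) = R\left(\sum_\theta \mathrm{dist}[M(\theta), P(\theta \mid \lambda)]\right).$$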

In the last expression, $P(\theta \mid \lambda)$ is a conditional probability distribution. More specifically, let us assume that $P(\theta \mid \lambda)$ is the Beta distribution. Check out its Wikipedia article:

https://en.wikipedia.org/wiki/Beta_distribution

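The Beta distribution depends on two positive parameters $\alpha, \beta$ (the distribution takes its name from the Beta function $B(\alpha, \beta)$ that normalizes it). $\alpha, \beta$ are often called concentrations. Below, we will use the notation

$$c_1 = \alpha > 0, \quad c_2 = \beta > 0, \quad \lambda = (c_1, c_2).$$

Using this notation,

$$P(\theta \mid \lambda) = \mathrm{Beta}(\theta; c_1, c_2).$$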

According to the Wikipedia article for the Beta distribution, the mean value of $\theta$ is given in terms of its 2 concentrations by the simple expression

$$\langle \theta \rangle = \frac{c_1}{c_1 + c_2}.$$

The variance of $\theta$ is given by a fairly simple expression of $c_1$ and $c_2$ too. Look it up in the Wikipedia article for the Beta distribution, if interested.
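
As a quick sanity check of the mean formula, here is a small sketch that compares it with samples drawn by `random.betavariate` from the Python standard library. The concentrations 2 and 5 and the sample count are arbitrary choices for illustration. The variance formula used here, $c_1 c_2 / ((c_1 + c_2)^2 (c_1 + c_2 + 1))$, is the standard one for the Beta distribution.

```python
import random


def beta_mean(c1, c2):
    """Mean of Beta(c1, c2)."""
    return c1 / (c1 + c2)


def beta_variance(c1, c2):
    """Variance of Beta(c1, c2)."""
    total = c1 + c2
    return c1 * c2 / (total ** 2 * (total + 1.0))


rng = random.Random(0)          # fixed seed so the run is repeatable
c1, c2 = 2.0, 5.0               # arbitrary example concentrations
samples = [rng.betavariate(c1, c2) for _ in range(100_000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

print("analytic mean:", beta_mean(c1, c2), " sample mean:", sample_mean)
print("analytic var: ", beta_variance(c1, c2), " sample var: ", sample_var)
```

The sample statistics should agree with the analytic values up to Monte Carlo noise.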

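If we define the change $dc_j$ in the two concentrations by

$$dc_j = \eta \frac{\partial R}{\partial c_j}$$

for $j = 1, 2$, then the change in the reward function $R$ will be

$$dR = \sum_{j=1,2} dc_j \frac{\partial R}{\partial c_j} = \eta \sum_{j=1,2} \left(\frac{\partial R}{\partial c_j}\right)^2.$$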

This change in the reward is positive, which is what one wants if one wants to maximize the reward.
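
The update rule for the concentrations can be sketched as follows. To keep the example deterministic, it assumes a hypothetical toy reward (not the ELBO) that is largest when the Beta mean $c_1/(c_1 + c_2)$ equals 0.3, so that the partial derivatives are available in closed form. The point is only to show gradient ascent on $(c_1, c_2)$ and the sign of the first-order change in the reward.

```python
# Gradient ascent on the two Beta concentrations (c1, c2).
# The reward below is an illustrative assumption, NOT the ELBO.

TARGET_MEAN = 0.3  # hypothetical mean that the toy reward prefers


def beta_mean(c1, c2):
    return c1 / (c1 + c2)


def reward(c1, c2):
    return -(beta_mean(c1, c2) - TARGET_MEAN) ** 2


def reward_gradient(c1, c2):
    """Partial derivatives of the toy reward with respect to c1 and c2."""
    total = c1 + c2
    common = -2.0 * (beta_mean(c1, c2) - TARGET_MEAN)
    dmean_dc1 = c2 / total ** 2
    dmean_dc2 = -c1 / total ** 2
    return common * dmean_dc1, common * dmean_dc2


def ascent_step(c1, c2, eta):
    """Apply dc_j = eta * dR/dc_j and return the first-order predicted dR."""
    g1, g2 = reward_gradient(c1, c2)
    predicted_dreward = eta * (g1 ** 2 + g2 ** 2)
    return c1 + eta * g1, c2 + eta * g2, predicted_dreward


c1, c2, eta = 1.0, 1.0, 0.5
rewards = [reward(c1, c2)]
for _ in range(500):
    c1, c2, predicted_dreward = ascent_step(c1, c2, eta)
    rewards.append(reward(c1, c2))

print("final concentrations:", c1, c2)
print("final Beta mean:", beta_mean(c1, c2))
```

In a real BBVI run the partial derivatives of the reward are not known in closed form and have to be estimated from samples of $\theta$.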

Comparison of CASE1 and CASE2

In CASE1, we need to calculate the derivative of the model $M$ with respect to the hidden parameter $\theta$:

$$\frac{dM(\theta)}{d\theta}$$

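In CASE2, we do not need to calculate any derivatives at all of the model $M$. (That is why it’s called a Black Box algo.) We do have to calculate the derivative of $P$ with respect to $c_1$ and $c_2$, but that can be done a priori since $P$ is known a priori to be the Beta distribution:

$$\frac{\partial}{\partial c_j} \sum_\theta \mathrm{dist}[M(\theta), P(\theta \mid \lambda)] = \sum_\theta \frac{\partial\, \mathrm{dist}}{\partial P} \frac{\partial P(\theta \mid \lambda)}{\partial c_j}$$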

So, in conclusion, in CASE1, we try to find the value of $\theta$ directly. In CASE2, we try to find the parameters $c_1$ and $c_2$ which describe the distribution of $\theta$. For an estimate of $\theta$, just use the mean $\langle \theta \rangle = c_1 / (c_1 + c_2)$ given above.
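
To make the "black box" point concrete, here is a hedged sketch of one common way to get the gradient with respect to the concentrations without differentiating the model: the score-function (log-derivative) identity $\partial_{c_j} E_P[f(\theta)] = E_P[f(\theta)\, \partial_{c_j} \log P(\theta \mid \lambda)]$. As a simplification, the reward here is just $E_P[f(\theta)]$ for a hypothetical black-box function $f$, not the full ELBO, and this is not Quantum Edward's own code. The digamma function is approximated by a numerical derivative of `math.lgamma` so that only the standard library is needed. A finite-difference gradient of a numerically integrated reward serves as an independent check.

```python
import math
import random


def black_box_model(theta):
    """Hypothetical stand-in for the model; it is only evaluated, never differentiated."""
    return math.sin(math.pi * theta) ** 2


def digamma(x, h=1e-5):
    """Approximate digamma as a central difference of log-gamma (an assumption made here)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


def log_beta_pdf(theta, c1, c2):
    log_norm = math.lgamma(c1) + math.lgamma(c2) - math.lgamma(c1 + c2)
    return (c1 - 1.0) * math.log(theta) + (c2 - 1.0) * math.log(1.0 - theta) - log_norm


def score_function_gradient(c1, c2, num_samples, rng):
    """Monte Carlo estimate of d/dc_j E[f(theta)] using only samples and log P."""
    psi_sum, psi_1, psi_2 = digamma(c1 + c2), digamma(c1), digamma(c2)
    g1 = g2 = 0.0
    for _ in range(num_samples):
        theta = rng.betavariate(c1, c2)
        theta = min(max(theta, 1e-12), 1.0 - 1e-12)  # keep the logs finite
        f = black_box_model(theta)
        # Derivatives of log Beta(theta; c1, c2) with respect to c1 and c2.
        g1 += f * (math.log(theta) - psi_1 + psi_sum)
        g2 += f * (math.log(1.0 - theta) - psi_2 + psi_sum)
    return g1 / num_samples, g2 / num_samples


def expected_reward(c1, c2, n_grid=5000):
    """E[f(theta)] under Beta(c1, c2) by the midpoint rule (reference only)."""
    total = 0.0
    for i in range(n_grid):
        theta = (i + 0.5) / n_grid
        total += black_box_model(theta) * math.exp(log_beta_pdf(theta, c1, c2))
    return total / n_grid


def finite_difference_gradient(c1, c2, h=1e-4):
    d1 = (expected_reward(c1 + h, c2) - expected_reward(c1 - h, c2)) / (2.0 * h)
    d2 = (expected_reward(c1, c2 + h) - expected_reward(c1, c2 - h)) / (2.0 * h)
    return d1, d2


rng = random.Random(0)
c1, c2 = 2.0, 3.0
mc_grad = score_function_gradient(c1, c2, 100_000, rng)
fd_grad = finite_difference_gradient(c1, c2)
print("score-function estimate:", mc_grad)
print("finite-difference check:", fd_grad)
```

The two gradients should agree up to Monte Carlo noise, even though `black_box_model` is never differentiated. Only the logarithm of the Beta density is, and its derivative is known in advance.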