Quantum Bayesian Networks

July 4, 2018

Why doesn’t the BBVI (Black Box Variational Inference) algorithm use back propagation?

Filed under: Uncategorized — rrtucci @ 1:36 pm

Quantum Edward uses the BBVI training algorithm. Back propagation, popularized by Rumelhart, Hinton and Williams, is a fundamental part of most ANN (Artificial Neural Network) training algorithms, where it is used to compute the gradients of the cost function that determine the parameter update at each iteration. Hence, I was baffled, even skeptical, upon first encountering the BBVI algorithm, because it does not use back prop. The purpose of this blog post is to shed light on how BBVI can get away with this.

Before I start, let me explain what the terms “hidden (or latent) variable” and “hidden parameter” mean to AI researchers. Hidden variables are the opposite of “observed variables”. In Dustin Tran’s tutorials for Edward, he often represents observed variables by x and hidden variables by z. I will use \theta instead of z, so z=\theta below. The data consists of many samples of the observed variable x. The goal is to find a probability distribution for the hidden variables \theta. A hidden parameter is a special type of hidden variable. In the language of Bayesian networks, a hidden parameter corresponds to a root node (one without any parents) whose node probability distribution is a Kronecker delta function, so, in effect, the node only ever achieves one of its possible states.

Next, we compare algos that use back prop to the BBVI algo, assuming the simplest case of a single hidden parameter \theta (normally, there is more than one hidden parameter). We will assume \theta\in [0, 1]. In quantum neural nets, the hidden parameters are angles by which qubits are rotated. Such angles range over a closed interval, for example, [0, 2\pi]. After normalization of the angles, their ranges can be assumed, without loss of generality, to be [0, 1].

CASE1: Algorithms that use back prop.

Suppose \theta \in [0, 1],\;\;\eta > 0. Consider a cost function C and a model function M such that

C(\theta) = C(M(\theta)).

If we define the change d\theta in \theta by

d\theta = -\eta \frac{dC}{d\theta}= -\eta \frac{dC}{dM} \frac{dM}{d\theta},

then the corresponding change in the cost is

d C = d\theta \frac{dC}{d\theta} = -\eta \left( \frac{dC}{d\theta}\right)^2.

To first order in \eta, this change in the cost is negative (or zero at a stationary point), which is what one wants when minimizing the cost.
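As a minimal sketch of CASE1, the snippet below applies the update d\theta = -\eta (dC/dM)(dM/d\theta) to a toy problem. The model M(\theta) = \theta^2, the squared-error cost, the target value 0.25, and all function names are assumptions chosen for illustration; none of them come from Quantum Edward. Note that the update needs dM/d\theta explicitly, which is the quantity back prop would supply in a real network.

```python
def model(theta):
    """Hypothetical model M(theta); a stand-in for a network's output."""
    return theta ** 2


def dmodel_dtheta(theta):
    """dM/dtheta, the derivative that back prop would provide."""
    return 2.0 * theta


def cost(m, target=0.25):
    """Hypothetical cost C(M): squared error against an assumed target."""
    return (m - target) ** 2


def dcost_dmodel(m, target=0.25):
    """dC/dM for the squared-error cost."""
    return 2.0 * (m - target)


def gradient_descent(theta=0.9, eta=0.1, steps=200):
    """Repeat d_theta = -eta * dC/dM * dM/dtheta, keeping theta in [0, 1]."""
    history = [cost(model(theta))]
    for _ in range(steps):
        grad = dcost_dmodel(model(theta)) * dmodel_dtheta(theta)  # chain rule
        theta = min(1.0, max(0.0, theta - eta * grad))
        history.append(cost(model(theta)))
    return theta, history


theta_final, history = gradient_descent()
print(theta_final, history[0], history[-1])
```

For this toy cost the minimum sits at \theta = 0.5, and the recorded cost should not increase from one step to the next as long as \eta is small.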

CASE2: BBVI algo

Suppose \theta \in [0, 1],\;\;\eta > 0,\;\; \lambda > 0. Consider a reward function R (for BBVI, R = ELBO), a model function M, and a distance function dist(x, y)\geq 0 such that

R(\lambda) = R\left[\sum_\theta dist[M(\theta), P(\theta|\lambda)]\right].

In the last expression, P(\theta|\lambda) is a conditional probability distribution. More specifically, let us assume that P(\theta|\lambda) is the Beta distribution. Check out its Wikipedia article

https://en.wikipedia.org/wiki/Beta_distribution

The Beta distribution depends on two positive parameters \alpha, \beta. (Its name comes from the Beta function B(\alpha, \beta), which is its normalization constant.) \alpha, \beta are often called concentrations. Below, we will use the notation

c_1 = \alpha > 0,

c_2 = \beta  > 0,

\lambda = (c_1, c_2).

Using this notation,

P(\theta|\lambda) = {\rm Beta}(\theta; c_1, c_2).

According to the Wikipedia article for the Beta distribution, the mean value of \theta is given in terms of its 2 concentrations by the simple expression

\langle\theta\rangle = \frac{c_1}{c_1 + c_2}.

The variance of \theta is given by a fairly simple expression of c_1 and c_2 too. Look it up in the Wikipedia article for the Beta distribution, if interested.
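For convenience, here is a small sketch that evaluates the mean and the variance, c_1 c_2 / [(c_1 + c_2)^2 (c_1 + c_2 + 1)], and compares them with sample estimates. The function names and the example concentrations (2, 5) are assumptions made for illustration; the sampling uses `random.betavariate` from the Python standard library.

```python
import random


def beta_mean(c1, c2):
    """Mean of Beta(c1, c2): c1 / (c1 + c2)."""
    return c1 / (c1 + c2)


def beta_variance(c1, c2):
    """Variance of Beta(c1, c2): c1*c2 / ((c1+c2)^2 * (c1+c2+1))."""
    total = c1 + c2
    return c1 * c2 / (total ** 2 * (total + 1.0))


# Compare the closed forms with empirical estimates (fixed seed for repeatability).
rng = random.Random(0)
samples = [rng.betavariate(2.0, 5.0) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

print(beta_mean(2.0, 5.0), sample_mean)
print(beta_variance(2.0, 5.0), sample_var)
```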

If we define the change dc_j in the two concentrations by

dc_j = \eta \frac{\partial R}{\partial c_j}

for j=1,2, then the change in the reward function R will be

dR = \sum_{j=1,2} dc_j \frac{\partial R}{\partial c_j}= \eta \sum_{j=1,2} \left(\frac{\partial R}{\partial c_j}\right)^2

To first order in \eta, this change in the reward is positive (or zero at a stationary point), which is what one wants when maximizing the reward.
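The update dc_j = \eta \partial R/\partial c_j can be sketched on a toy problem as follows. Everything specific here is an assumption made for illustration: the black-box model M is taken to be a fixed density on [0, 1], dist is a squared difference, the sum over \theta runs over a grid of midpoints, and the gradient with respect to c_j is taken numerically for brevity. This is not the ELBO used by BBVI proper; the point is only that M is queried for values and never differentiated.

```python
import math

N_GRID = 200
THETAS = [(i + 0.5) / N_GRID for i in range(N_GRID)]  # grid midpoints in (0, 1)


def beta_pdf(theta, c1, c2):
    """Density of Beta(c1, c2) at theta, computed through log-gamma."""
    log_norm = math.lgamma(c1) + math.lgamma(c2) - math.lgamma(c1 + c2)
    return math.exp((c1 - 1.0) * math.log(theta)
                    + (c2 - 1.0) * math.log(1.0 - theta) - log_norm)


def black_box_model(theta):
    """Hypothetical M(theta): only its values are used, never its derivative."""
    return beta_pdf(theta, 4.0, 2.0)


M_VALUES = [black_box_model(t) for t in THETAS]  # M is evaluated once per grid point


def reward(c1, c2):
    """R = -sum_theta dist[M(theta), P(theta|lambda)], dist = squared difference."""
    total = sum((m - beta_pdf(t, c1, c2)) ** 2 for t, m in zip(THETAS, M_VALUES))
    return -total / N_GRID


def reward_gradient(c1, c2, h=1e-5):
    """Central finite differences in c1 and c2; M itself is not differentiated."""
    d1 = (reward(c1 + h, c2) - reward(c1 - h, c2)) / (2.0 * h)
    d2 = (reward(c1, c2 + h) - reward(c1, c2 - h)) / (2.0 * h)
    return d1, d2


def gradient_ascent(c1=2.0, c2=2.0, eta=0.5, steps=300, floor=0.1):
    """Repeat dc_j = eta * dR/dc_j, keeping both concentrations positive."""
    for _ in range(steps):
        g1, g2 = reward_gradient(c1, c2)
        c1 = max(floor, c1 + eta * g1)
        c2 = max(floor, c2 + eta * g2)
    return c1, c2


c1_final, c2_final = gradient_ascent()
print(c1_final, c2_final, reward(2.0, 2.0), reward(c1_final, c2_final))
```

Starting from the symmetric choice (2, 2), the concentrations should drift toward values whose Beta density resembles the assumed M, and the reward should end up higher than where it started.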

Comparison of CASE1 and CASE2

In CASE1, we need to calculate the derivative of the model M with respect to the hidden parameter \theta:

\frac{d}{d\theta}M(\theta).

In CASE2, we do not need to calculate any derivatives at all of the model M. (That is why it’s called a Black Box algo). We do have to calculate the derivative of P(\theta|\lambda) with respect to c_1 and c_2, but that can be done a priori since P(\theta|\lambda) is known a priori to be the Beta distribution:

\frac{\partial}{\partial c_j}\sum_\theta dist[M(\theta), P(\theta|\lambda)]= \sum_\theta \frac{\partial\, dist}{\partial P(\theta|\lambda)} \frac{\partial P(\theta|\lambda)}{\partial c_j}.
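To make "known a priori" concrete: for the Beta distribution, the derivatives of log P(\theta|\lambda) with respect to c_1 and c_2 have closed forms involving the digamma function \psi, namely \log\theta - \psi(c_1) + \psi(c_1 + c_2) and \log(1-\theta) - \psi(c_2) + \psi(c_1 + c_2). The sketch below uses them in a score-function Monte Carlo estimate of \partial E[f(\theta)]/\partial c_j, which is the style of estimator BBVI relies on. The function names, the hand-rolled `digamma` helper, and the choice f(\theta) = \theta are assumptions for illustration; in BBVI proper, f would be \log p(x, \theta) - \log P(\theta|\lambda).

```python
import math
import random


def digamma(x):
    """Digamma psi(x) for x > 0: shift x up by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += (math.log(x) - 0.5 / x
               - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))
    return result


def beta_score(theta, c1, c2):
    """Derivatives of log Beta(theta; c1, c2) with respect to c1 and c2."""
    psi_sum = digamma(c1 + c2)
    return (math.log(theta) - digamma(c1) + psi_sum,
            math.log(1.0 - theta) - digamma(c2) + psi_sum)


def score_function_gradient(f, c1, c2, n_samples=50000, seed=0):
    """Estimate d/dc_j E[f(theta)] from samples, using only values of f."""
    rng = random.Random(seed)
    g1 = g2 = 0.0
    for _ in range(n_samples):
        theta = rng.betavariate(c1, c2)
        s1, s2 = beta_score(theta, c1, c2)
        value = f(theta)  # the black box is evaluated, never differentiated
        g1 += value * s1
        g2 += value * s2
    return g1 / n_samples, g2 / n_samples


def black_box(theta):
    """Hypothetical black-box function of theta, chosen so the answer is known."""
    return theta


g1, g2 = score_function_gradient(black_box, 2.0, 3.0)
print(g1, g2)
```

With f(\theta) = \theta the exact answer is available for comparison, since E[\theta] = c_1/(c_1 + c_2): the two derivatives are c_2/(c_1 + c_2)^2 and -c_1/(c_1 + c_2)^2, and the estimates should agree with them up to Monte Carlo noise.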

So, in conclusion, in CASE1, we try to find the value of \theta directly. In CASE2, we try to find the parameters c_1 and c_2 which describe the distribution of \theta. For a point estimate of \theta, just use the mean \langle \theta \rangle given above.
