Background

The probabilistic approach to machine learning (ML) represents uncertainty both within the learned model (by learning a probability distribution), and during model training (model parameters are treated as random variables). A probabilistic approach conveys a number of benefits:

  • the rules of probability theory are used to propagate uncertainty and to make inferences in a principled manner,
  • background knowledge is captured in prior distributions over model parameters,
  • we can learn from noisy, missing, and/or unlabeled data,
  • we can estimate confidence levels and represent ambiguity in predictions (any input may map to multiple predictions with differing probabilies).

These features allow for optimal decision making in the absence of complete information.

In spite of these benefits, much of the recent transformation in ML has been driven by deterministic feed-forward neural networks which are computationally simpler. Stochasticity is usually introduced by using neural networks to encode the parameters of standard distributions such as Gaussian (for continuous variables) or Bernoulli (for binary variables). However, we would often like to combine neural networks with richer classes of probability distributions. Graphical models provide a rich class of probability models for discrete random variables and in many cases still allow for tractable inference. Undirected graphical models (Markov random fields) are particularly appealing as they arise naturally in spatial and relational domains, they allow for direct inference of any subset of variables given any evidence (values of observed variables), and undirected discriminative models (conditional random fields) often perform better than directed discriminative models (because information only propagates forwards along directed edges).

However, undirected graphical models are typically more expensive to train as inference in undirected models can be difficult. The QuPA library mitigates these difficulties by integrating with TensorFlow and by providing simple interfaces to train models which combine neural networks with Boltzmann machines. Boltzmann machines are a particular type of discrete undirected graphical model that uses binary valued variables with up to pairwise interactions between variables. Any discrete graphical model can be converted to a Boltzmann machine by introducing extra variables.

In this document we provide background on training models with Boltzmann machine components and describe the programmer interface in detail.

Boltzmann Machines

Let \(\vc{x}\) represents a vector of \(N\) binary random variables. with components \(\vc{x}^T = [x_1, x_2, \dots, x_N]\), where \(x_n \in \{0, 1\}\)..

The Boltzmann distribution defines a probability for each possible state that \(\vc{x}\) can take using

\begin{equation} p(\vc{x} | \vc{\theta}) = \frac{1}{Z(\vc{\theta})} \exp\bigl(-E(\vc{x} | \vc{\theta})\bigr) \end{equation}

where \(E(\vc{x} | \vc{\theta})\) is an energy function parameterized by \(\vc{\theta}\), and

\begin{equation} Z(\vc{\theta}) = \sum_{\vc{x}} \exp\bigl(-E(\vc{x} | \vc{\theta})\bigr) \end{equation}

is the partition function, that ensures that \(p(\vc{x}; \vc{\theta})\) sums to 1 over all the possible states of \(\vc{x}\); that is,

\begin{equation} \sum_{\vc{x}} p(\vc{x} | \vc{\theta}) = 1. \end{equation}

Note that because of the negative sign for energy, \(E\), the states with high probability correspond to states with low energy.

Most commonly, the energy function \(E(\vc{x} | \vc{\theta})\) includes linear and pairwise interactions, but has no higher-order interactions. Any energy function can be made quadratic so this assumption does not imply any loss in generality. Quadratic energy functions may be written as [1]

[1]We adopt a convention common in machine learning where the model parameters \(\theta\) define the negative energy.
\begin{equation} E(\vc{x} | \vc{\theta}) = -\vc{x}^T \vc{Q} \vc{x} = -\sum_{i,j} x_i Q_{i,j} x_j \end{equation}

Note

The terms linear in \(x_i\) arise from the diagonal elements \(Q_{i,i}\) since \(x_i^2 = x_i\) when \(x_i \in \{0,1\}\).

Here, \(\vc{\theta}\) collectively represents the linear and pairwise parameters captured in the \(N\times N\) matrix \(\vc{Q}\). The off-diagonal entries of \(\vc{Q}\) represent the weights coupling the elements of \(\vc{x}\). If \(Q_{i,j}\) is large (a positive value with a large magnitude), then \(x_i\) and \(x_j\) are more likely to be 1 at the same time. On the other hand, the diagonal entries of \(\vc{Q}\) bias the probability of individual binary variables in \(\vc{x}\). For example, if \(Q_{i,i}\) is negative and large in magnitude, then \(x_i\) is very likely to be zero.

Restricted Boltzmann Machines

A restricted Boltzmann machine (RBM) is a special type of Boltzmann machine with a bipartite structure; see Fig. 1. An RBM defines a probability distribution over a vector of binary variables that are divided into visible (input), \(\vc{v}\), and hidden, \(\vc{h}\), variables. The hidden variables allow for more complex dependencies among visible variables and are often used to learn a stochastic generative model over a set of inputs. All visible variables connect to all hidden variables, but no variables within the same layer are linked. As we will see, this bipartite connectivity makes inference simpler and therefore parameter-learning is easier.

Two-layer neural net comprising a layer of visible units and one of hidden units. Visible units are numbered V 0 through V 3. Hidden units are labeled H 0 through H 2. There are connections between the visible and hidden units, but none between units in the same layer .

Fig. 1 Bipartite neural net, with a layer of 4 visible variables \(\vc{v}\) connected to a layer of 3 hidden variables \(\vc{h}\).

If we write \(\vc{x}=[\vc{v},\vc{h}]^T\) then the energy function of an RBM is

\begin{equation} E(\vc{x}| \vc{\theta}) = -\vc{b}^T \vc{v} - \vc{v}^T \vc{W} \vc{h} - \vc{c}^T \vc{h} \end{equation}

where \(\vc{b}\) and \(\vc{c}\) are the linear biases acting on the visible and hidden variables respectively, and \(\vc{W}\) is the weight matrix coupling visible and hidden variables. The parameters in this case are written as \(\vc{\theta} = [\vc{b},\vc{c},\vect(\vc{W})]\) where \(\vect(\vc{W})\) is the vector obtained from \(\vc{W}\) by stacking it’s columns . The number of visible and hidden variables can be set independently, and more hidden variables allow for richer interactions among visible variables. For example, images from the MNIST dataset of handwritten digits[2] have 784 pixels, so the RBMs that are training from this dataset require \(N=784\) visible variables. Good RBM models of the digits often use 500 hidden variables.

[2]http://yann.lecun.com/exdb/mnist/

Without the introduction of hidden variables, the energy function \(E(\vc{x} | \vc{\theta})\) by itself is not sufficiently flexible to give good models. The joint distribution over visible and hidden variables is given by

\begin{equation} p(\vc{v}, \vc{h} | \vc{\theta}) = \frac{p(\vc{v},\vc{h} | \vc{\theta})}{Z(\vc{\theta})}, \end{equation}

and we are interested in

\begin{equation} p(\vc{v} | \vc{\theta}) = \sum_\vc{h} p(\vc{v},\vc{h} | \vc{\theta}), \end{equation}

which we obtain by marginalizing over the hidden variables, \(\vc{h}\). As a mixture distribution, \(p(\vc{v} | \vc{\theta})\) is quite flexible.

Training

A standard training criterion used to determine the energy function is to maximize the log likelihood (LL) of the training data — or, equivalently, to minimize the negative log likelihood (NLL) of the data.

We assume that we are given \(D\) training (visible) examples \(\vc{v}^{(1)}, \cdots, \vc{v}^{(D)}\), and would like to find a setting for \(\vc{\theta}\) under which this data is highly likely. We denote the \(n^{th}\) component of the \(d^{th}\) training example as \(v_n^{(d)}\).

To find \(\vc{\theta}\), we maximize the likelihood of the training data: [3]

[3]State-of-the-art RBM results regularize \(\vc{\theta}\) using a prior \(P(\vc{\theta})\). Often \(\ln P(\vc{\theta}) \propto -\lambda \| \vc{\theta} \|^2/2\) which biases the components of \(\vc{\theta}\) towards smaller values [Hinton]. The unknown scalar parameter \(\lambda\) is set by cross validation.
  • The likelihood is \(L(\vc{\theta}) = \prod_{d=1}^D p(\vc{v}^{(d)} | \vc{\theta})\)
  • It is more convenient, computationally, to maximize the log-likelihood:
\begin{equation} LL(\vc{\theta}) = \ln L(\theta) = \frac{1}{D} \sum_{d=1}^D \ln p(\vc{v}^{(d)} |\vc{\theta}). \end{equation}

We use the gradient descent method to minimize the negative log-likelihood \(NLL(\vc{\theta})\):

  • Starting at an initial guess for \(\vc{\theta}\) (effective initializations are discussed in [Hinton]) we calculate the gradient (the direction of steepest descent) and then take a step in that direction.
  • We then iterate by taking the gradient at the new point and moving downhill again.

With a short calculation the gradient \(\vc{\nabla}_{\theta} NLL(\vc{\theta})\) is found to be

\begin{equation} \vc{\nabla}_\theta NLL(\vc{\theta}) = \mathbb{E}_{p(\vc{v},\vc{h}|\vc{\theta})} \bigl[ \vc{\nabla}_\theta E(\vc{v},\vc{h} | \vc{\theta}) \bigr] - \frac{1}{D} \sum_{d=1}^D \mathbb{E}_{P(\vc{h}|\vc{v}^{(d)},\vc{\theta})} \bigl[ \vc{\nabla}_\theta E(\vc{v}^{(d)},\vc{h} | \vc{\theta}) \bigr]. \end{equation}

To evaluate this gradient at a particular \(\vc{\theta}\), we must determine the expected values of the components of \(\vc{\nabla}_\theta E(\vc{v},\vc{h} | \vc{\theta})\) with respect to \(p(\vc{v},\vc{h} | \vc{\theta})\) and the conditional posterior distribution \(p(\vc{h}|\vc{v},\vc{\theta})\). For the bipartite RBM architecture the expectation can be determined in closed form. However, the first expectation must be approximated. Most approximations rely on some form of Monte Carlo sampling. The Persistent Contrastive Divergence (PCD) method [Tieleman] relies on short Monte Carlo chains seeded from samples obtained at the previous \(\vc{\theta}\). PCD is the most common approximation, but it can break down when small changes in \(\vc{\theta}\) cause large changes in the resultant \(p(\vc{v},\vc{h} | \vc{\theta})\) (perhaps because new modes in the distribution arise). Such sensitive changes are common during training and have been extensively studied in physics as phase transitions.

Physicists have developed improved sampling methods appropriate for phase transitions. One such method is called population annealing [Machta] and a variant of this method (QuPA) is available in this library. Details on the algorithm and some of the benefits of learning with population annealing are available in the accompanying technical report. The population annealing algorithm is an extension of PCD, and for one choice of parameters reduces to PCD. Thus, with this library training models with traditional PCD is also simple. A detailed description of the algorithm used in QuPA and benchmark results can be found in [QuPA]

The QuPA interface has been designed to make the training of probabilistic models straightforward. In the next section we provide a detailed walk through of training with QuPA.

Evaluation

Evaluating the performance of trained probabilistic models is often very difficult as it requires determination of the partition function of the model. To simplify evaluation the library also supplied routines (again based on population annealing) to estimate the logarithm of the partition function. The example provides more details.

[Hinton](1, 2) G. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol 7700. Springer, (2012).
[Tieleman]T. Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient, Proceedings of the 25th International Conference on Machine Learning, pages 1064–1071. ACM, (2008).
[Machta]J. Machta, Population annealing with weighted averages: A Monte Carlo method for rough free-energy landscapes, Phys. Rev. E 82, 026704 (2010).
[QuPA]https://quadrant.ai/s/QuPABenchmark.pdf