## Motivation

Many machine learning (ML) methods are built on probabilistic foundations, but training probabilistic models can be challenging. Training discrete probabilistic ML models is particularly difficult because algorithms cannot exploit gradient information. While TensorFlow offers convenient tools for learning with neural networks, it offers little support to users who want to incorporate discrete probabilistic models into their networks.

This TensorFlow plugin makes it simple to integrate discrete probabilistic models with neural networks by providing a convenient interface to efficient state-of-the-art discrete sampling algorithms.


## An Example: Training Boltzmann Machines

The Boltzmann machine (BM) is perhaps the best-known example of a discrete probabilistic model. BMs have been applied in many contexts for both generative and discriminative learning. A BM defines the probability of a vector of binary variables \(x\) as \(P(x) = \exp[-E(x)]/Z\), where \(Z\) is the normalization constant that ensures the probabilities of all possible \(x\) sum to 1. The energy function \(E(x)\) is often a quadratic function of \(x\).
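For a handful of variables the definition can be evaluated exactly by enumerating every state. The sketch below is an illustration only and is not part of QuPA; it assumes a quadratic energy of the form \(E(x) = -b^\top x - \tfrac{1}{2} x^\top W x\) with a symmetric, zero-diagonal coupling matrix `W`. The names `W`, `b`, `energy`, and `exact_distribution` are hypothetical.

```python
import itertools
import numpy as np

def energy(x, W, b):
    """Assumed quadratic energy E(x) = -b.x - 0.5 * x.W.x (W symmetric, zero diagonal)."""
    return -(x @ b) - 0.5 * np.sum((x @ W) * x, axis=-1)

def exact_distribution(W, b):
    """Enumerate all 2^n binary states and return (states, P(x), log Z)."""
    n = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    neg_e = -energy(states, W, b)
    # log-sum-exp for a numerically stable log Z
    m = neg_e.max()
    log_z = m + np.log(np.exp(neg_e - m).sum())
    probs = np.exp(neg_e - log_z)
    return states, probs, log_z

# Toy 3-variable model (values chosen arbitrarily for illustration)
W = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 0.25],
              [-0.5, 0.25, 0.0]])
b = np.array([0.1, -0.2, 0.3])
states, probs, log_z = exact_distribution(W, b)
```

The cost of enumeration doubles with every added variable, which is why \(Z\) and expectations under \(P(x)\) have to be estimated by sampling for models of practical size.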

Training BMs (or models having BM components) requires expected values of functions of \(x\) under \(P(x)\). These expected values can be hard to approximate, so Markov chain Monte Carlo (MCMC) methods such as Gibbs sampling are commonly employed. Contrastive divergence (CD), which is based on Gibbs sampling, is the fastest (and crudest) method and has been used to train many ML models.
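As a rough illustration of how Gibbs sampling approximates such expectations, the sketch below runs many independent chains and averages the final states to estimate \(\mathbb{E}[x]\). It is not QuPA code. It assumes the energy \(E(x) = -b^\top x - \tfrac{1}{2} x^\top W x\) with symmetric, zero-diagonal `W`, for which the conditional probability of a single variable is \(P(x_i = 1 \mid x_{-i}) = \sigma(b_i + \sum_j W_{ij} x_j)\). All function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(x, W, b, rng):
    """One sweep of single-site Gibbs updates, applied in place to a batch of chains.

    x has shape (num_chains, n); W is assumed symmetric with zero diagonal.
    """
    for i in range(x.shape[1]):
        p_on = sigmoid(b[i] + x @ W[i])        # P(x_i = 1 | all other variables)
        x[:, i] = rng.random(x.shape[0]) < p_on
    return x

def estimate_mean(W, b, num_chains=2000, num_sweeps=50, seed=0):
    """Monte Carlo estimate of E[x] from the final state of each chain."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(num_chains, len(b))).astype(float)
    for _ in range(num_sweeps):
        gibbs_sweep(x, W, b, rng)
    return x.mean(axis=0)
```

How many sweeps are needed before the chains are close to \(P(x)\) depends on the model; strongly coupled models can mix very slowly, which is the difficulty the text describes.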

Compounding the difficulty of sampling, the distribution \(P(x)\) changes as the model parameters are updated. Fortunately, these changes are incremental, so work done for a previous distribution can be leveraged for the updated one. Persistent contrastive divergence (PCD) exploits this observation and in many cases improves upon CD. Nevertheless, in many cases both CD and PCD are known to fail and cause instabilities during training. In fact, it is these difficulties that have led to a decline in the use of BMs as ML models. There are more sophisticated sampling algorithms that improve upon CD and PCD, but they can be difficult for non-experts to implement.

Most BMs used in machine learning have a particular restricted connectivity between variables which allows for certain simplifications during training. In what follows we consider training these restricted BMs (RBMs).
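The simplification can be made concrete. In an RBM the variables split into visible units \(v\) and hidden units \(h\) with no connections inside either group, so each group can be sampled in one vectorized step given the other. The sketch below shows this block Gibbs sampling together with one PCD parameter update. It is a plain NumPy illustration, not QuPA's implementation, and it assumes the energy \(E(v, h) = -a^\top v - c^\top h - v^\top W h\); the names `a`, `c`, `W`, and `pcd_step` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h(v, W, c, rng):
    """Sample all hidden units at once given the visibles; also return P(h=1|v)."""
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v(h, W, a, rng):
    """Sample all visible units at once given the hiddens; also return P(v=1|h)."""
    p = sigmoid(a + h @ W.T)
    return (rng.random(p.shape) < p).astype(float), p

def pcd_step(data, chains, W, a, c, rng, lr=0.05, k=1):
    """One PCD update. Returns new (W, a, c) and the advanced persistent chains."""
    # Positive phase: statistics under the data
    _, ph_data = sample_h(data, W, c, rng)
    # Negative phase: continue the persistent chains for k block Gibbs steps
    v = chains
    for _ in range(k):
        h, _ = sample_h(v, W, c, rng)
        v, _ = sample_v(h, W, a, rng)
    _, ph_model = sample_h(v, W, c, rng)
    # Stochastic estimate of the log-likelihood gradient
    dW = data.T @ ph_data / len(data) - v.T @ ph_model / len(v)
    da = data.mean(axis=0) - v.mean(axis=0)
    dc = ph_data.mean(axis=0) - ph_model.mean(axis=0)
    return W + lr * dW, a + lr * da, c + lr * dc, v
```

Passing the returned chains back in on the next call gives PCD; passing `chains=data` on every call instead restarts the negative phase from the data, which corresponds to CD with `k` steps.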

### Enter QuPA

QuPA (QUadrant Population Annealing) is a library that implements a variant of population annealing that adaptively tunes temperatures. The library integrates seamlessly into TensorFlow, which makes it simple to construct models that combine RBMs and neural networks. QuPA is designed to exploit GPU parallelism and currently requires a GPU.

The workflow to train a restricted BM is straightforward (further details can be found here):

- Initialize a PopulationAnnealer object based on the structure of the graphical model.
- Define the negative log likelihood as the loss to be minimized. For BMs this is a combination of the free energy and a contribution from the logarithm of the normalization constant, \(\log Z\). The \(\log Z\) contribution is the only complicating component, and it is the part that QuPA estimates by sampling. The \(\log Z\) component automatically defines appropriate gradients for training with TensorFlow optimizers.
- After the model is trained, evaluate the log likelihood on test data by estimating \(\log Z\) again. The estimator used at evaluation time gives a more accurate estimate of \(\log Z\) at additional computational cost.
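To make the structure of this loss concrete, the sketch below computes the negative log likelihood of a tiny RBM as the mean free energy of the data plus \(\log Z\). It does not use QuPA: \(\log Z\) is obtained here by brute-force enumeration, which is only feasible for a few visible units, whereas in the workflow above it would come from QuPA's sampler. The sketch assumes the RBM energy \(E(v, h) = -a^\top v - c^\top h - v^\top W h\), and all function names are hypothetical.

```python
import itertools
import numpy as np

def free_energy(v, W, a, c):
    """F(v) = -a.v - sum_j softplus(c_j + (v W)_j), so that P(v) = exp(-F(v)) / Z."""
    return -(v @ a) - np.logaddexp(0.0, c + v @ W).sum(axis=-1)

def exact_log_z(W, a, c):
    """Brute-force log Z by summing exp(-F(v)) over all visible states."""
    n = len(a)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    neg_f = -free_energy(states, W, a, c)
    m = neg_f.max()
    return m + np.log(np.exp(neg_f - m).sum())

def negative_log_likelihood(data, W, a, c, log_z):
    """Average of -log P(v) over the rows of data: mean free energy plus log Z."""
    return free_energy(data, W, a, c).mean() + log_z
```

Only the `log_z` argument depends on an intractable sum; the free-energy term is an ordinary differentiable function of the parameters.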

The population annealing algorithm used within QuPA reduces to PCD for one setting of parameters, and a PCD implementation is available as part of QuPA. However, in most cases population annealing gives better results than PCD for the same amount of run time. Benchmarking results can be found here.
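For readers unfamiliar with population annealing, the following is a minimal textbook-style sketch for estimating \(\log Z\) of a small RBM. It is not QuPA's algorithm or API: it uses a fixed linear temperature schedule rather than adaptive tuning, runs on the CPU with NumPy, and assumes the RBM energy \(E(v, h) = -a^\top v - c^\top h - v^\top W h\). A population of replicas starts at inverse temperature \(\beta = 0\), where sampling is trivial, and is reweighted, resampled, and refreshed with a block Gibbs sweep at each step toward \(\beta = 1\). All names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v, W, a, c, beta):
    """Free energy of the visibles at inverse temperature beta (hiddens summed out)."""
    return -beta * (v @ a) - np.logaddexp(0.0, beta * (c + v @ W)).sum(axis=-1)

def population_annealing_log_z(W, a, c, pop_size=2000, num_temps=50, seed=0):
    """Estimate log Z at beta = 1 with a fixed linear schedule (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, m = W.shape
    betas = np.linspace(0.0, 1.0, num_temps + 1)
    # At beta = 0 the distribution is uniform and Z = 2^(n + m) exactly
    v = rng.integers(0, 2, size=(pop_size, n)).astype(float)
    log_z = (n + m) * np.log(2.0)
    for beta_old, beta_new in zip(betas[:-1], betas[1:]):
        # Importance weights for moving the population from beta_old to beta_new
        log_w = free_energy(v, W, a, c, beta_old) - free_energy(v, W, a, c, beta_new)
        shift = log_w.max()
        w = np.exp(log_w - shift)
        log_z += shift + np.log(w.mean())      # accumulates log(Z_new / Z_old)
        # Resample replicas in proportion to their weights
        v = v[rng.choice(pop_size, size=pop_size, p=w / w.sum())]
        # Refresh the population with one block Gibbs sweep at beta_new
        ph = sigmoid(beta_new * (c + v @ W))
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(beta_new * (a + h @ W.T))
        v = (rng.random(pv.shape) < pv).astype(float)
    return log_z
```

The accuracy of such an estimate depends on the population size, the number of temperatures, and the amount of Gibbs sampling per temperature, all of which trade accuracy against run time.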

## Origins

QuPA grew out of work by the Quadrant team at D-Wave Systems. D-Wave builds quantum hardware and systems that accelerate the sampling operations required to train BMs. This population annealing code was written to benchmark quantum hardware against the best-performing classical algorithm. QuPA wraps the population annealing code, making it easy to call in ML workflows using TensorFlow.