
Benchmarks for Mass Matrix Adaptation

I was lucky enough to be invited to attend the Gradient Retreat earlier this month. It was an entire week on a beautiful island with some amazingly intelligent Bayesians, and no demands on my time other than the self-set (and admittedly vague) goal of contributing to probabilistic programming in some way.

I initially tried to implement mass matrix adaptation in TensorFlow Probability, but I quickly readjusted my goals to something more achievable: running some benchmarks of tuning in Hamiltonian Monte Carlo (HMC).

*Pictures from Galiano Island: a forest, and the view from a bluff.*

A quick rundown for those unfamiliar: tuning is what happens before sampling, during which the goal is not to actually draw samples, but to prepare to draw samples¹. For HMC and its variants, this means estimating HMC parameters such as the step size, integration time and mass matrix², the last of which is basically the covariance matrix of the model parameters. Because my life is finite (and I assume everybody else’s is too), I limited myself to mass matrix adaptation.

(If you’re still uncertain about the details of tuning or mass matrix adaptation, check out Colin Carroll’s essay on HMC tuning or the Stan reference manual on HMC parameters: I don’t explain many more concepts in the rest of this post.)
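To make the “mass matrix is basically a covariance matrix” intuition concrete, here is a minimal NumPy sketch of the core estimation step, assuming we already have a window of tuning draws. This is a toy illustration, not any sampler’s actual adaptation code:

```python
import numpy as np

def estimate_mass_matrix(tuning_draws, diagonal=True):
    """Estimate the posterior covariance from tuning-phase draws.

    Samplers use this estimate (or its inverse) as the mass matrix.
    tuning_draws: array of shape (n_draws, n_params).
    If `diagonal` is True, keep only per-parameter variances,
    i.e. assume all cross-correlations are zero.
    """
    if diagonal:
        return np.diag(np.var(tuning_draws, axis=0))
    return np.cov(tuning_draws, rowvar=False)
```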

The interesting thing about tuning is that there are no rules: there are no asymptotic guarantees we can rely on and no mathematical results to which we can turn for enlightened inspiration. The only thing we care about is obtaining a decent estimate of the mass matrix, and preferably quickly.

Accompanying this lack of understanding of mass matrix adaptation is a commensurate lack of (apparent) scientific inquiry: there is scant literature to look to, and for open source developers, there is little prior art to draw from when writing new implementations of HMC!

So I decided to do some empirical legwork and benchmark various methods of mass matrix adaptation. Here are the questions I was interested in answering:

  1. Is assuming that the mass matrix is diagonal (in other words, that all parameters are uncorrelated) a good assumption to make? What are the implications of this assumption for the tuning time and for the number of effective samples per second?
  2. Does the tuning schedule (i.e. the sizes of the adaptation windows) make a big difference? Specifically, should we have a schedule of constant adaptation windows, or an “expanding schedule” of exponentially growing adaptation windows? (See the sketch after this list.)
  3. Besides assuming the mass matrix is diagonal, are there any other ways of simplifying mass matrix adaptation? For example, could we approximate the mass matrix as low rank?
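For question 2, the “expanding schedule” refers to Stan-style windowed adaptation, where each successive window is twice as long as the previous one. Here is a sketch of that windowing logic; the buffer sizes are illustrative defaults, not necessarily the ones used in these benchmarks:

```python
def expanding_windows(n_tune, init_buffer=75, term_buffer=50, base_window=25):
    """Split `n_tune` tuning steps into doubling adaptation windows.

    The first `init_buffer` and last `term_buffer` steps adapt only the
    step size. In between, the mass matrix is re-estimated at the end of
    each window, and each window is twice as long as the previous one.
    """
    windows, start, width = [], init_buffer, base_window
    while start + width < n_tune - term_buffer:
        windows.append((start, start + width))
        start += width
        width *= 2
    windows.append((start, n_tune - term_buffer))
    return windows

print(expanding_windows(1000))
# [(75, 100), (100, 150), (150, 250), (250, 450), (450, 850), (850, 950)]
```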

I benchmarked five different mass matrix adaptation methods (a PyMC3 sketch of the first two follows the list):

  1. A diagonal mass matrix (diag)
  2. A full (a.k.a. dense) mass matrix (full)
  3. A diagonal mass matrix adapted on an expanding schedule (diag_exp)
  4. A full mass matrix adapted on an expanding schedule (full_exp)
  5. A low-rank approximation to the mass matrix, using Adrian Seyboldt’s covadapt library (covadapt)
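The diag and full variants correspond to initialization strategies that PyMC3 exposes directly through pm.sample. A minimal sketch (the model is a placeholder, and `init="adapt_full"` assumes a sufficiently recent PyMC3 release):

```python
import pymc3 as pm

with pm.Model():
    x = pm.Normal("x", mu=0.0, sigma=1.0, shape=10)  # placeholder model

    # Diagonal mass matrix adaptation:
    trace_diag = pm.sample(init="adapt_diag", tune=1000, draws=1000)

    # Full (dense) mass matrix adaptation:
    trace_full = pm.sample(init="adapt_full", tune=1000, draws=1000)
```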

I benchmarked these adaptation methods on six models:

  1. A 100-dimensional multivariate normal with a non-diagonal covariance matrix (mvnormal; sketched after this list)
  2. A 100-dimensional multivariate normal with a low-rank covariance matrix (lrnormal)
  3. A stochastic volatility model (stoch_vol)
  4. The eight schools model (eight)
  5. The PyMC3 baseball model (baseball)
  6. A sparse Gaussian process approximation (gp)
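As an example of what these benchmark models look like, here is a sketch of the mvnormal model in PyMC3. The covariance construction is illustrative; the exact matrices used in the benchmarks live in the notebook:

```python
import numpy as np
import pymc3 as pm

dim = 100
rng = np.random.RandomState(0)

# Build an arbitrary non-diagonal (but well-conditioned) covariance matrix:
A = rng.randn(dim, dim)
cov = A @ A.T + dim * np.eye(dim)

with pm.Model():
    pm.MvNormal("x", mu=np.zeros(dim), cov=cov, shape=dim)
    trace = pm.sample(tune=1000, draws=1000)
```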

Without further ado, the main results are shown below. Afterwards, I make some general observations on the benchmarks, and finally I describe various shortcomings of my experimental setup (which, if I were more optimistic, I would call “directions for further work”).

Tuning Times

This tabulates the tuning time, in seconds, of each adaptation method for each model. Lower is better. The lowest tuning time for each model is shown in bold italics.

|          | mvnormal | lrnormal | stoch_vol | gp | eight | baseball |
|----------|----------|----------|-----------|----|-------|----------|
| diag     | 365.34 | 340.10 | 239.59 | 18.47 | 2.92 | 5.32 |
| full     | ***8.29*** | 364.07 | 904.95 | ***14.24*** | ***2.91*** | ***4.93*** |
| diag_exp | 358.50 | 360.91 | ***219.65*** | 16.25 | 3.05 | 5.08 |
| full_exp | 8.46 | 142.20 | 686.58 | 14.87 | 3.21 | 6.04 |
| covadapt | 386.13 | ***89.92*** | 398.08 | N/A | N/A | N/A |

Effective Samples per Second

This tabulates the number of effective samples per second drawn by each adaptation method for each model. Higher is better. The highest number of effective samples per second for each model is shown in bold italics.

|          | mvnormal | lrnormal | stoch_vol | gp | eight | baseball |
|----------|----------|----------|-----------|----|-------|----------|
| diag     | 0.02 | 1.55 | ***11.22*** | 65.36 | 761.82 | 455.23 |
| full     | 1.73 | 0.01 | 6.71 | ***106.30*** | ***840.77*** | ***495.93*** |
| diag_exp | 0.02 | 1.51 | 9.79 | 59.89 | 640.90 | 336.71 |
| full_exp | ***1,799.11*** | ***1,753.65*** | 0.16 | 101.99 | 618.28 | 360.14 |
| covadapt | 0.02 | 693.87 | 5.71 | N/A | N/A | N/A |
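For reference, “effective samples per second” is just the effective sample size divided by the wall-clock time spent drawing (not tuning). One way to compute it with ArviZ; the variable names are illustrative:

```python
import arviz as az

# `model`, `trace`, and `sampling_time` come from the benchmark run.
with model:
    ess = az.ess(az.from_pymc3(trace))  # bulk effective sample size per variable
ess_per_second = ess / sampling_time
```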

Observations

tl;dr: As is typical with these sorts of things, no one adaptation method uniformly outperforms the others.

Experimental Setup

Shortcomings


  1. It’s worth pointing out that mass matrix adaptation is meant to make sampling more efficient, not more valid. Theoretically, any mass matrix would work, but a good one (i.e. a good estimate of the covariance matrix of the model parameters) can sample orders of magnitude more efficiently. ↩︎

  2. *…uh, sweats and looks around nervously for differential geometers…* more formally called the metric. ↩︎

  3. There are some violin plots lying around in the notebook, a relic from a time when I thought that I would have the patience to run each model and adaptation method multiple times. ↩︎

#Open-Source #Pymc