PyMC 3.7: Making Data a First-Class Citizen
Posted by Chris Fonnesbeck on behalf of the PyMC development team
This week featured the release of PyMC 3.7, which includes a slew of bug fixes and enhancements to help make building and fitting Bayesian models easier and more robust than ever. No fewer than 43 developers committed changes that became part of this release, so a big thanks goes out to all of them for their contributions.
The number of new features in 3.7 is modest, but one in particular merits highlighting. Juan Martin Loyola generalized the Minibatch
class, a wrapper for datasets that allows for stochastic gradient calculations in variational inference, to more formally integrate datasets into model specification. Specifically, the Data
class is a container for data that endows it with many of the attributes of other PyMC3 variable objects. What this does is formally incorporate your data into the model graph, as a deterministic node.
Behind the scenes, Data
converts data arrays into Theano `shared` variables which. A shared variable, as the name implies, is a variable whose value is shared between functions in which it is used. This allows the values to be changed between model runs, without having to re-specify the model itself. Note that this is exactly how Minibatch
works, though for Minibatch
the values are changed within a run of variational inference fitting. There are two general scenarios where using the Data
wrapper will be helpful: making predictions on a new subset of data after model fitting, and running the same model on multiple datasets.
Let’s look at a quick example to see how it works.
Here’s a simple logistic regression model for analyzing baseball data that tries to estimate the probability that a batter swings and misses at a curveball as a function of the rate at which the ball spins (it is thought that more effective pitches are related to higher spin rates).
curveball_data.head()
We can wrap the spin rate column in Data
and give it a name (let's call it "spin") and do the same for the binary outcome ("miss"). We then reference these objects in the model where we otherwise would have passed NumPy array
or pandas Series
objects.
Notice the positive estimate for β[1], which suggests higher spin rates lead to more swinging strikes.
>>> az.summary(trace, var_names=['β'])
Now, let’s say we have pitchers with known spin rates on their curve balls. We can use sample_posterior_predictive
to predict the associated miss rate, but first we use set_data
within the model context to swap in the new data. Nothing else needs to change!
So, it looks like increasing spin rate from 1810 rpm to 3015 rpm could result in a miss rate 10 percentage points higher!
>>> post_pred['miss'].mean(0)array([0.379, 0.428, 0.481])
Additionally, set_data
can be used to pass batches of independent data to models for sequential fitting. Let's extend the above example to the case where there are multiple years of data:
We might then want separate, annual estimates of curveball effectiveness as a function of spin. This is a matter of looping over the years, and passing the subsets to the model prior to each fit.
Pretty slick.
There are numerous other smaller changes in PyMC 3.7. Of note, following much discussion the sd
argument has been renamed to sigma
for scale parameters. There is conflicting opinion regarding the use of Greek variable names for distribution parameters, but at least our convention is now consistent, in that we use mu
and sigma
rather than mu
and sd
(though sd
will still be accepted in order to retain backward-compatibility).
Another nice enhancement is a fix to the Mixture
class that allows it to work properly with multidimensional or multivariate distributions. In previous releases, model fitting would often work, but any sort of predictive sampling would fail.
If you are a fan of the GLM submodule, you will be happy to learn that it is now a little more flexible. Any model strings passed to from_formula
can now extract variables from calling scope. For example:
Prior to this release, all of the data had to be shoehorned into the DataFrame
passed to from_formula
.
A complete list of changes can be found on our GitHub repository. The 3.7 release is recommended for all users. You can install it now, either via pip
:
pip install -U pymc3
or conda
, if you are using Anaconda's Python distribtuion:
conda install -c conda-forge pymc3
Thanks once again to the entire PyMC3 development team, and to diligent users who reported (and sometimes fixed) bugs. Happy sampling!