Friday, April 15, 2022

Probabilistic Graphical Models

Deep learning and neural networks get a lot of (deserved) attention, but there is another class of ML models called Probabilistic Graphical Models (PGMs) that can also be used for inference and prediction. They have applications in fields such as medical diagnosis, image understanding, and speech recognition. Think decision making based on incomplete or insufficient knowledge.

More formally, PGMs use graphs to encode joint probability distributions as opposed to the more traditional ML approach of learning a function that directly maps input to a target variable. This post isn't a technical introduction though. Rather, it is more of an introduction-by-example and a summary of pgmpy's excellent notebooks.

Given a simple graph for flower type:

Our two approaches would look something like this:
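As a rough sketch, assume the graph puts the species node as the parent of two discretized petal measurements; the details below are illustrative rather than the notebooks' exact setup, and the class names follow recent pgmpy releases (BayesianNetwork was previously called BayesianModel).

```python
# Illustrative only: "learn a function" vs "encode a joint distribution" on the
# Iris data, with the petal measurements discretized into 3 bins each.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from pgmpy.models import BayesianNetwork

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Approach 1: traditional ML -- learn a function mapping features to the target.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X.head(3)))

# Approach 2: PGM -- encode P(species, petal_length, petal_width) with a graph
# whose conditional probability tables are estimated from the data.
df = pd.DataFrame({
    "petal_length": pd.cut(X["petal length (cm)"], bins=3, labels=False),
    "petal_width": pd.cut(X["petal width (cm)"], bins=3, labels=False),
    "species": y,
})
model = BayesianNetwork([("species", "petal_length"), ("species", "petal_width")])
model.fit(df)  # maximum likelihood estimation by default
print(model.get_cpds("species"))
```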

Bayesian networks

In this section I'll use a more complex graph for student grades:

For problems with many features and/or high-cardinality features, inference is difficult because the size of the joint probability distribution grows exponentially with the number of variables. PGMs can represent it compactly by exploiting conditional independence, and they give us efficient methods for doing inference over these joint distributions.

In this graph every node has a cardinality of 2 except Grade, which has 3. The full joint distribution would require storing 48 values (2*2*3*2*2) while the PGM only requires 26 (see notebook 1 for details).

This is what's known as a Bayesian network, which is always represented as a directed acyclic graph. Each node is parameterized by a conditional probability distribution (CPD) of the form P(node|parents). For example, the Grade node has the CPD P(G|D,I). Bayesian networks are used when you want to represent causal relationships between random variables. Naive Bayes is a special case where the features are assumed to be conditionally independent of each other given the target variable, so each feature node connects only to the target.
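As a sketch, here is the student network built in pgmpy with illustrative CPD values (the 26 stored values mentioned above are exactly the entries in these tables):

```python
# The student network with illustrative CPD values; each column sums to 1.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD

model = BayesianNetwork([
    ("Difficulty", "Grade"),
    ("Intelligence", "Grade"),
    ("Intelligence", "SAT"),
    ("Grade", "Letter"),
])

cpd_d = TabularCPD("Difficulty", 2, [[0.6], [0.4]])
cpd_i = TabularCPD("Intelligence", 2, [[0.7], [0.3]])
cpd_g = TabularCPD("Grade", 3,
                   # one column per (Difficulty, Intelligence) combination
                   [[0.30, 0.05, 0.90, 0.50],
                    [0.40, 0.25, 0.08, 0.30],
                    [0.30, 0.70, 0.02, 0.20]],
                   evidence=["Difficulty", "Intelligence"], evidence_card=[2, 2])
cpd_s = TabularCPD("SAT", 2, [[0.95, 0.20],
                              [0.05, 0.80]],
                   evidence=["Intelligence"], evidence_card=[2])
cpd_l = TabularCPD("Letter", 2, [[0.90, 0.60, 0.01],
                                 [0.10, 0.40, 0.99]],
                   evidence=["Grade"], evidence_card=[3])

model.add_cpds(cpd_d, cpd_i, cpd_g, cpd_s, cpd_l)
assert model.check_model()  # 2 + 2 + 12 + 4 + 6 = 26 stored values
```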

Given tabular data and a graph structure, CPDs can be estimated using Maximum Likelihood Estimation (MLE). It's similar to what was done with the Iris data in the first code block above. It's also fragile because it depends heavily on the amount and quality of the observed data (see notebook 10 for details), which explains why that code breaks with some random seeds.

A better solution is Bayesian Parameter Estimation. There you start with CPDs based on your prior beliefs (or uniform priors) and update them based on the observed data.
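Both estimators are available in pgmpy; a sketch on a made-up toy data frame (values chosen only to exercise the API) might look like this:

```python
# Toy data just to show the two estimators' APIs; the values are arbitrary.
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator

data = pd.DataFrame({
    "Difficulty":   [0, 1, 0, 1, 0, 0],
    "Intelligence": [1, 0, 0, 1, 1, 0],
    "Grade":        [0, 2, 1, 0, 0, 2],
    "SAT":          [1, 0, 0, 1, 1, 0],
    "Letter":       [1, 0, 0, 1, 1, 0],
})
model = BayesianNetwork([("Difficulty", "Grade"), ("Intelligence", "Grade"),
                         ("Intelligence", "SAT"), ("Grade", "Letter")])

# MLE turns observed frequencies straight into probabilities, so sparse or
# unlucky data produces brittle estimates.
print(MaximumLikelihoodEstimator(model, data).estimate_cpd("Grade"))

# Bayesian estimation starts from a prior (here a BDeu prior acting like 10
# imaginary samples) and updates it with the observed counts.
print(BayesianEstimator(model, data).estimate_cpd(
    "Grade", prior_type="BDeu", equivalent_sample_size=10))
```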

One method of exact inference in PGMs is variable elimination. By summing out variables one at a time, it avoids computing the entire joint probability distribution (see notebooks 2 and 5 for details). For larger graphs there are approximate algorithms, because an exact solution would be intractable.
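As a sketch, using the student model defined earlier (with its CPDs attached):

```python
# Exact inference with variable elimination.
from pgmpy.inference import VariableElimination

infer = VariableElimination(model)

# Posterior distribution over Grade for an intelligent student in a hard class
# (the state indices are illustrative: 1 = hard, 1 = intelligent).
print(infer.query(variables=["Grade"],
                  evidence={"Difficulty": 1, "Intelligence": 1}))
```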

Making predictions is similar. Instead of getting a distribution, we get the most probable state.
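Continuing the sketch, a MAP query on the same model returns a single assignment rather than a distribution:

```python
# Prediction as a MAP query: the single most probable Grade given the evidence.
from pgmpy.inference import VariableElimination

infer = VariableElimination(model)  # same student model as above
print(infer.map_query(variables=["Grade"],
                      evidence={"Difficulty": 1, "Intelligence": 1}))
# returns something like {'Grade': 0}
```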

Markov networks

Markov networks are represented by undirected graphs and represent non-causal relationships. They can, however, express dependencies that a Bayesian model can't, like cycles and bi-directional dependencies. Factors describe the affinity between connected variables, i.e. how strongly two nodes agree with each other. The joint probability distribution is the product of all factors (normalized by a constant, the partition function).
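A sketch of a tiny Markov network in pgmpy (the MarkovNetwork class was called MarkovModel in older releases), with made-up factor values that reward agreement between connected nodes:

```python
# A tiny Markov network with a cycle, which a DAG can't represent directly.
from functools import reduce
from pgmpy.models import MarkovNetwork
from pgmpy.factors.discrete import DiscreteFactor

mn = MarkovNetwork([("A", "B"), ("B", "C"), ("C", "A")])

# Each factor gives a high affinity (10) when the two variables agree and a
# low one (1) when they disagree; the numbers are made up.
for u, v in mn.edges():
    mn.add_factors(DiscreteFactor([u, v], cardinality=[2, 2],
                                  values=[10, 1, 1, 10]))

# The joint distribution is the product of all factors, normalized by the
# partition function.
joint = reduce(lambda f, g: f * g, mn.get_factors())
joint.normalize()
print(joint)
```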

A quick note because the names sound similar: Markov chains are not PGMs, because the nodes are not random variables. They can, however, be represented as Bayesian networks, which makes the PGM algorithms available.
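For example, a three-step chain written as a Bayesian network with made-up transition probabilities:

```python
# A Markov chain unrolled into a Bayesian network: each time step becomes a
# random variable, sharing the same (toy) transition table.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD

chain = BayesianNetwork([("X1", "X2"), ("X2", "X3")])
transition = [[0.7, 0.4],   # P(next = 0 | current)
              [0.3, 0.6]]   # P(next = 1 | current)
chain.add_cpds(
    TabularCPD("X1", 2, [[0.5], [0.5]]),
    TabularCPD("X2", 2, transition, evidence=["X1"], evidence_card=[2]),
    TabularCPD("X3", 2, transition, evidence=["X2"], evidence_card=[2]),
)
assert chain.check_model()
```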

Sampling

Sampling algorithms approximate exact inference by generating a large number of samples whose distribution converges to the target distribution. One of these is Hamiltonian Monte Carlo, a Markov Chain Monte Carlo (MCMC) algorithm that proposes future states in the Markov chain using Hamiltonian dynamics from physics (see notebook 8 for details). Other MCMC algorithms you may encounter are Metropolis-Hastings and Gibbs sampling. See Monte Carlo Approximation Methods: Which one should you choose and when? for a comparison of these methods.
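To make the general MCMC idea concrete, here is a bare-bones Metropolis-Hastings sketch in plain NumPy (not HMC and not pgmpy's implementation) that samples from a standard normal using only its unnormalized density:

```python
# Bare-bones Metropolis-Hastings: propose a nearby state, accept it with a
# probability based on how likely it is relative to the current state.
import numpy as np

rng = np.random.default_rng(0)

def unnormalized_p(x):
    # Target density known only up to a constant (here a standard normal).
    return np.exp(-0.5 * x**2)

samples, x = [], 0.0
for _ in range(10_000):
    proposal = x + rng.normal(scale=1.0)          # symmetric random-walk proposal
    accept = unnormalized_p(proposal) / unnormalized_p(x)
    if rng.random() < accept:                     # accept with prob min(1, ratio)
        x = proposal
    samples.append(x)

print(np.mean(samples), np.std(samples))  # should approach 0 and 1
```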

Other interesting finds that fit in at this point are the PyMC3 library and the open source book Probabilistic Programming and Bayesian Methods for Hackers.

I also think this is a nice writeup on Bayesian Logistic Regression using Pyro, another probabilistic programming library, and MCMC.

Learning networks

Learning a Bayesian network's structure can be framed as an optimization problem: score candidate networks on how well they fit a data set and search through the space of possible models. For non-trivial graphs where an exhaustive search is not possible, hill climbing can be used (see notebook 11 for details).
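A sketch using pgmpy's HillClimbSearch with a BIC score on synthetic data (generated here only to give the search something to find):

```python
# Structure learning by hill climbing with a BIC score, on synthetic data where
# B is a noisy copy of A and C is a noisy copy of B.
import numpy as np
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 1000)
b = (a + (rng.random(1000) < 0.3)) % 2  # flip A 30% of the time
c = (b + (rng.random(1000) < 0.3)) % 2  # flip B 30% of the time
data = pd.DataFrame({"A": a, "B": b, "C": c})

hc = HillClimbSearch(data)
best = hc.estimate(scoring_method=BicScore(data))
print(best.edges())  # should connect A-B and B-C (edge directions aren't
                     # identifiable from observational data alone)
```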

Wrap-up

I've only scratched the surface here, but I think it's a more intuitive introduction to the topic than most of the material in this space. We could build up to more complex graphs and problems from here.


And to bring things full circle on where PGMs fit in to the ML landscape, here is an opinion from well-known ML researcher Ian Goodfellow:
The two aren’t mutually exclusive. Most applications of neural nets can be considered graphical models that use neural nets to provide some of the conditional probability distributions. You could argue that the graphical model perspective is growing less useful because so many recent neural models have such simple graph structure. These graphs are not very structured compared to neural models that were popular a few years ago like … But there are some recent models that make a little bit of use of graph structure, like VAEs with auxiliary variables.

Plus a tweet from the Stanford NLP group:

Thus it would seem that knowing these concepts will continue to be useful even if we don't use PGMs directly or focus on them exclusively.