Sunday, October 23, 2022

Java Updates Versions 9-19

The past few years I haven't written much Java code, and when I did, it was Java 8. Many projects, it seems, have stuck with Java 8, which was released back in 2014. Per the roadmap, Java 8 is designated as LTS, but so are Java 11 and Java 17. In fact, Java 19 became available last month, and many interesting features have been introduced in the past 8 years. This post is an overview of what's changed: the highlights, in my opinion, so we're up to date. I think it's enough to be interesting but not so much that it can't be picked up quickly if you have experience with older Java versions.

First, some naming conventions. Java EE is now Jakarta EE. Definitely don't call it J2EE anymore. And since Java 11, Oracle JDK and OpenJDK are essentially the same.

Tooling Updates

I won't go into too much detail here. If we are interested in using any of these, they are explained and documented well elsewhere. For awareness:

API Updates

We want to start incorporating these into our code where applicable, so I created examples to help get used to some of these updates.

Private interface methods (9). Helps to encapsulate code in default methods and create more reusable code.
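
A small sketch (the Greeter interface and its methods are made up for illustration):

```java
interface Greeter {
    default String greetMorning(String name) {
        return greet("Good morning", name);
    }

    default String greetEvening(String name) {
        return greet("Good evening", name);
    }

    // Private interface methods (Java 9+) let default methods share
    // helper code without exposing it as part of the public API.
    private String greet(String greeting, String name) {
        return greeting + ", " + name + "!";
    }
}
```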

Local variables can have inferred types (var), including for lambda parameters, to reduce the verbosity of code (10 and 11).
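
For example (a toy method, nothing special about the types chosen):

```java
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class VarExample {
    void demo() {
        // Java 10: the compiler infers Map<String, List<Integer>> for us.
        var scores = Map.of("alice", List.of(90, 85), "bob", List.of(70));

        // Java 11: var is also allowed for lambda parameters, which is
        // mostly useful when you want to attach annotations to them.
        BiFunction<Integer, Integer, Integer> add = (var a, var b) -> a + b;

        System.out.println(scores.keySet() + " " + add.apply(2, 3));
    }
}
```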

Switch expressions to simplify code and prepare for pattern matching in the future (12).
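
Something like this (switch expressions previewed in 12 and became standard in 14; Day is a made-up enum):

```java
class SwitchExample {
    enum Day { MON, TUE, WED, THU, FRI, SAT, SUN }

    static String kind(Day day) {
        // The switch is an expression with arrow labels: no fall-through,
        // no break, and the compiler checks that every case is covered.
        return switch (day) {
            case SAT, SUN -> "weekend";
            case MON, TUE, WED, THU, FRI -> "weekday";
        };
    }
}
```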

Text blocks as a way to simplify code with multi-line strings (13).
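
A quick sketch (text blocks previewed in 13 and became standard in 15):

```java
class TextBlockExample {
    // A text block keeps embedded JSON readable: no escaped quotes
    // and no string concatenation across lines.
    static final String JSON = """
            {
              "name": "duke",
              "language": "java"
            }
            """;
}
```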

Simple pattern matching (14). I was introduced to pattern matching when programming in Scala and this seems to continue a trend of Scala features making their way, in some form, to Java. It looks like more is coming in terms of pattern matching options.
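
For instance, pattern matching for instanceof (preview in 14, standard since 16); the describe method is my own example:

```java
class PatternExample {
    static String describe(Object obj) {
        // The instanceof test and the cast happen in one step,
        // binding the result to the pattern variable s.
        if (obj instanceof String s && !s.isBlank()) {
            return "a non-blank string of length " + s.length();
        }
        return "something else";
    }
}
```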

Record keyword for immutable data classes (14). Accessor methods, a canonical constructor, plus equals, hashCode, and toString are generated automatically. Lombok is still more flexible, but this is nice for simple cases.
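
A minimal example (records previewed in 14 and became standard in 16; Point is made up):

```java
record Point(int x, int y) { }

class RecordExample {
    void demo() {
        var p = new Point(1, 2);
        // Accessors, equals, hashCode, and toString all come for free.
        System.out.println(p.x() + " " + p);           // 1 Point[x=1, y=2]
        System.out.println(p.equals(new Point(1, 2))); // true
    }
}
```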

Sealed classes for fine-grained inheritance control (15). Super-classes that are widely accessible but not widely extensible.
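
Roughly (sealed types previewed in 15 and became standard in 17; the shapes are made up):

```java
// Only the types listed in the permits clause may implement Shape,
// so the compiler knows the full set of subtypes.
sealed interface Shape permits Circle, Square { }

record Circle(double radius) implements Shape { }
record Square(double side) implements Shape { }
```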

Paradigm Updates

For lack of a better name I'll call these paradigm updates as they relate more to programming models. 

Flow API as an implementation of the Reactive Streams specification (9). This seems to be a way to get the specification's interfaces into the JDK rather than to replace the libraries that provide more complete implementations.
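
As a sketch of the interfaces, here's a subscriber printing items from the JDK's SubmissionPublisher (a deliberately simplistic example, not how you'd structure real code):

```java
import java.util.List;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

class FlowExample {
    static void demo() throws InterruptedException {
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(new Flow.Subscriber<String>() {
                private Flow.Subscription subscription;

                public void onSubscribe(Flow.Subscription s) { subscription = s; s.request(1); }
                public void onNext(String item) { System.out.println(item); subscription.request(1); }
                public void onError(Throwable t) { t.printStackTrace(); }
                public void onComplete() { System.out.println("done"); }
            });
            List.of("a", "b", "c").forEach(publisher::submit);
        }
        Thread.sleep(100); // crude wait for the asynchronous delivery in this toy example
    }
}
```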

The Vector API introduces explicit vectorization to Java (16). It looks like it doesn't happen automatically and requires special code, so to be more broadly useful I think it needs to end up inside common libraries, the way NumPy does for Python.
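
Here's roughly what that special code looks like; the API is still incubating, so this needs --add-modules jdk.incubator.vector and may change between releases:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

class VectorExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Computes c[i] = a[i] * a[i] + b[i], processing SPECIES.length() lanes at a time.
    static void compute(float[] a, float[] b, float[] c) {
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            va.mul(va).add(vb).intoArray(c, i);
        }
        for (; i < a.length; i++) {
            c[i] = a[i] * a[i] + b[i]; // scalar loop for the leftover tail
        }
    }
}
```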

Virtual threads and structured concurrency (19). The one-to-one mapping between user threads and kernel threads is broken, enabling easier asynchronous programming. Read/watch Project Loom: Revolution in Java Concurrency or Obscure Implementation Detail? The tl;dr is we'll still need a higher level of abstraction like reactive programming unless you want to relearn all the low-level concurrency structures.
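
Still, the basic API is simple. A sketch (virtual threads are a preview feature in 19, so this needs --enable-preview):

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

class VirtualThreadExample {
    static void demo() {
        // One cheap virtual thread per task instead of a carefully sized platform-thread pool.
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            IntStream.range(0, 10_000).forEach(i ->
                executor.submit(() -> {
                    Thread.sleep(Duration.ofSeconds(1)); // blocking is cheap on a virtual thread
                    return i;
                }));
        } // ExecutorService is AutoCloseable in 19; close() waits for the tasks to finish
    }
}
```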

Friday, April 15, 2022

Probabilistic Graphical Models

Deep learning and neural networks get a lot of (deserved) attention, but there is another class of ML models called Probabilistic Graphical Models (PGMs) that can also be used for inference and prediction. They have applications in fields such as medical diagnosis, image understanding, and speech recognition. Think decision making based on incomplete or insufficient knowledge.

More formally, PGMs use graphs to encode joint probability distributions as opposed to the more traditional ML approach of learning a function that directly maps input to a target variable. This post isn't a technical introduction though. Rather, it is more of an introduction-by-example and a summary of pgmpy's excellent notebooks.

Given a simple graph for flower type:

Our two approaches would look something like this:

Bayesian networks

In this section I'll use a more complex graph for student grades:

For problems with many features and/or high-cardinality features, inference is difficult because the size of the joint probability distribution grows exponentially. PGMs can represent it compactly by exploiting conditional independence, and they provide efficient methods for doing inference over these joint distributions.

In this graph each node has a cardinality of 2 except Grade, which has 3. The full joint distribution would require storing 48 values (2*2*3*2*2) while the PGM only requires 26 (see notebook 1 for details).

This is what's known as a Bayesian network, which is always represented as a directed acyclic graph. Each node is parameterized by a conditional probability distribution (CPD) of the form P(node|parents). For example, the Grade node has the CPD P(G|D,I). Bayesian networks are used when you want to represent causal relationships between random variables. Naive Bayes is a special case where the features are assumed to be conditionally independent of each other given the target variable, which is the sole parent of every feature node.
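
Concretely, and assuming the usual student graph from the pgmpy notebooks (Difficulty and Intelligence → Grade, Intelligence → SAT, Grade → Letter), the network factorizes the joint distribution into its CPDs, which is where the 26 values come from:

```latex
P(D, I, G, S, L) = P(D)\, P(I)\, P(G \mid D, I)\, P(S \mid I)\, P(L \mid G)

\underbrace{2}_{P(D)} + \underbrace{2}_{P(I)} + \underbrace{12}_{P(G \mid D, I)}
  + \underbrace{4}_{P(S \mid I)} + \underbrace{6}_{P(L \mid G)} = 26
  \quad \text{vs.} \quad 2 \cdot 2 \cdot 3 \cdot 2 \cdot 2 = 48
```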

Given tabular data and a graph structure, CPDs can be estimated using Maximum Likelihood Estimation (MLE). It's similar to what was done with the Iris data in the first code block above. It's also fragile because it is so dependent on the amount and quality of the observed data (see notebook 10 for details). This explains why that code breaks with some random seeds. 

A better solution is Bayesian Parameter Estimation. There you start with CPDs based on your prior beliefs (or uniform priors) and update them based on the observed data.
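
In symbols (my notation, not pgmpy's), MLE for the Grade CPD is just a ratio of counts, while the Bayesian estimate smooths those counts with pseudo-counts α from a Dirichlet prior:

```latex
\hat{P}_{\mathrm{MLE}}(G = g \mid D = d, I = i) = \frac{N(g, d, i)}{\sum_{g'} N(g', d, i)}

\hat{P}_{\mathrm{Bayes}}(G = g \mid D = d, I = i) =
  \frac{N(g, d, i) + \alpha_g}{\sum_{g'} \bigl( N(g', d, i) + \alpha_{g'} \bigr)}
```

With little data the MLE fractions can be wildly off, or even undefined when a parent combination is never observed, which is exactly the fragility mentioned above.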

One method of exact inference in PGMs is variable elimination. It efficiently avoids computing the entire joint probability distribution (see notebooks 2 and 5 for details). For larger graphs there are other, approximate algorithms because an exact solution would be intractable.
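
As a sketch on the student graph, computing the marginal of Letter by pushing the sums inside the factor product instead of materializing the full joint:

```latex
P(L) = \sum_{G} P(L \mid G) \sum_{D} P(D) \sum_{I} P(I)\, P(G \mid D, I)
       \underbrace{\sum_{S} P(S \mid I)}_{=\,1}
```

Each inner sum only ever touches a small factor, which is what keeps the computation tractable.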

Making predictions is similar. Instead of getting a distribution we get the most probable state.
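
In other words, a MAP-style query rather than a full posterior, e.g.:

```latex
\hat{g} = \arg\max_{g} \; P(G = g \mid D = d, I = i)
```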

Markov networks

Markov networks are represented by undirected graphs. They represent non-causal relationships. They can, however, represent dependencies that a Bayesian model can't, like cycles and bi-directional dependencies. Factors describe the affinity between connected variables, i.e. how strongly their states agree with each other. The joint probability distribution is the normalized product of all factors.
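
In symbols, with φ_c a factor over a connected set of variables X_c and Z the normalizing (partition) constant:

```latex
P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{c} \phi_c(X_c),
\qquad
Z = \sum_{X_1, \ldots, X_n} \prod_{c} \phi_c(X_c)
```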

A quick note because the names sound similar: Markov chains are not PGMs because the nodes are not random variables. They can, however, be represented as Bayesian networks, which makes the PGM algorithms available.

Sampling

Sampling algorithms approximate exact inference by generating a large number of samples whose empirical distribution converges to the target distribution. One of these is Hamiltonian Monte Carlo. It is a Markov chain Monte Carlo (MCMC) algorithm that proposes future states in the Markov chain using Hamiltonian dynamics from physics (see notebook 8 for details). Other MCMC algorithms you may encounter are Metropolis-Hastings and Gibbs sampling. See Monte Carlo Approximation Methods: Which one should you choose and when? for a comparison of these methods.
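
For reference, the acceptance rule at the heart of Metropolis-Hastings, where p is the (possibly unnormalized) target distribution and q is the proposal distribution:

```latex
A(x \to x') = \min\!\left(1, \; \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)
```

Hamiltonian Monte Carlo is essentially a smarter way of generating the proposals x' so that fewer of them get rejected.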

Other interesting finds that fit in at this point are the PyMC3 library and the Probabilistic Programming and Bayesian Methods for Hackers open-source book.

I also think this is a nice writeup on Bayesian Logistic Regression using Pyro, another probabilistic programming library, and MCMC.

Learning networks

Learning a Bayesian network can be done as an optimization problem by scoring networks on how well they fit a data set, and searching through the space of all possible models. For non-trivial graphs where an exhaustive search is not possible, hill climbing can be used (see notebook 11 for details).
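
One common score is BIC, which rewards fit but penalizes the number of free parameters so the search doesn't just keep adding edges (N is the number of samples and dim(G) the number of CPD parameters):

```latex
\mathrm{BIC}(G; \mathcal{D}) = \log \hat{L}(G; \mathcal{D}) - \frac{\log N}{2}\, \mathrm{dim}(G)
```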

Wrap-up

I've only scratched the surface here, but I think it's a more intuitive introduction to the topic than most of the material in this space. We could build up to more complex graphs and problems from here.


And to bring things full circle on where PGMs fit in to the ML landscape, here is an opinion from well-known ML researcher Ian Goodfellow:
The two aren’t mutually exclusive. Most applications of neural nets can be considered graphical models that use neural nets to provide some of the conditional probability distributions. You could argue that the graphical model perspective is growing less useful because so many recent neural models have such simple graph structure. These graphs are not very structured compared to neural models that were popular a few years ago, like … But there are some recent models that make a little bit of use of graph structure, like VAEs with auxiliary variables.

Plus a tweet from the Stanford NLP group:

Thus it would seem that knowing these concepts will continue to be useful even if we don't use PGMs directly or focus on them exclusively.