In part 2 I focused on ML when your data won't fit in memory. This post moves on to ML that is slow or compute-bound instead. I'll continue to use Dask and explore how it can help us.
Dask leverages multiple CPU cores to enable efficient parallel computation on a single machine. It can also run on a thousand-machine cluster, breaking up computations and routing them efficiently. The Comparison to Spark documentation is a great reference for understanding Dask in the context of an older, more established tool. Perhaps the most interesting difference is that Spark is an extension of the MapReduce paradigm, while Dask is built on generic task scheduling, which lets it implement more sophisticated algorithms.
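To make the task-scheduling point concrete, here's a contrived sketch using dask.delayed (the load, clean, and combine functions are made up for illustration): calls build an arbitrary task graph rather than fixed map and reduce stages.

```python
import dask

# These functions are made up for illustration
@dask.delayed
def load(part):
    return list(range(part * 3, part * 3 + 3))

@dask.delayed
def clean(xs):
    return [x for x in xs if x % 2 == 0]

@dask.delayed
def combine(lists):
    return sum(lists, [])

# Nothing runs yet; each call just adds a node to a task graph,
# and the graph can take any shape, not only map + reduce stages
parts = [clean(load(p)) for p in range(4)]
total = combine(parts)

print(total.compute())  # the scheduler executes the graph in parallel
```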
Sticking with the scikit-learn examples, here is how to replace a parallel algorithm's Joblib backend with Dask to (potentially) spread work out across a cluster.
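A minimal sketch of the backend swap, assuming a local Dask cluster (the dataset and model are stand-ins; any estimator that parallelizes via Joblib works the same way):

```python
from dask.distributed import Client  # importing this registers the "dask" joblib backend
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # local cluster here; pass a scheduler address to use a remote one

# Stand-in data and model for illustration
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# Swap Joblib's default backend for Dask so the fit can fan out to the cluster
with joblib.parallel_backend("dask"):
    model.fit(X, y)
```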
The Dask documentation is quick to point out in its best practices that not everyone needs distributed ML, since it comes with overhead. Compiling code with Numba or Cython can help, as can intelligently sampling some of your data. In this post I got huge speedups by vectorizing code that was looping through large matrices.
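As a toy illustration of that last point (not the actual code from this post), here is the same row-wise computation written as a Python loop and as a single vectorized NumPy expression:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((5_000, 1_000))
b = rng.random((5_000, 1_000))

# Slow: an explicit Python loop over the rows of the matrices
row_dots = np.empty(a.shape[0])
for i in range(a.shape[0]):
    row_dots[i] = a[i] @ b[i]

# Fast: one vectorized expression computing the same row-wise dot products
row_dots_vec = (a * b).sum(axis=1)

assert np.allclose(row_dots, row_dots_vec)
```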
Across this three-part series we've now seen how to speed up reading large datasets, work with datasets that don't fit entirely in memory, and distribute processing across multiple machines. There's obviously a lot more to this, but I wanted to develop a better intuition for how to approach these types of issues and at least know where to start. Hopefully you learned something too.
UPDATE:
No specific tool is meant to be the focus of this post. Dask was used here to illustrate the idea and show how simple it can be, but there are other options in this space. Here are a couple of other examples I've come across:
- tune-sklearn, which is part of the Ray framework for building distributed applications, can do distributed hyperparameter tuning
- Distributed training with TensorFlow shows how you can distribute model training within the TensorFlow ecosystem (a minimal sketch follows this list)
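To give a flavor of the TensorFlow approach, here is a rough sketch assuming the Keras API and a machine with multiple GPUs; MirroredStrategy is just one of several available strategies:

```python
import tensorflow as tf

# MirroredStrategy replicates the model on each local GPU and
# keeps the copies in sync (data parallelism)
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored across replicas
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then splits each batch across the replicas
```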
UPDATE 2:
Both Pandas 2.0 and Polars now use Apache Arrow as their memory model. Polars, a relatively new entrant in this space, is implemented in Rust and exposes a Python API. It was created specifically for fast data processing rather than ML, but it overlaps enough with this series to be worth checking out.
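As a quick taste of the Polars API (the file and column names here are made up), its lazy query style lets it optimize the whole plan before executing:

```python
import polars as pl

# Lazily scan a (hypothetical) CSV so Polars can optimize the whole query
result = (
    pl.scan_csv("measurements.csv")
    .filter(pl.col("value") > 0)
    .group_by("sensor")
    .agg(pl.col("value").mean().alias("mean_value"))
    .collect()  # execute the optimized plan
)
print(result)
```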