When the data we want to train a machine learning model on becomes too big to fit in memory, we need a way to work on subsets of it. I hinted at this in part 1: we can use Pandas to read a file in chunks, but that approach is fairly primitive and slow.
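For reference, a minimal sketch of that chunked approach (the file name, chunk size, and per-chunk work are placeholders):

```python
import pandas as pd

total_rows = 0
# Read a large CSV in fixed-size chunks instead of all at once.
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    total_rows += len(chunk)  # stand-in for real per-chunk work
print(total_rows)
```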
Libraries like Vaex and Dask attempt to abstract this away.
Vaex provides lazy, out-of-core (not all in memory at once) DataFrames via memory-mapping. Pre-processing and feature engineering run without loading the full dataset, which frees memory for model training. It also has a vaex.ml package that provides a scikit-learn wrapper.
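A minimal Vaex sketch, assuming an HDF5 file on disk (the file and column names are placeholders):

```python
import vaex

# Memory-maps the file; nothing is loaded eagerly.
df = vaex.open("data.hdf5")

# Expressions define virtual columns, evaluated lazily on demand.
df["ratio"] = df.x / df.y

# Aggregations stream over the memory-mapped data.
print(df.mean(df.ratio))
```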
Dask provides large, parallel DataFrames composed of smaller Pandas DataFrames (partitions). This helps with data too big to fit in memory because the individual partitions can stay on disk until needed. Dask-ML provides estimators designed to work with Dask DataFrames.
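The equivalent Dask sketch (the file pattern and column names are again placeholders):

```python
import dask.dataframe as dd

# Each matched file becomes one or more pandas partitions.
df = dd.read_csv("data-*.csv")

# Operations build a lazy task graph; compute() executes it,
# processing partitions in parallel rather than loading everything.
print(df.groupby("category")["value"].mean().compute())
```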
In my tests, the accuracies both came out to 96%. Similar ideas, different implementations. In fact, these DataFrames remind me a little bit of the persistent data structures covered in my Exploring Immutability post.
Even with these fancy DataFrames, many machine learning algorithms are designed to train on all the data at once. If our data is too big to fit in memory, that's going to be a problem. The solution is to use online learning or incremental algorithms. The Incremental (Dask-ML) and IncrementalPredictor (vaex.ml) wrapper classes handle "streaming" the data to the estimator in batches, and we specify the possible classes upfront, as in the sketch below.
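Here's a minimal sketch of the Dask-ML side, assuming SGDClassifier as the underlying estimator and toy Dask arrays for data, so treat it as illustrative rather than the exact code from my experiments:

```python
import dask.array as da
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

# Toy data as Dask arrays: ten 1,000-row chunks.
X = da.random.random((10_000, 5), chunks=(1_000, 5))
y = da.random.randint(0, 2, size=(10_000,), chunks=(1_000,))

# Wrap any estimator that implements partial_fit; Incremental
# feeds it one chunk (batch) at a time.
clf = Incremental(SGDClassifier())

# Every possible class must be declared upfront, since no single
# batch is guaranteed to contain them all.
clf.fit(X, y, classes=[0, 1])
```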
Dask-ML adds several generalized linear model implementations. scikit-multiflow is designed for actual streaming data and adds several other online learning algorithms. Neural networks are also trained in this manner, typically fed "mini-batches", so they too are good candidates for datasets that don't fit in memory.
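To illustrate that last point, scikit-learn's MLPClassifier exposes partial_fit, so even a small neural network can be trained batch by batch. The batches below are synthetic; in practice each one would be read from disk (e.g. a Pandas chunk):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

net = MLPClassifier(hidden_layer_sizes=(32,))
classes = np.array([0, 1])

rng = np.random.default_rng(0)
for _ in range(100):
    # Simulated mini-batch of 64 rows with 10 features.
    X_batch = rng.random((64, 10))
    y_batch = rng.integers(0, 2, size=64)
    # partial_fit updates the weights using just this batch;
    # classes must be passed on (at least) the first call.
    net.partial_fit(X_batch, y_batch, classes=classes)
```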
Stay tuned for part 3, where I'll get into being compute-constrained instead of memory-constrained.
UPDATE:
There's a lot of I/O and memory-related work coming out of the TensorFlow community as well. Check out Better performance with the tf.data API, as it overlaps nicely with parts 1 and 2 of this series.