I recently came across a somewhat large dataset from a Kaggle competition where the data was provided as an approximately 6 GB CSV file. A frequent comment in the discussion forum was how long it took just to read this file. A few GB is large enough that this starts to become noticeable, but it's not really that big. If a laptop can have a 1+ TB drive and 32 GB of memory, then this isn't even in the realm of "big data". That's good, though, because it means there are some simple tricks we can use to cut down on that read time.
The pandas read_csv() function takes about 63 seconds on my machine. That's our baseline.
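A minimal sketch of that baseline, just timing the plain read; the filename here is a placeholder for the actual competition CSV:

```python
import time

import pandas as pd

start = time.perf_counter()
df = pd.read_csv("train.csv")  # placeholder path for the ~6 GB CSV
print(f"read_csv: {time.perf_counter() - start:.1f} s")
```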
First we try reducing the precision of the numeric columns. This gets us to 59 seconds. Not great.
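One way to do that is to tell pandas up front to read floats as 32-bit instead of the default 64-bit. The column names below are hypothetical stand-ins for whatever numeric columns the dataset actually has:

```python
import pandas as pd

# Hypothetical column names; map each numeric column to a narrower dtype.
dtypes = {"feature_1": "float32", "feature_2": "float32", "target": "float32"}

df = pd.read_csv("train.csv", dtype=dtypes)
```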
Next we try reading the file in chunks. This is actually slower, but the technique could help us if the file didn't fit entirely in memory. More on that in future posts.
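The chunked read looks roughly like this; the chunk size of one million rows is an arbitrary choice:

```python
import pandas as pd

chunks = []
for chunk in pd.read_csv("train.csv", chunksize=1_000_000):
    # Per-chunk filtering or aggregation could go here so the full
    # DataFrame never has to live in memory all at once.
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
```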
Then we try Dask, which spreads the work across multiple processor cores. 30 seconds. Better, but I still don't want to wait that long. More on Dask in future posts as well.
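A sketch of the Dask version, again with the placeholder filename:

```python
import dask.dataframe as dd

# Lazily splits the CSV into partitions, then reads them in parallel.
ddf = dd.read_csv("train.csv")
df = ddf.compute()  # materialize the result as a regular pandas DataFrame
```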
Finally we convert the CSV file to a different format. I tried Apache Parquet, but there are others. It's a binary columnar format (remember Column-oriented Database Basics?). Stored in this format, the data is 2.5 GB. And reading the whole file now takes just 3 seconds!
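The conversion is a one-time cost that pays for the slow CSV read once; afterwards every load goes through the Parquet file. A minimal sketch, assuming pyarrow (or fastparquet) is installed and keeping the placeholder filenames:

```python
import pandas as pd

# One-time conversion: pay the slow CSV read once, write Parquet.
df = pd.read_csv("train.csv")
df.to_parquet("train.parquet")  # requires pyarrow or fastparquet

# Every load after that reads the compact columnar file instead.
df = pd.read_parquet("train.parquet")
```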
Converting our data to a binary file format, and possibly reducing the precision or using Dask as well, would really shorten our feedback loop while training an ML model. When the data is this size, any cleaning or preprocessing we can do ahead of time makes sense to do once, before converting the file format.
UPDATE:
Apache Arrow is another project to check out in this space, along with memory-mapped files. Reading this data in the Feather file format is even faster than Parquet.
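The Feather round trip looks much the same as the Parquet one; this sketch assumes pyarrow is installed and reuses the placeholder filenames:

```python
import pandas as pd

# One-time conversion to Feather, Arrow's on-disk format (requires pyarrow).
df = pd.read_csv("train.csv")
df.to_feather("train.feather")

# Subsequent loads read the Feather file directly.
df = pd.read_feather("train.feather")
```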