Saturday, June 13, 2020

Column-oriented Database Basics

This is a short post on column-oriented databases. I'll barely scratch the surface, but among the types of NoSQL databases—document, key-value, column-oriented and graph—I've always thought column-oriented was the most difficult to wrap my head around. Hopefully we can get past that initial hurdle here, run a few queries, and they will seem like less of a mystery.

A relational database is optimized for retrieving rows of data, which works well for transactional applications. A column-oriented database is optimized for retrieving columns of data, which works well for analytical applications; some queries, like aggregations, become very fast because far less data needs to be read from disk to fetch a whole column. There seems to be a lot of overlap between column-oriented databases and data warehouses.

A relational database would store 3 rows of data like this:

1:a,b,c;2:d,e,f;3:g,h,i

While a column-oriented database would store the same data like this:

a:1,d:2,g:3;b:1,e:2,h:3;c:1,f:2,i:3
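
To make the layout difference concrete, here's a small Python sketch (the table and names are my own, not from any particular database) that stores the same three rows both ways:

# Three rows with columns col1, col2, col3.
rows = [("a", "b", "c"), ("d", "e", "f"), ("g", "h", "i")]

# Row-oriented layout: the values for one row sit together,
# like 1:a,b,c;2:d,e,f;3:g,h,i above.
row_store = {row_id: row for row_id, row in enumerate(rows, start=1)}

# Column-oriented layout: the values for one column sit together,
# like a:1,d:2,g:3;... above.
column_store = {
    "col1": ["a", "d", "g"],
    "col2": ["b", "e", "h"],
    "col3": ["c", "f", "i"],
}

# Aggregating col1 only reads one contiguous list; the row store
# has to read every row and pick out a single field.
print(column_store["col1"])                    # ['a', 'd', 'g']
print([row[0] for row in row_store.values()])  # same result, but reads all rows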

For a low-cardinality column, compression algorithms work very well. Something like a:2,a:3 becomes a:2,3: the value is stored once, followed by the rows where it appears. In some ways this is like normalizing a relational database to reduce data duplication, but I don't think normalization enables the same level of compression or gives you the same data locality benefits.
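
As a toy illustration of that idea (my own encoding, not any real database's format), a low-cardinality column can be stored as one entry per distinct value plus the rows where it appears:

from collections import defaultdict

# A low-cardinality column: the value for rows 1 through 6.
column = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "a"}

# Store each distinct value once with the rows it covers,
# turning a:2,a:3 into a:2,3 as described above.
compressed = defaultdict(list)
for row_id, value in column.items():
    compressed[value].append(row_id)

print(dict(compressed))  # {'a': [1, 2, 3, 6], 'b': [4, 5]}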

Of course column-oriented databases aren't good for all workloads. They aren't optimized for queries that touch many fields, and writes can be slow because inserting a single row means touching every column file instead of just appending one record to the end of a file.

So where do they fit into a big data architecture? When thinking about this I remembered the article How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. It avoids mentioning specific products, but the way I interpret it is that the first generation was dedicated data warehouse products where you did ETL. The second generation was more ELT, maybe using Hadoop and Spark. The third generation, the eponymous data mesh, unifies batch and stream processing, perhaps adding Kafka to the above mix. Using the CQRS pattern, for example, a column-oriented database could fulfill some of the Q (query) operations as a sort of data warehouse.
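
To make that last point concrete, here's a purely conceptual Python sketch of the CQRS idea (both "stores" are in-memory stand-ins I made up): writes land in a row-oriented store, and a projection step keeps a column-oriented read model up to date for queries:

# Conceptual CQRS sketch: commands go to a row store,
# queries are served from a columnar read model.
row_store = []                                 # the C (command) side
column_store = {"buyerid": [], "qtysold": []}  # the Q (query) side

def handle_command(buyerid, qtysold):
    """Accept a write on the transactional side."""
    row_store.append({"buyerid": buyerid, "qtysold": qtysold})

def project():
    """Rebuild the columnar read model. In practice this would be
    incremental, e.g. fed by a Kafka topic."""
    column_store["buyerid"] = [r["buyerid"] for r in row_store]
    column_store["qtysold"] = [r["qtysold"] for r in row_store]

handle_command(1, 2)
handle_command(1, 4)
project()
print(sum(column_store["qtysold"]))  # aggregate answered from the Q side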

For my first column-oriented database adventure, after that background research, I chose Amazon Redshift and followed the Getting Started with Amazon Redshift guide. It only took about an hour to spin up a Redshift cluster, upload the sample data to an S3 bucket, copy it into Redshift, and run a few SQL queries.
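
The guide walks you through these steps in the console, but to give a flavor of the load step, here's a rough Python sketch of the COPY from S3 using psycopg2 (Redshift speaks the PostgreSQL wire protocol). Every identifier below (cluster endpoint, credentials, bucket, IAM role) is a placeholder, and the exact COPY options depend on the sample files, so defer to the guide:

import psycopg2  # Redshift accepts PostgreSQL client connections

# All connection details and paths below are placeholders.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="<password>",
)
conn.autocommit = True

with conn.cursor() as cur:
    # Load a pipe-delimited text file from S3 into an existing table.
    cur.execute("""
        COPY users
        FROM 's3://my-sample-bucket/tickit/allusers_pipe.txt'
        IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
        DELIMITER '|';
    """)
    cur.execute("SELECT COUNT(*) FROM users;")
    print(cur.fetchone())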

The big surprise and takeaway was that this felt just like using a relational database. The data I uploaded was in text files (basically CSV files) and was used to create tables. The queries I ran were SQL queries that joined tables, selected different fields and aggregated the results. The details of how the database works under the covers are interesting, but they don't change how you use it. There isn't much mystery after all.

SELECT * FROM event;

SELECT firstname, lastname, total_quantity
FROM (SELECT buyerid, SUM(qtysold) total_quantity
      FROM sales
      GROUP BY buyerid
      ORDER BY total_quantity DESC
      LIMIT 10) Q, users
WHERE Q.buyerid = userid
ORDER BY Q.total_quantity DESC;

A couple of other column-oriented databases you may have heard of are Cassandra and HBase. NoSQL databases are built to scale horizontally, so when choosing among them you also need to consider how the CAP Theorem applies to your situation; each one makes different trade-offs in addition to having its own unique features. The best choice will be highly data and workload specific.