Saturday, August 9, 2014

Vectorization: A Matrix Multiplication Example

It was in the Coursera Machine Learning class that I first learned about vectorization. When dealing with vectors and matrices, instead of writing loops and computing operations on scalars, you can perform the operations directly on the higher-dimensional data structures. It is a form of data parallelism in which one processor applies the same operation to multiple data elements simultaneously, without involving any concurrency.

Vectorization is made possible by array programming languages and libraries like Octave, R, and NumPy. It can be applied automatically as a compiler optimization, or exposed through an interface that lets the programmer indicate which operations to vectorize. At a lower level, it is implemented with SIMD processor instructions, or by using, for example, 32-bit instructions to simulate vector computations on 16- or 8-bit types.
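To illustrate what that programmer-facing interface looks like, here is a small NumPy sketch (variable names are my own) contrasting a scalar-style Python loop with the equivalent whole-array expression:

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)

# Scalar style: an interpreted Python loop, one element at a time.
c_loop = np.empty_like(a)
for i in range(a.size):
    c_loop[i] = a[i] + b[i]

# Vectorized style: one expression, executed in compiled
# (and typically SIMD) code inside NumPy.
c_vec = a + b

assert np.array_equal(c_loop, c_vec)
```

The two computations produce the same result; the vectorized form simply hands the entire loop to optimized native code.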

A simple example to demonstrate this, I think, is matrix multiplication. In Java (where no vectorization is possible without resorting to the JNI), a simple (but partially optimized) implementation looks like this:
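The original code block is not shown; a sketch of the kind of triple-loop implementation described, where the "partial optimization" is assumed to be a cache-friendly loop ordering, might look like:

```java
import java.util.Random;

public class MatMul {
    // Triple-loop matrix multiply with the i-k-j loop order, so the
    // inner loop walks both b and c row-by-row (cache-friendly).
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, p = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < p; k++) {
                double aik = a[i][k];          // hoist the invariant factor
                for (int j = 0; j < m; j++) {
                    c[i][j] += aik * b[k][j];  // sequential row-major access
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 1000;
        Random rnd = new Random();
        double[][] a = new double[n][n], b = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                a[i][j] = rnd.nextDouble();
                b[i][j] = rnd.nextDouble();
            }
        }
        long start = System.nanoTime();
        double[][] c = multiply(a, b);
        System.out.printf("%.2f seconds%n", (System.nanoTime() - start) / 1e9);
    }
}
```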

Multiplying 1000 x 1000 matrices containing random numbers takes about 6 seconds on my laptop. You can see how this will not work for machine learning calculations on matrices with millions of elements.

In Python, I can do the same multiplication using NumPy like this:
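The snippet itself is missing; a minimal reconstruction of a NumPy version (variable names are my own) would be:

```python
import time

import numpy as np

n = 1000
a = np.random.rand(n, n)  # n x n matrix of uniform random numbers
b = np.random.rand(n, n)

start = time.time()
c = a.dot(b)  # vectorized matrix multiply, dispatched to optimized BLAS
print("%.3f seconds" % (time.time() - start))
```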

It takes about 0.02 seconds. Obviously this isn't a rigorous performance comparison, but the two-orders-of-magnitude difference still illustrates the power of vectorization.