
Monday, January 15, 2024

CUDA Kernels: Speeding up Matrix Multiplication

The Python ecosystem benefits greatly from being able to use libraries written in C/C++ and Rust (see my previous post) to increase performance. Increasingly, though, I've been running code on GPUs instead. Libraries like PyTorch simplify this, but how do they do it? Previously I could only answer something like "it's CUDA" without knowing what that really means. In this post we'll dive deeper and see what it takes to create our own CUDA kernel. We'll (re-)implement matrix multiplication, run it on a GPU, and compare its performance to numpy and PyTorch.

First, what is CUDA? CUDA is a general-purpose parallel computing platform and API. It gives direct access to the virtual instruction set and parallel compute elements of NVIDIA GPUs. It works with C, C++, and Fortran, has Java and Python wrappers, and is supposed to be easier to use than earlier APIs like OpenCL. CUDA lets us execute kernels, which are functions compiled for the GPU and run separately from the main program. This is how the majority of deep learning happening today actually runs.

Second, what makes GPUs so special? GPUs are generally slower than CPUs at any single operation but excel at work that can be parallelized. They can have thousands of cores and much higher memory bandwidth. If we want to tackle problems that are more than embarrassingly parallel, like matrix multiplication, this means we need a parallel algorithm to make use of the GPU.

Third, how do we access it from Python? It turns out there are several options, i.e. several Python wrappers for CUDA. I chose to go with Numba, a just-in-time (JIT) compiler that can target both CPUs and GPUs. It lets you write kernels directly in a subset of Python, as shown in the sketch below. It does not implement the complete CUDA API, but it supports enough to tackle many problems. There are other options, like CuPy and PyCUDA, that I haven't investigated.
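
To give a flavor of what writing a kernel in Numba looks like, here's a minimal sketch (my own example, not the benchmark code from this post): each GPU thread adds one element of two vectors.

```python
from numba import cuda
import numpy as np

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)           # absolute index of this thread across the whole grid
    if i < out.size:           # guard: the grid may have more threads than elements
        out[i] = x[i] + y[i]

n = 1_000_000
x = np.ones(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)   # Numba copies the arrays to/from the GPU here
```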

Lastly, before we get to the code, there is the topic of writing efficient kernels. While that is beyond the scope of this post, I can say that I've at least learned it requires understanding several concepts beyond general concurrent programming. CUDA kernels are not for the faint of heart; this isn't something you can pick up in a couple of hours or from a single tutorial. One of the first things you'll notice, as a very simple example, is the need to explicitly move data back and forth between the CPU and the GPU. And kernels can't return values; you can only pass in inputs and outputs.
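
As a sketch of that explicit data movement in Numba (the kernel and names here are illustrative), you copy or allocate device arrays yourself, launch the kernel, and copy the result back when it has run:

```python
from numba import cuda
import numpy as np

@cuda.jit
def double_kernel(inp, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = 2.0 * inp[i]    # kernels write into output arrays; they can't return values

a = np.arange(1024, dtype=np.float32)

d_a = cuda.to_device(a)                # copy the input from the CPU to the GPU
d_out = cuda.device_array_like(d_a)    # allocate the output directly on the GPU

double_kernel[4, 256](d_a, d_out)      # 4 blocks x 256 threads covers all 1024 elements

result = d_out.copy_to_host()          # copy the result back to the CPU
```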

For a better intro, I recommend reading the four-part CUDA by Numba Examples series (part 2, part 3, part 4).

Our baseline for this experiment will be numpy's highly optimized matmul function. It supports vectorized array operations and some multi-threading by releasing the GIL (see Parallel Programming with numpy and scipy). The underlying BLAS routines are optimized for your hardware, and this method of matrix multiplication has been tuned over the past several decades.
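
The baseline looks roughly like this (matrix sizes here are illustrative, not the exact ones I benchmarked):

```python
import time
import numpy as np

n = 4096
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
c = np.matmul(a, b)   # dispatches to the hardware-tuned BLAS gemm routine
print(f"numpy matmul took {time.perf_counter() - start:.3f}s")
```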

Here we'll try our own kernel. It moves the computation from the CPU to the GPU, but it likely lacks many of the optimizations that a battle-tested library has. Still, for medium-sized, two-dimensional matrices, we get some performance improvement.
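
A naive version of such a kernel in Numba looks like the sketch below: one thread computes one element of the output matrix. The sizes and launch configuration are illustrative, and it has none of the optimizations (e.g. shared-memory tiling) a tuned library would use.

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def matmul_kernel(A, B, C):
    row, col = cuda.grid(2)                      # this thread's output coordinates
    if row < C.shape[0] and col < C.shape[1]:
        tmp = 0.0
        for k in range(A.shape[1]):              # dot product of a row of A and a column of B
            tmp += A[row, k] * B[k, col]
        C[row, col] = tmp

n = 2048
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_C = cuda.device_array((n, n), dtype=np.float32)

threads_per_block = (16, 16)
blocks = (math.ceil(n / 16), math.ceil(n / 16))
matmul_kernel[blocks, threads_per_block](d_A, d_B, d_C)
C = d_C.copy_to_host()                           # bring the result back to the CPU
```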

Now we'll use PyTorch's matmul function. It's highly optimized like numpy's and runs on the GPU like our kernel. It's amazingly fast and works with larger, higher-dimensional matrices.
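
The PyTorch version is a one-liner once the tensors live on the GPU (a sketch, assuming a CUDA-capable device is available):

```python
import torch

n = 2048
a = torch.rand(n, n, device="cuda")
b = torch.rand(n, n, device="cuda")

c = torch.matmul(a, b)      # runs on the GPU, backed by cuBLAS
torch.cuda.synchronize()    # GPU calls are asynchronous; wait before reading timings
```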

NVIDIA has a profiling tool called Nsight Systems that we can use to see GPU utilization. GPUs are expensive, so we want them to be fully utilized. From the reports, I see that the PyTorch implementation used more threads, which is consistent with higher parallelism. It also seems to have a higher ratio of memory operations to kernel operations. I'm not sure I fully understand what that means, but the kernel operations are sgemm (single-precision general matrix multiply), akin to numpy calling BLAS on the CPU.
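
For reference, this is roughly how Nsight Systems is invoked from the command line (the script and report names are placeholders):

```bash
nsys profile -o matmul_report python matmul.py   # record a timeline of CPU/GPU activity
nsys stats matmul_report.nsys-rep                # summarize kernel and memory operations
```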



Creating a CUDA kernel has become accessible enough that I could do it in a couple of hours on a laptop, yet my implementation remains far from the top implementations. Matrix multiplication is obviously a common operation, and great libraries already exist for it. For less common operations, even where a parallel algorithm is known, I would hesitate to go the custom kernel route again. It's not a one-off thing you just try in order to improve performance; it requires a way of thinking and deep optimization knowledge that most software developers don't have. The libraries underlying PyTorch are, for example, optimizing the use of memory caches, shared vs. global memory accesses, and thread utilization, and probably tuning kernel parameters for my specific GPU.

UPDATE:

NVIDIA Finally Adds Native Python Support to CUDA.