programming opiethehokie
Monday, February 17, 2025
Merkle Trees
Monday, January 15, 2024
CUDA Kernels: Speeding up Matrix Multiplication
First, what is CUDA? CUDA is a general-purpose parallel computing platform and API. It gives direct access to NVIDIA GPUs' virtual instruction set and parallel compute elements. It works with C, C++, and Fortran, has Java and Python wrappers, and is supposed to be easier to use than alternative APIs like OpenCL. CUDA allows us to execute kernels, which are functions compiled for the GPU, separately from the main program. This is how the majority of deep learning happening today gets its work done.
Second, what makes GPUs so special? GPUs are generally slower than CPUs at any single operation but excel at work that can be parallelized: they can have thousands of cores and much higher memory bandwidth. If we want to tackle problems beyond the embarrassingly parallel, like matrix multiplication, that means we need a parallel algorithm to make use of the GPU.
Third, how do we access it from Python? It turns out there are several options, i.e. several Python wrappers for CUDA. I chose to go with Numba, a just-in-time (JIT) compiler that can target both CPUs and GPUs. It lets you write kernels directly in a subset of Python. It doesn't implement the complete CUDA API, but it supports enough of it to tackle many problems. There are other options I didn't investigate, like CuPy and PyCUDA.
Lastly, before we get to the code, there's the topic of writing efficient kernels. While that's beyond the scope of this post, I can say I've at least learned that it requires understanding several concepts beyond general concurrent programming. CUDA kernels are not for the faint of heart; they're not something you can pick up in a couple of hours or from following one tutorial. One of the first things you'll notice, as a very simple example, is the need to explicitly move data back and forth between the CPU and the GPU. And kernels can't return values: you can only pass inputs and outputs.
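As a minimal sketch of what that looks like with Numba (the kernel and sizes here are purely illustrative, not part of the experiment below):

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(x, out):
    # kernels can't return values, so results are written into an output array
    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = 2.0 * x[i]

x = np.arange(1_000_000, dtype=np.float32)

# explicitly move the input to the GPU and allocate the output there
d_x = cuda.to_device(x)
d_out = cuda.device_array_like(d_x)

# launch configuration: enough blocks of 256 threads to cover the array
threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](d_x, d_out)

# explicitly copy the result back to the CPU
result = d_out.copy_to_host()
```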
For a better intro, I recommend reading the four-part CUDA by Numba Examples series (part 2, part 3, part 4).
Our baseline for this experiment will be numpy's highly optimized matmul function. It supports vectorized array operations and some multi-threading by releasing the GIL (see Parallel Programming with numpy and scipy). The underlying BLAS routines will be optimized for your hardware; this method of matrix multiplication has been tuned over the past several decades.
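The baseline looked something like this sketch (the matrix size is just a representative choice):

```python
import time
import numpy as np

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
c = np.matmul(a, b)  # dispatches to an optimized BLAS gemm routine under the hood
print(f"numpy matmul: {time.perf_counter() - start:.4f}s")
```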
Here we'll try our own kernel. It moves the computation from the CPU to the GPU, but it's likely missing many of the optimizations a battle-tested library has. Still, for medium-sized, two-dimensional matrices, we get some performance improvement.
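Mine was roughly this naive one-thread-per-output-element kernel (the launch configuration is illustrative, and a tuned kernel would add shared-memory tiling and more):

```python
import math
import time
import numpy as np
from numba import cuda

@cuda.jit
def matmul_kernel(a, b, out):
    # each thread computes one element of the output matrix
    row, col = cuda.grid(2)
    if row < out.shape[0] and col < out.shape[1]:
        acc = 0.0
        for k in range(a.shape[1]):
            acc += a[row, k] * b[k, col]
        out[row, col] = acc

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.device_array((n, n), dtype=np.float32)

threads = (16, 16)
blocks = (math.ceil(n / threads[0]), math.ceil(n / threads[1]))

start = time.perf_counter()
matmul_kernel[blocks, threads](d_a, d_b, d_out)
c = d_out.copy_to_host()  # blocks until the kernel has finished
print(f"custom kernel: {time.perf_counter() - start:.4f}s")
```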
Now we'll use PyTorch's matmul function. It's highly optimized like numpy's and gets the GPU benefit like the custom kernel. It's amazingly fast and works with larger, higher-dimensional matrices.
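Something along these lines, again with an illustrative size; the explicit synchronization matters because PyTorch launches GPU work asynchronously:

```python
import time
import torch

n = 2048
a = torch.rand(n, n, device="cuda")
b = torch.rand(n, n, device="cuda")

torch.cuda.synchronize()  # make sure setup is finished before timing
start = time.perf_counter()
c = torch.matmul(a, b)
torch.cuda.synchronize()  # wait for the multiplication to actually complete
print(f"torch matmul: {time.perf_counter() - start:.4f}s")
```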
NVIDIA has a profiling tool called Nsight Systems that we can use to see GPU utilization. GPUs are expensive, so we want them fully utilized. From the reports, I see that the PyTorch implementation used more threads, which is consistent with higher parallelism. It also seems to have a higher ratio of memory operations to kernel operations. I'm not sure I fully understand what that means, but the kernel operations are sgemm, BLAS's single-precision matrix multiplication routine, akin to what numpy uses.
Creating a CUDA kernel has become accessible enough that I could do it in a couple of hours on a laptop, yet my implementation remains far from the best ones. Matrix multiplication is obviously common, and great libraries exist for it. For less common operations, even if there's a known parallel algorithm, I would hesitate to go the custom kernel route again. It's not a one-off thing you just try in order to improve performance; it requires a way of thinking and deep optimization knowledge that most software developers don't have. The libraries underlying PyTorch are, for example, optimizing the use of memory caches, shared vs. global memory accesses, and thread utilization, and probably tuning kernel parameters for my specific GPU.
UPDATE: NVIDIA Finally Adds Native Python Support to CUDA.
Wednesday, July 12, 2023
Rust to Python and Fibonacci Numbers
While Python continues to be the language of choice for ML projects, I'm increasingly seeing mentions of Rust in the packages I use. Hugging Face's fast tokenizers are written in Rust to speed up training and tokenization; they say it "Takes less than 20 seconds to tokenize a GB of text on a server's CPU." I recently used Polars, which is also written in Rust, for a quick task where I wanted to run a SQL query over 20 GB of CSV files (sketched below); it only took about 60 seconds on my laptop. Rust is a way of bypassing Python's GIL to write parallelizable code, in contrast to packages like numpy that have traditionally been written in C/C++ for performance reasons.
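That Polars task boiled down to something like this; the file paths, table name, and query are placeholders rather than what I actually ran:

```python
import polars as pl

# lazily scan all the CSV files so nothing is loaded until the query runs
lf = pl.scan_csv("data/*.csv")

# register the lazy frame under a table name and run SQL against it
ctx = pl.SQLContext(events=lf)
result = ctx.execute(
    "SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id",
    eager=True,
)
print(result)
```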
How does one turn Rust code into something you can call from Python? It turns out I've already done something similar: compiling Rust into WebAssembly and using it in a JavaScript project. For Python, the process is much the same. I followed Calling Rust from Python using PyO3 and re-created the Fibonacci numbers experiment from my previous post (apparently I'm not the only one with this idea). It's a toy example, just intended to show how easily Rust can be leveraged; it ignores more complex data types and any actual multi-threaded code. Perhaps that will be the topic of a future post.
The function in Rust looks like this:
And the Python code that calls it and times it:
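Roughly like this sketch; the compiled module name fibonacci_rs and the exposed function name fib are assumptions based on a typical PyO3 setup:

```python
import time

import fibonacci_rs  # the PyO3 extension module; the name here is assumed

def fib_py(n):
    # the same naive recursive implementation, in pure Python, for comparison
    return n if n < 2 else fib_py(n - 1) + fib_py(n - 2)

n = 35

start = time.perf_counter()
fibonacci_rs.fib(n)
print(f"Rust: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
fib_py(n)
print(f"Python: {time.perf_counter() - start:.2f}s")
```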
Using the same n=35 as in my JavaScript experiment, I'm seeing about 0.05 seconds for Rust vs. 7.31 seconds for pure Python. The likelihood of wanting or needing this in a Python project seems greater than with JavaScript, and my guess is that the trend continues.