Wednesday, July 12, 2023

Rust to Python and Fibonacci Numbers

While Python continues to be the language of choice for ML projects, I'm increasingly seeing mentions of Rust in packages I use. Hugging Face's fast tokenizers are written in Rust to speed up training and tokenization. They say it "Takes less than 20 seconds to tokenize a GB of text on a server's CPU." I recently used Polars, which is also written in Rust, for a quick task where I wanted to run a SQL query on 20 GB of CSV files. It only took about 60 seconds on my laptop. Moving these hot paths to Rust is a way of bypassing Python's GIL and writing code that is parallelizable. This is in contrast to packages like NumPy that have traditionally been written in C/C++ for performance reasons.

How does one turn Rust code into something you can call from Python? It turns out I've already done something similar, compiling Rust into WebAssembly and using it in a JavaScript project. For Python, the process is similar. I followed Calling Rust from Python using PyO3 and re-created the Fibonacci numbers experiment (apparently I'm not the only one with this idea) from my previous post. It's a toy example, just intended to show the ease with which Rust can be leveraged. It ignores more complex data types and any actual multi-threaded code. Perhaps that will be the topic of a future post.

The function in Rust, exposed through PyO3, looks roughly like the sketch below (the module name fib_rust is just a placeholder):
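
    // A minimal sketch of the extension crate; fib_rust is a placeholder module name.
    use pyo3::prelude::*;

    // Naive recursive Fibonacci, deliberately unoptimized so both languages
    // are compared on the same algorithm.
    #[pyfunction]
    fn fibonacci(n: u64) -> u64 {
        if n < 2 {
            n
        } else {
            fibonacci(n - 1) + fibonacci(n - 2)
        }
    }

    // Expose the function to Python as the module fib_rust.
    #[pymodule]
    fn fib_rust(_py: Python, m: &PyModule) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(fibonacci, m)?)?;
        Ok(())
    }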

And the Python code that calls it and times it against a pure-Python version looks roughly like this:
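
    import time

    import fib_rust  # the compiled Rust extension module from above

    # The same naive recursion in pure Python for comparison.
    def fib_py(n):
        if n < 2:
            return n
        return fib_py(n - 1) + fib_py(n - 2)

    n = 35

    start = time.perf_counter()
    fib_rust.fibonacci(n)
    print(f"Rust:   {time.perf_counter() - start:.2f} seconds")

    start = time.perf_counter()
    fib_py(n)
    print(f"Python: {time.perf_counter() - start:.2f} seconds")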

Using the same n=35 as in my JavaScript experiment, I'm seeing about 0.05 seconds for Rust vs. 7.31 seconds for pure Python. The likelihood of wanting or needing this in a Python project seems greater than with JavaScript, and my guess is that the trend continues.

Monday, January 30, 2023

Queueing Simulations

Waiting in lines is something we've all experienced. But how long will the wait for a table really be? How many checkouts should the grocery store have open? Queueing theory gives us the tools to answer questions like these and more. It's also relevant to understanding software under heavy load. I've seen it come up recently in the Amazon Builders' Library article Avoiding Insurmountable Queue Backlogs and in the Addressing Cascading Failures chapter of Site Reliability Engineering.

Since you're reading this blog, I'll assume you're already familiar with queues and the idea of using them in software systems. To summarize, queues can improve availability and throughput at the expense of latency and resource consumption. Here we will see what the field of queueing theory can teach us about how and when to best use them.

The Builders' Library article warns us of the bi-modal behavior of queue-based systems: the behavior can be very different depending on whether or not there is a backlog. Recovery time after an outage can be dramatically increased and time can be wasted doing work that is no longer useful. We then get the first hints of how to manage this. Using more than one queue helps shape traffic. LIFO-ish behavior can be more desirable than FIFO. The SRE book discusses throttling via small queue sizes so requests are rejected when the incoming request rate is too high to be sustained. Traffic patterns, queue sizes, and the number of threads removing work from the queue are all intertwined. Can we better quantify these recommendations, though?

I first learned of queueing theory in The Essential Guide to Queueing Theory e-book from VividCortex and wanted to revisit it in the context of the SRE reading. Right from the beginning we are warned that while queues are linear data structures, queueing systems behave non-linearly. Queueing theory is probabilistic. We won't understand the behavior exactly, but will know about wait times on average or a distribution of wait times.

The first thing they suggest we must understand is that queueing happens even when there is enough capacity to do the work. This is because work arrives to the queue in irregular sizes and at irregular intervals. Queueing gets worse at high utilization, when there is high variability, and when fewer workers are servicing the queues.

Any system can be decomposed into networks of queues and workers. These are the parameters and metrics commonly used to understand these systems:

  • arrival rate to the queue (λ or A) - in a stable system this is the same as throughput
  • average wait time or residence time in the queue (W or Wq)
  • average post-queue service time (S)
  • service rate (μ) - inverse of service time
  • latency or residence time (R) - sum of wait time and service time
  • worker utilization (ρ or U) - 0 to 1
  • average number of requests waiting or in service concurrently (L or Lq)
  • number of workers (M)

Little's Law states that concurrency is arrival rate times residence time: L = λR and Lq = λWq

The Utilization Law states that utilization is throughput times service time: ρ = λS or ρ = λ/μ or with M workers ρ = λS/M
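
A quick back-of-the-envelope check of both laws in Python, with illustrative numbers:

    lam = 1000 / 60   # arrival rate: 1,000 requests per minute, expressed per second
    S = 0.15          # average service time in seconds (made up for illustration)
    M = 3             # number of workers

    rho = lam * S / M   # Utilization Law
    R = 0.35            # suppose we measured an average residence time of 0.35 seconds
    L = lam * R         # Little's Law
    print(f"utilization {rho:.0%}, about {L:.1f} requests in the system at once")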

Kendall's Notation is a shorthand for describing queue systems. Normally we'll be working with M/M/m systems. The first M says events arrive randomly and independently (memoryless) meaning they are generated by a Poisson process. The second M says the service times are exponentially distributed. The last m is the number of workers or M from above. These are safe assumptions unless you know otherwise.

For an M/M/1 system, R = S / (1 - ρ), which is a fast-growing curve (the non-linear behavior mentioned above). Residence time is inversely proportional to idle capacity. By rearranging the equations above we can solve for other parameters as well. For systems beyond M/M/1 it gets more complicated. For M/M/m, for example, we need to bring in Erlang's C formula.
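
Here's a rough sketch of both in Python, again assuming a 0.15-second service time (a service rate of 400 per minute):

    # Erlang C: probability an arriving request has to wait, with m servers and
    # offered load a = lambda / mu, computed via the Erlang B recurrence.
    def erlang_c(m, a):
        b = 1.0
        for k in range(1, m + 1):
            b = a * b / (k + a * b)
        rho = a / m
        return b / (1 - rho + rho * b)

    S = 0.15  # assumed average service time in seconds

    # M/M/1: residence time grows quickly as utilization approaches 1
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"utilization {rho:.2f}: residence time {S / (1 - rho):.2f} sec")

    # M/M/m: average wait in the queue is ErlangC / (m*mu - lambda)
    lam, mu, m = 1000 / 60, 400 / 60, 3   # per-second rates, 3 servers
    wq = erlang_c(m, lam / mu) / (m * mu - lam)
    print(f"M/M/{m}: wait {wq:.2f} sec, residence {wq + S:.2f} sec")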

To do that, and to see these laws in action, it's time to write some code. There appear to be several Python packages for simulating queueing systems. I will use Ciw for this simulation. We'll assume we have 1,000 requests per minute, several servers pulling work from a queue, and need enough capacity such that 2 servers could be offline without the system falling behind.
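
A minimal version of that simulation with Ciw looks something like the sketch below, assuming each server can handle about 400 requests per minute (0.15 seconds per request), which is enough that 3 servers can just keep up:

    import ciw

    ARRIVAL_RATE = 1000   # requests per minute
    SERVICE_RATE = 400    # requests per minute per server (assumed)

    def run(servers, minutes=200):
        network = ciw.create_network(
            arrival_distributions=[ciw.dists.Exponential(ARRIVAL_RATE)],
            service_distributions=[ciw.dists.Exponential(SERVICE_RATE)],
            number_of_servers=[servers],
        )
        ciw.seed(0)
        sim = ciw.Simulation(network)
        sim.simulate_until_max_time(minutes)
        records = sim.get_all_records()
        # residence time = waiting time + service time, converted to seconds
        latency = 60 * sum(r.waiting_time + r.service_time for r in records) / len(records)
        return sim.transitive_nodes[0].server_utilisation, latency

    for servers in (3, 4, 5):
        utilization, latency = run(servers)
        print(f"{servers} servers: {utilization:.0%} utilization, {latency:.2f} sec latency")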


 servers    utilization %    latency (sec)
 3          83               0.35
 4          63               0.18
 5          50               0.15

We can use this simulation to figure out what "levers to pull" to get the behavior we want. It could even become a Linear Programming optimization problem. Based on the arrival rate and service rate, it's easy to see we need at least 3 servers, but not as easy to see how the latency would change going from 3 to 5, or what the minimum queue size is.

While not always intuitive, especially with multiple queues, there are ways to understand how queue-based systems will behave. If you are interested in the topic, I suggest reading the linked resources as they go into much more detail. If you have other examples of queueing theory in action, please share in the comments.