Sunday, August 13, 2017

Python 3 asyncio

A while back I wrote a few posts about asynchronous programming, so when I learned that Python 3.5 had added the async and await syntax I knew I had to check out asyncio to see how it compares. asyncio describes itself as "infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives".

Coroutines in Python are similar to generators, but a coroutine (a function defined with async def) can control where execution continues after it yields control (await takes the place of yield). You can await another coroutine or a future. The coroutine approach can replace callbacks.
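
Here's a minimal sketch of the idea (my own toy example):

import asyncio

async def compute():
    # await hands control back to the event loop; execution
    # resumes on the next line once the sleep finishes.
    await asyncio.sleep(1)
    return 42

loop = asyncio.get_event_loop()
print(loop.run_until_complete(compute()))  # prints 42
loop.close()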

If we check the time it takes for the program to run, it's 3 seconds (the longest slow operation) and not 6 seconds (the sum of all the slow operations).
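
The program itself might look something like this minimal reconstruction of mine, with asyncio.sleep standing in for the slow operations:

import asyncio
import time

async def slow_operation(n):
    await asyncio.sleep(n)
    print('Slow operation sleep {} complete'.format(n))

async def main():
    # gather schedules all three coroutines concurrently, so the
    # total time is the longest sleep, not the sum of the sleeps.
    await asyncio.gather(slow_operation(1), slow_operation(2), slow_operation(3))

start = time.time()
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
print('Completed in {:.2f} seconds'.format(time.time() - start))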

Slow operation sleep 1 complete
Slow operation sleep 2 complete
Slow operation sleep 3 complete
Completed in 3.02 seconds

It's similar if we want to get results from each task. We're even able to get them as they are available. 
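
Again a sketch rather than the exact code, this time using asyncio.as_completed to consume results in completion order:

import asyncio
import time

async def slow_operation(n):
    await asyncio.sleep(n)
    print('Slow operation sleep {} complete'.format(n))
    return n

async def main():
    coros = [slow_operation(n) for n in (1, 2, 3)]
    # as_completed yields a future for each task as soon as it
    # finishes, so we can use results before the slowest task is done.
    for future in asyncio.as_completed(coros):
        print('Got result {}'.format(await future))

start = time.time()
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
print('Completed in {:.2f} seconds'.format(time.time() - start))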

Slow operation sleep 1 complete
Got result 1
Slow operation sleep 2 complete
Got result 2
Slow operation sleep 3 complete
Got result 3
Completed in 3.00 seconds

With Python's GIL, you've never really been able to run multiple threads in parallel; you've had to run code in multiple processes to leverage multiple CPU cores. With asyncio we can at least make IO-bound tasks in a single process execute faster, because the event loop switches to another task whenever the current one is waiting on I/O.

For CPU-bound code you still have to use multiple processes to parallelize it. Python's parallel APIs limit your ability to use results from these tasks as they become available, as we did in the example above: waiting for a process to finish and getting a future's result both block. asyncio gives us a way to unify the concurrent and parallel APIs.
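
A sketch of that unification, assuming the CPU-bound work is a big sum handed to a ProcessPoolExecutor through run_in_executor:

import asyncio
import time
from concurrent.futures import ProcessPoolExecutor

def slow_sum(n):
    # CPU-bound work runs in a separate process, outside the
    # main process's GIL.
    result = sum(range(n))
    print('Slow operation sum {} complete'.format(n))
    return result

async def main(loop, executor):
    futures = [loop.run_in_executor(executor, slow_sum, n)
               for n in (10000000, 20000000, 30000000)]
    # run_in_executor returns asyncio futures, so as_completed
    # works exactly as it did for the IO-bound tasks.
    for future in asyncio.as_completed(futures):
        print('Got result {}'.format(await future))

if __name__ == '__main__':
    start = time.time()
    loop = asyncio.get_event_loop()
    with ProcessPoolExecutor() as executor:
        loop.run_until_complete(main(loop, executor))
    loop.close()
    print('Completed in {:.2f} seconds'.format(time.time() - start))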

Slow operation sum 10000000 complete
Got result 49999995000000
Slow operation sum 20000000 complete
Got result 199999990000000
Slow operation sum 30000000 complete
Got result 449999985000000
Completed in 2.91 seconds

There is some overhead in creating the processes, but a single 3-second task is only slightly faster than the 1-, 2-, and 3-second tasks run in parallel.

Slow operation sum 30000000 complete
Got result 449999985000000
Completed in 2.66 seconds

This hints at another use. Given that asyncio uses a single thread, if there are too many IO-bound tasks or if any of them consumes too much CPU, we can overwhelm it. For such a situation it is possible to create multiple processes, each with its own asyncio event loop.
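
A rough sketch of that pattern (mixed_task and the batch sizes are my own stand-ins for the simulated workload):

import asyncio
import time
from concurrent.futures import ProcessPoolExecutor

def cpu_part():
    return sum(range(10000000))     # simulated CPU-bound work

async def mixed_task():
    await asyncio.sleep(1)          # simulated IO-bound work
    return cpu_part()

def run_batch(num_tasks):
    # Each worker process builds and drives its own event loop.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    results = loop.run_until_complete(
        asyncio.gather(*[mixed_task() for _ in range(num_tasks)]))
    loop.close()
    return results

if __name__ == '__main__':
    processes, tasks_per_process = 4, 15
    start = time.time()
    with ProcessPoolExecutor(max_workers=processes) as executor:
        results = [r for batch in
                   executor.map(run_batch, [tasks_per_process] * processes)
                   for r in batch]
    print('Got {} results in {:.2f} seconds'.format(
        len(results), time.time() - start))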

1 process and 3 tasks:
Got 3 results in 6.20 seconds

1 process and 15 tasks:
Got 15 results in 23.94 seconds

The simulated mix of IO-bound and CPU-bound code is interesting. It's slower than the CPU-bound code on its own, but faster than the IO-bound and CPU-bound times added together. We see some asyncio benefit up to around 15 tasks, and then it levels off.

1 process and 30 tasks:
Got 30 results in 46.58 seconds

4 processes and 15*4 tasks:
Got 60 results in 34.11 seconds

8 processes and 15*8 tasks:
Got 120 results in 52.81 seconds

We similarly see some benefit from parallelizing the code up to the number of CPU cores I have; beyond that, the time increases linearly, as we would expect.

16 processes and 15*16 tasks:
Got 240 results in 105.98 seconds

If you are looking for more details, the comprehensive Python Concurrency series of articles has additional examples like these and goes into more detail on some of the underlying concepts.


Thursday, February 23, 2017

Vectorization and Eigenvectors: Sports Rating Examples

While calculating team ratings for a machine learning-based March Madness prediction, I ran into a couple of situations where my code got slow as I expanded it to include all the teams over all the seasons. By slow I mean several hours, and that was longer than I was willing to wait. I needed to be able to recalculate ratings on-demand, in a few minutes at most.

For my offensive-defensive rating, I started with an implementation similar to the one in Offensive and defensive team ratings for the Premier League 2014-2015:
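
I can't reproduce that post's code exactly, but the loop-based iteration was shaped roughly like this (the convention that A[i, j] holds the score team j put up against team i is my assumption):

import numpy as np

def rate_teams_loops(A, iterations=10):
    n = A.shape[0]
    offense = np.ones(n)
    defense = np.ones(n)
    for _ in range(iterations):
        # explicit loops over every row and column, every iteration
        for j in range(n):
            offense[j] = sum(A[i, j] / defense[i] for i in range(n))
        for i in range(n):
            defense[i] = sum(A[i, j] / offense[j] for j in range(n))
    return offense, defense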

It's looping over all the rows and columns in the matrix many times. With college basketball having over three hundred teams, this wasn't going to work for me. I figured out that it could be vectorized and that numpy could handle it more efficiently than I could:
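
Something along these lines, with each pass over the rows and columns collapsed into a matrix-vector product:

import numpy as np

def rate_teams_vectorized(A, iterations=10):
    # Same updates as the loop version: offense is a column sum
    # weighted by 1/defense, defense a row sum weighted by 1/offense.
    n = A.shape[0]
    offense = np.ones(n)
    defense = np.ones(n)
    for _ in range(iterations):
        offense = A.T.dot(1.0 / defense)
        defense = A.dot(1.0 / offense)
    return offense, defense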

Not only is it faster, but I would argue the more concise code is easier to understand as well.

For my Markov Chain ranking, I started with an implementation similar to the one in A Markov Chain ranking of Premier League teams (14/15 season):
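
Again a rough reconstruction rather than the exact code: a long random walk over a row-stochastic transition matrix T (my assumed convention: T[i, j] is the probability of stepping from team i to team j), counting visits to estimate the stationary distribution:

import numpy as np

def rank_random_walk(T, steps=100000):
    n = T.shape[0]
    visits = np.zeros(n)
    state = 0
    for _ in range(steps):
        # step to the next team according to the current row of T
        state = np.random.choice(n, p=T[state])
        visits[state] += 1
    return visits / steps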

I don't mean to disparage these two posts in any way. They are awesome and really helped me. I just had a different situation and had to worry about performance, and here I figured out that I could get rid of both loops:
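
A sketch of the eigenvector version, for the same row-stochastic T:

import numpy as np

def rank_eigenvector(T):
    # The stationary distribution pi satisfies pi T = pi, so it is
    # the left eigenvector of T for the largest eigenvalue (1).
    values, vectors = np.linalg.eig(T.T)
    stationary = vectors[:, np.argmax(values.real)].real
    return stationary / stationary.sum()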

Eigenvectors to the rescue. The eigenvector for the largest eigenvalue is the stationary distribution we are trying to find, so we don't need to do the 100,000-step random walk.