Thursday, February 23, 2017

Vectorization and Eigenvectors: Sports Rating Examples

While calculating team ratings for a machine learning-based March Madness prediction, I ran into a couple of situations where my code got slow as I expanded it to include all the teams over all the seasons. By slow I mean several hours, which was longer than I was willing to wait. I needed to be able to recalculate ratings on demand, in a few minutes at most.

For my offensive-defensive rating, I started with an implementation similar to the one in "Offensive and defensive team ratings for the Premier League 2014-2015":

That implementation loops over all the rows and columns in the matrix many times. With college basketball having over three hundred teams, this wasn't going to work for me. I figured out that it could be vectorized and that numpy could handle it more efficiently than I could:
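As a rough illustration (a minimal sketch, not the exact code from this post), the nested loops of an offense-defense iteration can be collapsed into two matrix-vector products per pass. The function name `offense_defense_ratings`, the layout `scores[i, j]` = points team j scored against team i, and the stopping rule are all assumptions made for the example. It also assumes every team has scored at least once, so no division by zero occurs.

```python
import numpy as np

def offense_defense_ratings(scores, n_iter=100, tol=1e-10):
    """Iterate offense/defense ratings with matrix-vector products.

    Assumed layout: scores[i, j] = points team j scored against team i.
    A higher offense rating is better; a lower defense rating is better.
    """
    A = np.asarray(scores, dtype=float)
    d = np.ones(A.shape[0])              # start with every defense rated 1
    for _ in range(n_iter):
        # o[j] = sum_i A[i, j] / d[i], for all teams at once
        o = A.T @ (1.0 / d)
        # d[i] = sum_j A[i, j] / o[j], for all teams at once
        d_new = A @ (1.0 / o)
        converged = np.allclose(d_new, d, rtol=tol, atol=0.0)
        d = d_new
        if converged:
            break
    return o, d

# Made-up scores for three hypothetical teams (diagonal is zero: no self-play).
example_scores = np.array([[0, 70, 80],
                           [65, 0, 75],
                           [60, 85, 0]])
offense, defense = offense_defense_ratings(example_scores)
```

Each pass does the same arithmetic as looping over every row and column, but the work happens inside numpy instead of the Python interpreter.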

Not only is it faster, but I would argue the more concise code is easier to understand as well.

For my Markov Chain ranking, I started with an implementation similar to the one in "A Markov Chain ranking of Premier League teams (14/15 season)":

I don't mean to disparage these two posts in any way. They are awesome and really helped me. I just had a different situation and had to worry about performance. Here I figured out that I could get rid of both loops:

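Eigenvectors to the rescue. The stationary distribution we are trying to find is the eigenvector for the largest eigenvalue of the transition matrix (that eigenvalue is 1 for a stochastic matrix), rescaled so its entries sum to 1. For a row-stochastic transition matrix that means the left eigenvector, i.e. an eigenvector of the transpose. With it in hand, we don't need to do the 100,000-step random walk.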
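A minimal sketch of the eigenvector approach, assuming a row-stochastic transition matrix `P` where `P[i, j]` is the probability of moving from team i to team j and the chain is irreducible, so the stationary distribution is unique. The function name `stationary_distribution` and the tiny two-state matrix are made up for illustration; this is not necessarily how the original code was written.

```python
import numpy as np

def stationary_distribution(P):
    """Return the stationary distribution of a row-stochastic matrix P."""
    P = np.asarray(P, dtype=float)
    # Left eigenvectors of P are ordinary (right) eigenvectors of P.T.
    eigenvalues, eigenvectors = np.linalg.eig(P.T)
    # For a stochastic matrix the largest eigenvalue is 1.
    idx = np.argmax(eigenvalues.real)
    v = eigenvectors[:, idx].real
    # Rescale so the entries sum to 1; this also fixes an arbitrary sign.
    return v / v.sum()

# Hypothetical two-state chain; its stationary distribution is (5/6, 1/6).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_distribution(P)
```

A quick sanity check on any result is that `pi @ P` gives back `pi` (up to floating-point error), which is exactly what "stationary" means.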