Sunday, February 8, 2015

Eigenvectors and PageRank

Eigenvectors are another topic that I used in a math class or two but never really understood what they were for. As I mentioned in my last post Bipartite Graphs and Google Adwords, I like how the professors of the Coursera class Mining Massive Datasets gave great examples of how to apply some academic concepts to real world problems. Their PageRank material is no different as it explains how eigenvectors are used in the algorithm. Here is my summary of the class' link analysis chapter which I'm sure is greatly simplified from what Google really does, but I think it's interesting nonetheless.

Early search engines used an inverted index which I covered in my tf-idf post. That approach is vulnerable to term spam. People would put invisible text on their pages (same color as the background) to trick web crawlers into thinking that the page was about a topic it wasn't. The PageRank algorithm solves that problem by giving more importance to pages that have many in-links, and especially important in-links. Even if someone creates millions of fake pages that link to their real page, its importance will not be that high because the fake pages have low importance, since no one links to them.

The web can be thought of as a directed graph where web pages are the vertexes and links are the arcs. The basic idea of PageRank is that you simulate web surfers starting at random spots in the web graph and then following random out-links. The most important pages, the ones with many in-links and especially in-links from other important pages, will end up with the most surfers.

To implement this simulation, consider a transition matrix M which has n rows and columns where n is the number of pages crawled. M_ij has the value 1/k if page j has k arcs out and one of them is to page i; otherwise it has the value 0. Start with a vector v0 which has all elements set to 1/n. The first step is to multiply v0 by M to get v1 = Mv0, the second step is to multiply v1 by M to get v2 = M(Mv0), and so on (this is an example of a Markov process, which means you can make predictions about its future state based only on the present state without needing the history).
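Here is a minimal NumPy sketch of that setup; the four-page link graph is completely made up for illustration.

```python
import numpy as np

# A tiny made-up web graph: page -> pages it links to.
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
n = len(links)

# Transition matrix: M[i][j] = 1/k if page j has k out-links and one of
# them points to page i, otherwise 0.
M = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        M[i, j] = 1.0 / len(outs)

v0 = np.full(n, 1.0 / n)   # surfers start at a uniformly random page
v1 = M @ v0                # distribution of surfers after one random step
print(v1)
```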

If the web graph is one strongly connected component and doesn't have any dead ends, then the process converges to a limiting distribution v that satisfies v = Mv. The limit is reached when multiplying the distribution by M another time doesn't change it. Vector v is an eigenvector of M because λv = Mv (here with λ = 1). Vector v is also the principal eigenvector because M is a stochastic matrix, and the eigenvalue associated with the principal eigenvector of a stochastic matrix is 1. In practice it takes 50-75 iterations to get close enough to this limit.
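Continuing the made-up example above, this sketch runs the iteration to (approximate) convergence and checks that the result really is an eigenvector with eigenvalue 1.

```python
import numpy as np

# Same made-up four-page transition matrix as in the sketch above.
M = np.array([[0.0, 0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0]])

v = np.full(4, 0.25)
for _ in range(75):              # 50-75 iterations is enough in practice
    v = M @ v

print(v)                         # the limiting distribution (the PageRanks)
print(np.allclose(M @ v, v))     # v = Mv, so v is an eigenvector with eigenvalue 1
```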

M is far too big for us to solve v = Mv directly with something like Gaussian elimination. M is also very sparse, so it makes sense to store only the non-zero elements. An entire column can be represented by the out-degree of a page and the row numbers of its non-zero elements (since those elements' values are all 1 divided by the out-degree). That means we only need to store a little over 4 bytes per non-zero element. Then we can solve for v using MapReduce, and v tells us the "page rank" of each page.
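I won't reproduce the course's MapReduce formulation here, but this little sketch shows the idea behind the sparse representation: each column is just a page's out-degree plus the row numbers of its non-zero entries, and you can compute Mv without ever building the dense matrix. The graph is the same made-up one as above.

```python
import numpy as np

# Page -> pages it links to.  This adjacency list *is* the sparse
# representation: out-degree plus the non-zero row numbers per column.
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
n = len(links)

def sparse_multiply(links, v):
    """Compute Mv from the adjacency list, without materializing M."""
    result = np.zeros(len(v))
    for j, outs in links.items():
        share = v[j] / len(outs)   # each out-link receives 1/out-degree of v[j]
        for i in outs:
            result[i] += share
    return result

v = np.full(n, 1.0 / n)
for _ in range(75):
    v = sparse_multiply(links, v)
print(v)
```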

In reality the web is not strongly connected. There are dead ends, and there are spider traps (groups of pages whose out-links all stay within the group) in the graph. This problem is solved by the concept of taxation: each surfer has a small probability of teleporting to a random page instead of following an out-link. There are also other algorithm variations for dealing with link spam. For example, besides the page content, you also consider the link text or words near the link, so you are getting other people's take on what the page is about instead of relying solely on the page owner.
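The taxed iteration from the course is v' = βMv + (1 - β)e/n, where e is a vector of all ones and β is typically around 0.8-0.85. Here is the same made-up example with taxation added; β = 0.85 is just a commonly used value, not anything the course mandates.

```python
import numpy as np

# Made-up four-page transition matrix from the earlier sketches.
M = np.array([[0.0, 0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0]])
n = M.shape[0]
beta = 0.85                      # probability of following a link (assumed value)

v = np.full(n, 1.0 / n)
for _ in range(75):
    # With probability beta follow a random out-link, otherwise teleport
    # to a page chosen uniformly at random.
    v = beta * (M @ v) + (1 - beta) / n
print(v)
```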

UPDATE:

This same idea can be applied to sports ratings where links are replaced by something like goals scored. My Machine Learning for NCAA Basketball Prediction - Performance Edition post has more details and some code.

Bipartite Graphs and Google Adwords

Continuing my series of posts relating to the Coursera Mining Massive Datasets class, this post summarizes the chapter on Google AdWords. I liked how the professors showed it to be related to matching in a bipartite graph--something that I learned about in a university graph theory class but without such a practical context as web advertising. I think that happens too often, at least in my experience: I've taken a lot of math classes (and done well in them), but I still don't feel like I had much practice applying what I learned to real-world problems.

[Figure: a simple bipartite graph]

To be clear, AdWords is for advertisers and AdSense is for website owners. Google provides an explanation in The difference between AdWords and AdSense. Today I'm focusing on how AdWords works.

The "adwords" model matches web searches with advertisements. Advertisers bid to be shown in response to certain search queries. If a user clicks on one of the ads shown to them then the advertiser will pay. Of course, the most relevant ads are clicked more often than less attractive ads, so the challenge is displaying the ads with the highest bids that are most likely to be clicked, while staying within an advertisers budget. This is a problem that traditional newspapers and magazines don't really have. They can only target specific niches of people (like people who buy Golf Digest), but web advertisers can target individuals (like people searching for a specific brand of golf clubs).

Google knows the click-through rate, or the percentage of the time an ad is clicked when it is displayed. It knows how much of an advertiser's budget has been spent. Google also knows what people have searched for in the past, but it doesn't know for sure what people are going to search for in the future. The "adwords" problem therefore has to be solved online with a greedy algorithm, because all you can do is make the best choice for each search as it arrives and hope it results in the best overall outcome.

A simplified version of this "adwords" problem can be modeled as maximal matching in bipartite graphs. A matching is a subset of edges where no vertex is an end of two or more edges. A perfect matching contains every vertex, and a maximal matching is the largest possible matching for the graph. In this case one side of the graph is search queries, the other side is ads, and an edge means the ad could be shown in response to that query. The maximal matching is the best way to display the ads. The simplifying assumptions are that one ad is shown per query, all advertisers have the same budget, click-through rates are the same, and bids are either 0 or 1.
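As a toy illustration (the query and ad names are made up), a matching in this model just means that no query and no ad appears twice:

```python
# A tiny made-up instance of the simplified model: queries on one side,
# ads on the other, and an edge wherever an ad could be shown for a query.
edges = [("q1", "adA"), ("q1", "adB"), ("q2", "adB"), ("q3", "adC")]

# One possible matching: a subset of the edges in which no vertex repeats.
matching = [("q1", "adA"), ("q2", "adB"), ("q3", "adC")]

endpoints = [v for edge in matching for v in edge]
assert len(endpoints) == len(set(endpoints))   # no vertex is the end of two edges
```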

The obvious greedy algorithm for matching considers edges in the order they are given. An edge is added to the matching if neither of its ends is connected to an edge already in the matching. The competitive ratio is the worst-case ratio of what an online algorithm achieves to what the best offline algorithm achieves. The offline solution is the optimal solution because all information about the problem is known in advance. For this greedy matching algorithm the ratio is only 0.5, that is, it will always find at least half as many matches as the optimum.
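Here is a quick sketch of that greedy algorithm, with a deliberately unlucky (made-up) edge order that shows why the ratio can be as bad as 1/2:

```python
def greedy_matching(edges):
    """Take each edge in the order given if neither endpoint is already matched."""
    matched = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

# Greedy grabs (q1, adA) first and then can't match q2 or use adB,
# so it finds 1 match where the optimum ((q1, adB), (q2, adA)) has 2.
edges = [("q1", "adA"), ("q1", "adB"), ("q2", "adA")]
print(greedy_matching(edges))   # [('q1', 'adA')]
```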

It is possible to do better. The more realistic BALANCE algorithm assigns each query to the eligible bidder with the largest remaining budget (a generalized version also weighs in the size of the bid). Doing so gives a competitive ratio of 1 - 1/e, about 0.63, which is the best possible for any online algorithm for this problem.
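Here is a sketch of the simplified BALANCE algorithm (all bids equal to 1, made-up advertiser and query names). On this input greedy can end up with revenue 2, BALANCE gets 3, and the optimum is 4:

```python
def balance(queries, bids, budgets):
    """Simplified BALANCE: send each arriving query to the bidder who bid on it
    and has the most budget left (all bids are 1 in this simplified model)."""
    remaining = dict(budgets)
    assignment = []
    for q in queries:
        eligible = [a for a in bids.get(q, []) if remaining[a] >= 1]
        if not eligible:
            assignment.append((q, None))     # no one left to show an ad for q
            continue
        winner = max(eligible, key=lambda a: remaining[a])
        remaining[winner] -= 1
        assignment.append((q, winner))
    return assignment

# Advertiser A bids only on x, B bids on x and y; each has a budget of 2.
bids = {"x": ["A", "B"], "y": ["B"]}
budgets = {"A": 2, "B": 2}
print(balance(["x", "x", "y", "y"], bids, budgets))
```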

UPDATE:

Bipartite matching implementations can be found in NetworkX and SciPy.
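For example, here is roughly how the NetworkX version can be used (the graph is made up, and the API details may have changed since this was written):

```python
import networkx as nx
from networkx.algorithms import bipartite

# Made-up bipartite graph: queries on one side, ads on the other.
G = nx.Graph()
queries = ["q1", "q2", "q3"]
ads = ["adA", "adB", "adC"]
G.add_nodes_from(queries, bipartite=0)
G.add_nodes_from(ads, bipartite=1)
G.add_edges_from([("q1", "adA"), ("q1", "adB"), ("q2", "adB"), ("q3", "adC")])

# Hopcroft-Karp maximum matching; the result maps each matched vertex
# to its partner (each pair shows up in both directions).
matching = bipartite.maximum_matching(G, top_nodes=queries)
print(matching)
```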