Monday, December 28, 2015

Reactive Actors

I've been meaning to revisit reactive programming and the actor model for a while now. I first learned about them in the Principles of Reactive Programming Coursera class and then actors came up again in the Seven Concurrency Models in Seven Weeks book. The Scala I picked up is quickly being forgotten and I haven't done a post with code in a while, so here I'll get back into that and create a simple application using Akka and RxScala.

Actor model

The developerWorks article JVM Concurrency: Acting asynchronously with Akka gives a good introduction to the actor model:
The actor model for concurrent computations builds up systems based on primitives called actors. Actors take actions in response to inputs called messages. Actions can include changing the actor's own internal state as well as sending off other messages and even creating other actors. All messages are delivered asynchronously, thereby decoupling message senders from receivers. Because of this decoupling, actor systems are inherently concurrent: Any actors that have input messages available can be executed in parallel, without restriction.
Then JVM Concurrency: Building actor applications with Akka goes on to explain the advantages of this approach:
If you compose your actors and messages correctly, you end up with a system in which most things happen asynchronously. Asynchronous operation is harder to understand than a linear approach, but it pays off in scalability. Highly asynchronous programs are better able to use increased system resources (for example, memory and processors) either to accomplish a particular task more quickly or to handle more instances of the task in parallel. With Akka, you can even extend this scalability across multiple systems, by using remoting to work with distributed actors.
At first the actor model may sound the same as what I described in my Communicating Sequential Processes post because both involve message passing, but the two concurrency models have several differences:
  • Actors have identities while CSPs are anonymous
  • Actors transmit messages to named actors while CSPs transmit messages using channels
  • Actors transmit messages asynchronously while CSPs can't transmit a message until the receiver is ready to receive it
My impression, and I could be wrong, is that actors more naturally extend beyond a single machine to a distributed system since the sending and receiving of messages is decoupled. A quick search does turn up distributed channels in pycsp, though, so it seems that both can be distributed.
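
To make the mailbox idea concrete, here is a minimal, library-free sketch of an actor in Python (not Akka; the thread-per-actor design and the names are my own choices):

import queue
import threading

class Actor:
    # A named mailbox plus a thread that handles one message at a time.
    def __init__(self, name, handler):
        self.name = name
        self._mailbox = queue.Queue()   # senders never block: delivery is asynchronous
        self._handler = handler
        self._state = {}                # private state, touched only by the actor's own thread
        threading.Thread(target=self._run, daemon=True).start()

    def tell(self, message):
        # Fire-and-forget send, decoupled from when the actor processes the message.
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()
            self._handler(self._state, message)

def counting_handler(state, message):
    state["count"] = state.get("count", 0) + 1
    print("%s (message #%d)" % (message, state["count"]))

if __name__ == "__main__":
    counter = Actor("counter", counting_handler)
    counter.tell("hello")             # messages go to a named actor, not an anonymous channel
    counter.tell("world")
    import time; time.sleep(0.1)      # give the daemon thread time to drain the mailbox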

Reactive applications

The Reactive Manifesto details four qualities of reactive applications:
  • responsive - the system responds in a timely manner if at all possible
  • resilient - the system stays responsive in the face of failure
  • elastic - the system stays responsive under varying workloads
  • message driven - the system relies on asynchronous message passing between components
Where Akka describes itself as a toolkit and runtime, RxScala only claims to be a library for composing asynchronous and event-based programs using observable sequences. To me it's not clear how it helps us achieve all four qualities (or if it even intends to).  Nevertheless, the ReactiveX introduction explains their advantages:
The ReactiveX Observable model allows you to treat streams of asynchronous events with the same sort of simple, composable operations that you use for collections of data items like arrays. It frees you from tangled webs of callbacks, and thereby makes your code more readable and less prone to bugs.
This means that the methods returning Observables can be implemented using thread pools, non-blocking I/O, actors, or anything else. This is how ReactiveX and Akka will be used together: actors are the concurrency implementation for services communicating with asynchronous messages.
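
To make that concrete without RxScala, here is a toy push-based Observable in Python; it is only a sketch of the subscribe/map composition the quote describes, not the library's API or the example from this post:

class Observable:
    # A toy push-based stream: just enough to show composition.
    def __init__(self, subscribe_fn):
        self._subscribe_fn = subscribe_fn

    def subscribe(self, on_next, on_error=lambda e: None, on_completed=lambda: None):
        self._subscribe_fn(on_next, on_error, on_completed)

    def map(self, fn):
        # Composition: wrap the upstream subscribe and transform each value on the way through.
        def subscribe_fn(on_next, on_error, on_completed):
            self.subscribe(lambda value: on_next(fn(value)), on_error, on_completed)
        return Observable(subscribe_fn)

def from_iterable(values):
    def subscribe_fn(on_next, on_error, on_completed):
        for value in values:
            on_next(value)
        on_completed()
    return Observable(subscribe_fn)

if __name__ == "__main__":
    from_iterable([3, 4]).map(lambda x: x * 2).subscribe(print)  # prints 6 then 8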

Combining Akka and RxScala

I came up with the following short example. First I wrote a couple of methods returning an Observable to get a feel for it, then added the stockQuote() method which also uses an actor in its implementation:


Running it produces the expected output, something like:

6
8
broken service
GOOG: 253.22

I can really see the potential in the Observable model, especially after reading more about it at The Netflix Tech Blog. If you were already using actors, maybe combining them like this could make sense. I also need to check out Akka Streams, which seems like a similar idea.

UPDATE

Ray is a framework for parallelizing ML workloads. It uses the actor model as a way of coordinating work and maintaining state.

Friday, September 4, 2015

Bitwise operations 101

Whenever I see lists of interview questions it seems like bitwise operations are on there. I've never used them at work, and I don't even remember needing them in school. I wanted to see if I could answer a few of these questions. After a quick review of what each operation does, this is what I came up with.

The first three methods--nthBitSet, countBits, and isPalindrome--are fairly intuitive and I feel like they could be reasonable interview questions.
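
The original snippet isn't reproduced here, but a sketch of what those three methods might look like in Python (the names mirror the ones above; the fixed bit width for the palindrome check is my own assumption):

def nth_bit_set(x, n):
    # True if bit n (counting from the least significant bit) of x is 1.
    return (x >> n) & 1 == 1

def count_bits(x):
    # Count the 1 bits in a non-negative integer.
    count = 0
    while x:
        count += x & 1
        x >>= 1
    return count

def is_palindrome(x, width=8):
    # Do the lowest `width` bits of x read the same forwards and backwards?
    lo, hi = 0, width - 1
    while lo < hi:
        if ((x >> lo) & 1) != ((x >> hi) & 1):
            return False
        lo += 1
        hi -= 1
    return True

assert nth_bit_set(0b1010, 1) and not nth_bit_set(0b1010, 0)
assert count_bits(0b1011) == 3
assert is_palindrome(0b10000001) and not is_palindrome(0b10000011)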

For adding and multiplying I cheated and looked up the algorithms. They sort of make sense, like the repeated additions for multiplication, but it would have taken a long time to come up with that on my own.
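
For reference, the textbook versions of those two algorithms look roughly like this in Python (non-negative integers only; handling negatives would need a fixed-width mask):

def add(a, b):
    # XOR adds without carrying; AND then shift produces the carries; repeat until no carry is left.
    while b:
        carry = (a & b) << 1
        a = a ^ b
        b = carry
    return a

def multiply(a, b):
    # Shift-and-add: for every set bit in b, add the correspondingly shifted a.
    result = 0
    while b:
        if b & 1:
            result = add(result, a)
        a <<= 1
        b >>= 1
    return result

assert add(19, 23) == 42
assert multiply(6, 7) == 42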

Thursday, August 27, 2015

30 ideas sort of related to NLP

Over the past year or so, as I was trying to learn more about machine learning, one related topic I haven't gotten to is natural language processing (NLP). I've also had Matthew Russell's Mining the Social Web sitting unread on my bookshelf for a while. Even though it's a bit outdated at this point with references to Google Buzz (it looks like there is an updated edition available, though), I think it will be good for picking up some NLP basics. It's been described as a successor to Collective Intelligence, which I thought was a fantastic book, so I've really been looking forward to having the time to finally get through it. This post is going to be notes on what I learn as I learn it.
  • Even though lexical diversity (unique tokens / total number of tokens) and term frequency distributions are simple, they are still important and useful to start with (a few of these measures are sketched in code after this list)
  • The Natural Language Toolkit (NLTK) is a popular Python module for NLP
  • Microformats and HTML 5's microdata are ways of decorating markup to expose structured information
  • CouchDB can be used to build up indexes on data and perform frequency analysis through MapReduce operations
  • Add Lucene to enable full-text searching of CouchDB documents
  • I've known Redis as a key-value store or cache, but it's also known as a data structure server because it can contain lists, sets, hashes, etc.
  • When analyzing a graph (like Twitter followers), a graph database can help by providing common operations like clique detection or breadth-first search
  • There are many visualization tools besides matplotlib and Graphviz available from Python like Ubigraph, Protovis, and SIMILE Timeline
  • Edit distance (aka Levenshtein distance) is a measure of how many changes it would take to convert one string to another 
  • n-gram similarity is a measure of common n-grams between samples
  • Jaccard index measures the similarity of two sets (|A ∩ B| / |A ∪ B|)
  • Calculating the distance between every pair of items when clustering a large n can be intractable (I think the book could have gone into more detail here and mentioned an alternative approach like what I wrote about at Locality Sensitive Hashing), but k-means clustering at O(kn) can approximate it well
  • Two visualizations I recognized but didn't know by name: Dorling Cartograms and dendrograms
  • New (to me) visualization for trees: radial trees and sunburst visualizations
  • Natural language frequency analysis follows Zipf's Law (a power law and long tail distribution) meaning a word's frequency is inversely proportional to its rank in the frequency table 
  • TF-IDF is one of the fundamental information retrieval techniques for retrieving documents from a corpus (I wrote about it at tf-idf)
  • A common way to find similar documents is cosine similarity where the vectors are TF-IDF weights
  • Document similarities can be visualized with arc and matrix diagrams
  • Much information is gained when you can look at multiple tokens at a time, like bi-grams (2-grams)
  • Collocations are sequences of words that occur together often
  • Contingency tables are data structures for expressing frequencies associated with the terms of a bi-gram
  • Dice's coefficient, likelihood ratio, chi-square, and Student's t-score, in addition to Jaccard index, are all statistical approaches that can be used for discovering collocations
  • Stemming and lemmatization
  • Stop-words
  • A typical NLTK NLP pipeline is:
    • end of sentence (EOS) detection
    • tokenization
    • part-of-speech tagging
    • chunking - assembling compound tokens for logical concepts
    • extraction - tagging chunks as named entities
  • Selecting sentences that contain frequently occurring words appearing near one another is a basic way to summarize documents
  • Extracting entities from documents can address some of the shortcomings of bag-of-words approaches like TF-IDF (such as homographs and different capitalizations), which n-grams don't completely solve
  • Use the F1 score (the harmonic mean of precision and recall) to measure performance against manually tagged documents
  • Facebook's Open Graph Protocol enables you to turn any web page into a social graph by injecting RDFa metadata into the page
  • The semantic web, if realized through standards like RDF and OWL, would be a domain-agnostic way to enable machines to understand and use web information
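
A few of the similarity measures above are only a handful of lines each. A pure-Python sketch of lexical diversity, the Jaccard index, and cosine similarity (over raw term counts here; in practice you would use TF-IDF weights and a library like NLTK or scikit-learn):

from collections import Counter
from math import sqrt

def lexical_diversity(tokens):
    # Unique tokens divided by total number of tokens.
    return len(set(tokens)) / len(tokens)

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| for two sets of tokens.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine_similarity(tokens_a, tokens_b):
    # Cosine of the angle between two term-frequency vectors.
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in set(va) | set(vb))
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick red fox jumps over the sleeping cat".split()
print(lexical_diversity(doc1))        # 8 unique tokens / 9 tokens
print(jaccard(doc1, doc2))            # 5 shared terms / 11 total terms
print(cosine_similarity(doc1, doc2))  # roughly 0.73 for these two sentences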

Tuesday, May 26, 2015

Communicating Sequential Processes: Goroutines and Channels

This post, like Software Transactional Memory: Dining Philosophers, is motivated by the book Seven Concurrency Models in Seven Weeks. One of the concurrency models discussed is communicating sequential processes (CSP). Even though it was completely new to me, the idea has been around for several decades. I've wanted to check out Go for a while now and since it's a language whose design was influenced by CSP, it seems like a good time to write a Go program.

For concurrent programming, Go encourages passing shared values around on channels instead of sharing memory; a value is only accessible to one goroutine at a time. Unbuffered channels combine communication with synchronization, and buffered channels can be used like a semaphore. A goroutine is a function executing concurrently with other goroutines in the same address space. Goroutines are multiplexed onto multiple OS threads, so a blocking goroutine doesn't hold up the others.

In addition to the Seven Concurrency Models in Seven Weeks book, I think the Clojure core.async Channels blog post sums up the motivation for channels well. Basically, they are an alternative both to using queues to communicate between different components and to using events/callbacks: you avoid thread overhead and callback hell.

I wrote a short program for a vending machine where money deposits and soda dispenses are values passed on channels. It was more difficult than expected to think this way, but the two goroutines only communicate over channels which was the intent.
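
The Go program itself isn't shown above. As a rough approximation of the same shape, here's a Python sketch where the machine and the customer communicate only through queues (queue.Queue acts like a buffered channel and lacks the rendezvous behavior of Go's unbuffered channels; the price and coin values are made up):

import queue
import threading

SODA_PRICE = 75  # cents

def vending_machine(coins, sodas):
    # Reads deposits from the coins queue and pushes drinks onto the sodas queue.
    balance = 0
    while True:
        coin = coins.get()
        if coin is None:          # sentinel value: shut the machine down
            return
        balance += coin
        if balance >= SODA_PRICE:
            balance -= SODA_PRICE
            sodas.put("soda")

if __name__ == "__main__":
    coins, sodas = queue.Queue(), queue.Queue()
    threading.Thread(target=vending_machine, args=(coins, sodas), daemon=True).start()

    for coin in (25, 25, 25):     # the "customer" only ever touches the two queues
        coins.put(coin)
    print(sodas.get())            # -> soda
    coins.put(None)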

Sunday, April 26, 2015

Understanding Docker: Union File Systems

Earlier this week I took a Docker class at work. Having used Docker before I already knew some of the material, but that was a good thing because it gave me more of a chance to understand how Docker works. In this post I want to write about the idea of a union file system and how that relates to Docker images.

An image is made up of layers. Each layer is based on one line in a Dockerfile or one commit of a changed container. This means that doing multiple commands joined by "&&" on one Dockerfile line is a way of collapsing multiple layers into one. Images are also immutable. Changing an image results in a new layer, not a changed image. That new layer only contains the change, though, and not a whole new image.

To combine layers into an image, Docker uses union file systems. The file systems from each layer are overlaid into one file system. The layers making up an image can modify the same files, in which case later layers take priority over earlier ones.

The layers of the image are read-only, but Docker adds a read-write layer on top when running a container. The image layers never change; when a container modifies a file, that file is first copied up into the read-write layer, which is known as copy-on-write. I didn't understand this when first using Docker, but it means that multiple running containers from the same image share the same underlying file system until one of them changes a file.

Sunday, February 8, 2015

Eigenvectors and PageRank

Eigenvectors are another topic that I used in a math class or two but never really understood what they were for. As I mentioned in my last post Bipartite Graphs and Google Adwords, I like how the professors of the Coursera class Mining Massive Datasets gave great examples of how to apply some academic concepts to real world problems. Their PageRank material is no different as it explains how eigenvectors are used in the algorithm. Here is my summary of the class' link analysis chapter which I'm sure is greatly simplified from what Google really does, but I think it's interesting nonetheless.

Early search engines used an inverted index, which I covered in my tf-idf post. That approach is vulnerable to term spam: people would put invisible text on their pages (the same color as the background) to trick web crawlers into thinking that a page was about a topic it wasn't. The PageRank algorithm solves that problem by giving more importance to pages that have many in-links, and especially important in-links. Even if someone creates millions of fake pages that link to their real page, its importance will not be that high because the fake pages themselves have low importance, since no one links to them.

The web can be thought of as a directed graph where web pages are the vertices and links are the arcs. The basic idea of PageRank is that you simulate web surfers starting at random spots in the web graph and then following random out-links. The most important pages, the pages with the most in-links, will end up with the most surfers.

To implement this simulation, consider a transition matrix M which has n rows and columns, where n is the number of pages crawled. M_ij has the value 1/k if page j has k arcs out of it and one of them is to page i; otherwise it is 0. Start with a vector v_0 which has all elements set to 1/n. The first step is to multiply v_0 by M, the second step is to multiply the result by M again (so v_2 = M^2 v_0), and so on. (This is an example of a Markov process, which means you can make predictions about its future state based only on the present state, without needing the history.)

If the web graph is one strongly connected component and doesn't have any dead ends, then the limiting distribution v satisfies v = Mv. The limit is reached when multiplying the distribution by M one more time doesn't change it. Vector v is an eigenvector of M because Mv = λv with λ = 1. Vector v is also the principal eigenvector, because M is a stochastic matrix and the eigenvalue associated with the principal eigenvector of a stochastic matrix is 1. In practice it takes 50-75 iterations to reach this limit.

M is too big for us to use Gaussian elimination. M is also very sparse, so it makes sense to store only the non-zero elements. An entire column can be represented by the out-degree of a page and the row numbers of the non-zero elements (since those elements' values are all 1 divided by the out-degree). That means we only need to store a little over 4 bytes for each non-zero element. Then we can solve for v using MapReduce, and v tells us the "page rank" of each page.

In reality the web is not strongly connected. There are dead ends and there are spider traps, or cycles, in the graph. This problem is solved by the concept of taxation. Each surfer has a small probability of teleporting to a random page instead of following an out-link. There are also other algorithm variations for dealing with link spam. For example, besides page content, you also consider the link text or words near the link, so you are getting other people's take on what the page is about instead of relying solely on the page owner.
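
As a small sketch of the iteration in NumPy (the three-page graph, the 75 iterations, and β = 0.85 for the taxation/teleport step are all just illustrative choices):

import numpy as np

# Tiny web of three pages: A links to B and C, B links to C, C links to A.
# Column j of M spreads page j's rank evenly across its out-links.
M = np.array([
    [0.0, 0.0, 1.0],   # A receives all of C's rank
    [0.5, 0.0, 0.0],   # B receives half of A's rank
    [0.5, 1.0, 0.0],   # C receives half of A's rank and all of B's
])

n = M.shape[0]
beta = 0.85                 # probability of following an out-link; 1 - beta is the teleport "tax"
v = np.full(n, 1.0 / n)     # surfers start spread evenly across pages

for _ in range(75):
    v = beta * (M @ v) + (1 - beta) / n

print(v, v.sum())           # the PageRank vector; the components still sum to 1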

UPDATE:

This same idea can be applied to sports ratings where links are replaced by something like goals scored. My Machine Learning for NCAA Basketball Prediction - Performance Edition post has more details and some code.

Bipartite Graphs and Google Adwords

Continuing my series of posts relating to the Coursera Mining Massive Datasets class, this post summarizes the chapter on Google AdWords. I liked how the professors showed it to be related to matching in a bipartite graph--something that I learned about in a university graph theory class but without such a practical context as web advertising. I think that too often happens, at least in my experience. I've taken a lot of math classes (and done well in them), but I still don't feel like I had much practice applying what I learned to real-world problems.

[Figure: a simple bipartite graph]

To be clear, AdWords is for advertisers and AdSense is for website owners. Google provides an explanation in The difference between AdWords and AdSense. Today I'm focusing on how AdWords works.

The "adwords" model matches web searches with advertisements. Advertisers bid to be shown in response to certain search queries. If a user clicks on one of the ads shown to them, the advertiser pays. Of course, the most relevant ads are clicked more often than less attractive ads, so the challenge is displaying the ads with the highest bids that are most likely to be clicked, while staying within each advertiser's budget. This is a problem that traditional newspapers and magazines don't really have. They can only target specific niches of people (like people who buy Golf Digest), but web advertisers can target individuals (like people searching for a specific brand of golf clubs).

Google knows the click-through rate, or the percentage of the time an ad is clicked when it is displayed. It knows how much of an advertiser's budget has been spent. Google also knows what people have searched for in the past, but it doesn't know for sure what people are going to search for in the future. The "adwords" problem is therefore an online problem solved with a greedy algorithm: all you can do is make the best choice for each search as it arrives and hope it results in the best overall outcome.

A simplified version of this "adwords" problem can be modeled as maximum matching in bipartite graphs. A matching is a subset of edges where no vertex is an endpoint of two or more edges. A perfect matching contains every vertex, and a maximum matching is the largest possible matching for the graph. In this case one side of the graph is search queries, the other side is ads, and the edges connect ads to the queries they could be shown for. The maximum matching is the best way to display the ads. The simplifying assumptions are that one ad is shown, all advertisers have the same budget, click-through rates are the same, and bids are either 0 or 1.

The obvious greedy algorithm for matching considers edges in the order they are given. An edge is added to the matching if neither end is connected to an edge already in the matching. The competitive ratio is the worst-case ratio between what an online algorithm achieves and what the best offline algorithm achieves (the offline solution is optimal because all information about the problem is known in advance). For this greedy bipartite matching the ratio is only 0.5; that is, it will always find at least half as many matches as the optimum.
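
That greedy algorithm is only a few lines of Python (the query and ad names are made up):

def greedy_matching(edges):
    # Consider edges in the given order; keep one if neither endpoint is already matched.
    matched, matching = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

# Order matters: taking (q1, a1) first blocks the only edge q2 has.
print(greedy_matching([("q1", "a1"), ("q1", "a2"), ("q2", "a1")]))  # [('q1', 'a1')]
print(greedy_matching([("q1", "a2"), ("q1", "a1"), ("q2", "a1")]))  # [('q1', 'a2'), ('q2', 'a1')]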

It is possible to do better. The more realistic BALANCE algorithm considers both who bids highest and who has the most budget remaining. By doing so it achieves a competitive ratio of 1 - 1/e ≈ 0.63, which is the highest possible for an online algorithm.
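
A sketch of BALANCE under the same simplifying assumptions (every bid and price is 1, so the only thing that matters is who has the most budget left; the advertiser and query names are invented):

def balance(queries, bidders, budgets):
    # bidders maps a query to the advertisers bidding on it; budgets is spend remaining per advertiser.
    budgets = dict(budgets)
    assignment = []
    for query in queries:
        eligible = [a for a in bidders.get(query, []) if budgets[a] >= 1]
        if not eligible:
            assignment.append((query, None))   # nobody left with budget for this query
            continue
        winner = max(eligible, key=lambda a: budgets[a])
        budgets[winner] -= 1
        assignment.append((query, winner))
    return assignment

# A bids only on "x"; B bids on "x" and "y"; both have a budget of 2.
print(balance(["x", "x", "y", "y"],
              {"x": ["A", "B"], "y": ["B"]},
              {"A": 2, "B": 2}))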

UPDATE:

Bipartite matching implementations can be found in NetworkX and SciPy.
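
For example, with NetworkX (assuming a reasonably recent version; SciPy's scipy.sparse.csgraph module has a maximum bipartite matching routine as well):

import networkx as nx
from networkx.algorithms import bipartite

G = nx.Graph()
G.add_nodes_from(["q1", "q2"], bipartite=0)   # one side: search queries
G.add_nodes_from(["a1", "a2"], bipartite=1)   # other side: ads
G.add_edges_from([("q1", "a1"), ("q1", "a2"), ("q2", "a1")])

# Hopcroft-Karp based maximum matching; the dict maps each matched vertex to its partner.
matching = bipartite.maximum_matching(G, top_nodes=["q1", "q2"])
print(matching)   # e.g. {'q1': 'a2', 'q2': 'a1', 'a2': 'q1', 'a1': 'q2'}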