Sunday, October 26, 2014

Bloom Filters

A Bloom filter is a space-efficient data structure for checking whether an element is in a set when the set is too large for conventional hashing techniques. It is probabilistic in the sense that false positives are possible, i.e. it might say that an element is in the set when it really is not. There are, however, never false negatives. You cannot remove elements from a basic Bloom filter, so it is well suited to data streaming problems where the number of elements only increases.

An example use case is a web crawler that needs to determine whether it has already visited a URL. A small percentage of false positives is acceptable because very few pages on the web have only a single link pointing to them, so one incorrect Bloom filter lookup rarely causes a page to be missed entirely.

The Bloom filter is implemented as a large array of bits and several independent hash functions. Initially all bits are set to 0. To add an element, apply each hash function to it and set the bit at each resulting position to 1. To check whether an element has been seen before, apply the same hash functions: if all the resulting bit positions are 1, the element has probably been seen before; if any of them is 0, it definitely has not.
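
For illustration, here is a minimal sketch in Scala. The double-hashing trick below is one common way to simulate k independent hash functions; the class name and the second hash over the reversed string are purely illustrative, not a claim about any particular library.

```scala
import scala.collection.mutable.BitSet

// A minimal Bloom filter sketch: m bits, k hash functions.
class BloomFilter(m: Int, k: Int) {
  private val bits = new BitSet(m)

  // Simulate k independent hash functions via double hashing.
  private def positions(element: String): Seq[Int] = {
    val h1 = element.hashCode
    val h2 = element.reverse.hashCode // illustrative second hash
    (0 until k).map { i =>
      val h = (h1 + i * h2) % m
      if (h < 0) h + m else h
    }
  }

  def add(element: String): Unit = positions(element).foreach(bits += _)

  // true = probably in the set; false = definitely not in the set.
  def mightContain(element: String): Boolean = positions(element).forall(bits)
}
```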

If m is the number of bits in the array, k is the number of hash functions, and n is the number of elements added, then the probability of a false positive is approximately:

$$\left(1 - e^{-kn/m}\right)^k$$
As an example, consider the case where m = 10 million, n = 1 million, and k = 5. The probability of a false positive is about 0.94%. Note how each element requires more than one bit: m has to be bigger than n, or else all the bits would eventually be set to 1 and every lookup would return true. The Bloom filter therefore only makes sense when the universe of possible elements is so large that a traditional bit array, with one bit per possible element, would not work.
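
Plugging those numbers into the formula above:

$$\left(1 - e^{-5 \cdot 10^6 / 10^7}\right)^5 = \left(1 - e^{-0.5}\right)^5 \approx 0.3935^5 \approx 0.0094$$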

Optimal number of hash functions also shows how to calculate m and k when n and the desired false positive rate are known.
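
For reference, given a desired false positive rate p, the standard closed-form approximations are:

$$k = \frac{m}{n}\ln 2, \qquad m = -\frac{n \ln p}{(\ln 2)^2}$$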

UPDATE:

In a dependency of the Ethereum JavaScript API I found an implementation with a real-life usage example.

Friday, October 10, 2014

tf-idf

Term frequency-inverse document frequency (tf-idf) is a measure of word importance in a document within a corpus. Words with the highest tf-idf often best characterize the topic of the document.

Contrary to what one might expect, the words appearing most frequently in the corpus are not the most important; they are common words like stop words. Rare words are actually the best indicators of importance, especially if they appear multiple times within a document.

tf-idf can be represented by the following equation (thanks Online LaTeX Equation Editor):

$$w_{ij} = \frac{f_{ij}}{\max_k f_{kj}} \times \log \frac{N}{n_i}$$
Term frequency is the frequency of word i in document j, divided (normalized) by the maximum frequency of any word k in document j. Inverse document frequency, which discounts words that are simply more common overall, is the number of documents N divided by the number of documents n in which word i appears, scaled by taking the logarithm (the base of the log function does not matter).
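
As a rough sketch of these two factors in Scala (my own illustration: documents are assumed to be pre-tokenized term sequences, the term is assumed to appear somewhere in the corpus, and no smoothing is applied):

```scala
object TfIdf {
  // tf: frequency of the term in the document, normalized by the
  // frequency of the document's most frequent term.
  def tf(term: String, doc: Seq[String]): Double = {
    val counts = doc.groupBy(identity).mapValues(_.size)
    counts.getOrElse(term, 0).toDouble / counts.values.max
  }

  // idf: log of (total documents / documents containing the term).
  def idf(term: String, corpus: Seq[Seq[String]]): Double =
    math.log(corpus.size.toDouble / corpus.count(_.contains(term)))

  def tfIdf(term: String, doc: Seq[String], corpus: Seq[Seq[String]]): Double =
    tf(term, doc) * idf(term, corpus)
}
```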

This topic has come up in a couple of Coursera classes I have looked at--Web Intelligence and Big Data and Mining Massive Datasets--in the context of a search engine. Basically, you view each document and each query (a short document) as a vector of tf-idf scores; then you can find the most similar documents using cosine similarity and use that to rank the search results. Inverted indexes allow much of the tf-idf score to be pre-computed.
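
Given such vectors, a minimal cosine similarity over sparse term-to-score maps might look like this (again an illustrative sketch, not code from either course):

```scala
// Cosine similarity between two sparse tf-idf vectors keyed by term.
def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
  val dot = (a.keySet intersect b.keySet).toSeq.map(t => a(t) * b(t)).sum
  def norm(v: Map[String, Double]) = math.sqrt(v.values.map(x => x * x).sum)
  dot / (norm(a) * norm(b))
}
```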

UPDATE:

scikit-learn has a tf-idf usage example at Clustering text documents using k-means.

Wednesday, October 8, 2014

Software Transactional Memory: Dining Philosophers

Software transactional memory (STM) is a concurrency alternative to lock-based synchronization. It is similar to database transactions, except that instead of modifying database records you are modifying variables in shared memory. STM transactions are atomic (a transaction is all or nothing), consistent (all changes made by a transaction must preserve invariants), and isolated (concurrent transactions result in the same system state as the same transactions executed serially). Since the data lives in memory, they are not durable.

With STM you separate identity from state using persistent data structures. While state is constantly changing, the current state for an identity is an immutable value, so changes made by one thread will not affect another thread holding a reference to the same data structure.1

Without locks, multiple threads can also modify different parts of a data structure that would normally be guarded by a single lock. Optimistic concurrency control assumes that multiple transactions can run in parallel; when that assumption turns out to be wrong, the conflicting transactions are retried.2
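
As a minimal illustration, here is a sketch of an atomic transfer between two accounts. Ref, atomic, and retry are the actual ScalaSTM API; the account setup itself is my own example.

```scala
import scala.concurrent.stm._

object Accounts {
  val alice = Ref(100) // balances as transactional references
  val bob   = Ref(0)

  // Either both writes commit or neither does; conflicting
  // transactions are rolled back and retried automatically.
  def transfer(from: Ref[Int], to: Ref[Int], amount: Int): Unit =
    atomic { implicit txn =>
      if (from() < amount) retry // block until another transaction adds funds
      from() = from() - amount
      to()   = to()   + amount
    }
}
```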

Example Code

Motivated by the book Seven Concurrency Models in Seven Weeks, I have implemented the dining philosophers problem in Scala. The book uses Clojure, but ScalaSTM is inspired by the STMs in Haskell and Clojure, and I am trying to get better at Scala, so this seemed like a good way to reinforce what I was reading. It turns out that a dining philosophers solution is in the ScalaSTM documentation, but it was a worthwhile exercise nonetheless.

Conclusion

Atomic variables are enough for a lot of problems, and functional programming discourages mutable data anyway, but if a solution is simpler to express with multiple mutable variables, then STM might be the way to go. As always, though, you need to compare the performance of both approaches.

1 http://clojure.org/state
2 https://nbronson.github.io/scala-stm/intro.html

Scala Asynchronous IO Echo Server

In the previous posts Asynchronous Non-blocking I/O Java Echo Server and Fixing My Asynchronous Non-blocking IO Callback Hell With Monads, I compared using callbacks and futures/promises as concurrency alternatives to using a thread pool in a basic echo server. I have now rewritten the server from the second post in Scala, since it has better support for futures and promises than Java 8.

The idiomatic way to write an echo server in Scala would be to use the actor model, but that is a topic for another day. My implementation using futures and promises still turned out to be faster than the similar Java 8 implementation (in fact, roughly as fast as the true asynchronous NIO implementation), and the code is more concise.

There is still a need to block on accept, read, and write, but I also still think that is unavoidable with the AsynchronousSocketChannel API.
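
A condensed sketch of the general futures-and-promises pattern (my own reconstruction, not the post's exact code; the port and buffer size are arbitrary): a small adapter turns each CompletionHandler callback from AsynchronousSocketChannel into a Promise, and flatMap chains the resulting futures into a read/write loop.

```scala
import java.net.InetSocketAddress
import java.nio.ByteBuffer
import java.nio.channels.{AsynchronousServerSocketChannel, AsynchronousSocketChannel, CompletionHandler}
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global

object EchoServer extends App {
  // Adapt a CompletionHandler callback into a Scala Promise.
  def handler[A](p: Promise[A]) = new CompletionHandler[A, Any] {
    def completed(result: A, attachment: Any): Unit = p.success(result)
    def failed(e: Throwable, attachment: Any): Unit = p.failure(e)
  }

  def accept(server: AsynchronousServerSocketChannel): Future[AsynchronousSocketChannel] = {
    val p = Promise[AsynchronousSocketChannel]()
    server.accept(null, handler(p))
    p.future
  }

  // Read a chunk and write it back, recursively, until the client disconnects.
  def echo(client: AsynchronousSocketChannel, buf: ByteBuffer): Future[Unit] = {
    val read = Promise[Integer]()
    client.read(buf, null, handler(read))
    read.future.flatMap { n =>
      if (n < 0) Future.successful(client.close())
      else {
        buf.flip()
        val write = Promise[Integer]()
        client.write(buf, null, handler(write))
        write.future.flatMap { _ => buf.clear(); echo(client, buf) }
      }
    }
  }

  val server = AsynchronousServerSocketChannel.open().bind(new InetSocketAddress(8080))
  def acceptLoop(): Unit = accept(server).foreach { client =>
    acceptLoop()
    echo(client, ByteBuffer.allocate(1024))
  }
  acceptLoop()
  Thread.currentThread.join() // keep the main thread alive
}
```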