Friday, November 14, 2014

Locality Sensitive Hashing

In my previous post on tf-idf, I summarized my notes on how to look for similar documents in the sense that the documents are about the same topic. However, there is a slightly different problem of finding similar documents that are really the same document. They could be, for example, plagiarized, a mirror website, or a news story from the same source carried by multiple news outlets. When there are too many pairs of documents to efficiently compare them all, we can use a technique called locality sensitive hashing or LSH. Today I'll summarize my LSH notes.

In order to find these lexically similar documents, you need to represent each document as a set of short strings taken from it. One technique for this is called shingling, and a k-shingle is any substring of length k in the document. The value of k needs to be large enough that the probability of any given shingle appearing in any given document is low. To represent shingles more concisely, use a hash function to map the substrings to numbers. Then, for example, each shingle is represented as four bytes instead of a k-byte substring.

Even with four-byte shingles the set for a document still takes up more space than the document itself. We need to replace these large sets with smaller representations called signatures. A signature is the result of several hundred minhash calculations. Consider a matrix with a row for each shingle and a column for each document's set. The matrix has a 1 in a particular row and column if that shingle appears in that set. Pick a permutation of the rows; the minhash value for a column is then the number of the first row (in permuted order) with a 1 in that column.

In practice, storing the sparse matrix and permuting and sorting millions of rows is too time-consuming, but it can be simulated with random hash functions that map row numbers to as many buckets as there are rows (ignoring the small number of collisions). Instead of picking n random permutations, you pick n random hash functions. Whenever a column has a 1 in row r, each hash function's bucket number for r replaces the corresponding entry in that column of the signature matrix if it is smaller than the current value.

From the signature matrix we can estimate the Jaccard similarities of the underlying sets. Mind = blown.
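To make the permutation trick concrete, here is a minimal Python sketch of computing minhash signatures and estimating similarity from them. The hash function form h(x) = (a*x + b) mod c and all names are my illustrative choices, not code from the original post:

import random

def minhash_signatures(shingle_sets, num_hashes=100):
    # Simulate num_hashes random permutations with hash functions of the
    # form h(x) = (a*x + b) % c, where c is a prime larger than any
    # 32-bit shingle id.
    c = 4294967311
    coeffs = [(random.randint(1, c - 1), random.randint(0, c - 1))
              for _ in range(num_hashes)]
    signatures = []
    for shingles in shingle_sets:
        signatures.append([min((a * s + b) % c for s in shingles)
                           for a, b in coeffs])
    return signatures

def estimated_jaccard(sig1, sig2):
    # The fraction of minhash values that agree estimates the Jaccard
    # similarity of the underlying shingle sets.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)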

It still might not be possible to efficiently check the similarity of all pairs of documents because there are just too many pairs. To find the most similar pairs or all pairs above a certain similarity threshold, we need to focus only on pairs that are likely to be similar. The LSH approach here is to hash items several times and hope that similar pairs will end up in the same bucket at least once. Any pair that hashed to the same bucket in any of the hashings is a candidate pair (even though some of them are false positives).

Divide the signature matrix into b bands where each band consists of r rows. Hash each vector (the portion of each column in a band) to a large number of buckets. Each band has its own buckets. The probability of becoming a candidate pair is an S-curve of the similarity, and the similarity threshold (where the curve rises most steeply) is approximately:

t ≈ (1/b)^(1/r)

For a Jaccard similarity s, the probability of a false negative, i.e. missing a candidate pair, is given by:

(1 - s^r)^b

Finally, examine the candidate pairs to determine if they are above the previously determined similarity threshold.
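A sketch of the banding step in Python. The values b = 20 and r = 5 and the shortcut of bucketing by the band's tuple of values are illustrative assumptions; a real implementation would hash each band vector into a large number of buckets:

from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b=20, r=5):
    # Each signature must have b * r values. Columns whose band vectors
    # collide in any band become candidate pairs.
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col, sig in enumerate(signatures):
            buckets[tuple(sig[band * r:(band + 1) * r])].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates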

Besides the minhash family of functions, there are also other families of locality-sensitive functions that apply to other distance measures like Hamming distance or edit distance. LSH works well for applications like fingerprint matching and entity resolution.

UPDATE:

Spark implements several LSH operations and has more info on how they can be used.


Resources:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

Sunday, October 26, 2014

Bloom Filters

A bloom filter is a data structure that can be used for checking whether an element is in a set when the set is too large for conventional hashing techniques. It is probabilistic in the sense that false positives are possible, i.e. it might say that an element is in the set when it really is not. There are, however, never false negatives. You cannot remove elements from a basic bloom filter, so it is well-suited to data streaming problems where the number of elements only increases.

An example use case would be a web crawler that needs to determine whether it has already visited a URL. A small percentage of false positives is acceptable because no significant portion of the web has only a single link pointing to it, so a page skipped because of one incorrect bloom filter lookup will almost certainly be reached through another link.

The bloom filter is implemented as a large array of bits and several independent hash functions. Initially all bits are set to 0. To add an element, all the hash functions are applied to it and the bit corresponding to each hash function's result is set to 1. To check whether an element has been seen before, the same hashing process is applied to the element. If all the resulting bit positions are 1 then it has probably been seen before.
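A minimal Python sketch of that bit array plus k hash functions. Deriving the k positions by slicing a single SHA-256 digest is my shortcut for illustration, not a recommendation:

from hashlib import sha256

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                         # number of bits
        self.k = k                         # number of hash functions
        self.bits = bytearray(m // 8 + 1)  # all bits start at 0

    def _positions(self, item):
        digest = sha256(item.encode()).hexdigest()
        for i in range(self.k):            # works for k <= 8 with this shortcut
            yield int(digest[i * 8:(i + 1) * 8], 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

For the web crawler example, add(url) records a visit, and might_contain(url) returns True for every visited URL and occasionally (falsely) for an unvisited one.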

If m is the number of bits in the array, k is the number of hash functions, and n is the number of elements, then the probability of a false positive is approximately:

(1 - e^(-kn/m))^k

As an example, consider the case where m = 1 billion, n = 100 million, and k = 5. The probability of a false positive is about 0.94%. Note how each element requires more than one bit. m has to be bigger than n or else all bits would eventually be set to 1 and every lookup would return true, so the bloom filter only makes sense where the number of possible elements is large and a traditional bit array would not work.

Optimal number of hash functions also shows how to calculate m and k when n and the desired false positive rate are known.
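The formulas are easy to play with directly. This snippet assumes the standard approximations, including the usual result that the optimal k is (m/n) * ln 2; the specific numbers mirror the example above:

import math

def false_positive_rate(m, n, k):
    # (1 - e^(-k*n/m))^k
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    # number of hash functions that minimizes the false positive rate
    return (m / n) * math.log(2)

print(false_positive_rate(m=10**9, n=10**8, k=5))  # ~0.0094, the 0.94% above
print(optimal_k(m=10**9, n=10**8))                 # ~6.9 hash functions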

UPDATE:

In an Ethereum JavaScript API dependency I found an implementation, and it has a real-life usage example.

Friday, October 10, 2014

tf-idf

Term frequency-inverse document frequency (tf-idf) is a measure of word importance in a document within a corpus. Words with the highest tf-idf often best characterize the topic of the document.

Words appearing most frequently in the corpus are not the most important words, as might be expected; they are common words like stop-words. Rare words are actually the best indicators of importance, especially if they appear multiple times in a document.

tf-idf can be represented by the following equation (thanks Online LaTeX Equation Editor):

tf-idf_ij = tf_ij × idf_i = (f_ij / max_k f_kj) × log(N / n_i)

Term frequency tf_ij is the frequency f_ij of word i in document j divided (normalized) by the maximum frequency of any word k in document j. Inverse document frequency, which accounts for some words simply being more common than others, is the number of documents N divided by the number of documents n_i that word i appears in, scaled by taking the logarithm (the base of the log does not matter).
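A small Python sketch of that calculation on a toy corpus. The tokenization and names are my own assumptions; this is not the scikit-learn implementation mentioned in the update below:

import math
from collections import Counter

def tf_idf(docs):
    # docs is a list of token lists; returns one {word: score} dict per document
    N = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())
        scores.append({word: (count / max_count) * math.log(N / doc_freq[word])
                       for word, count in counts.items()})
    return scores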

This topic has come up in a couple of Coursera classes I have looked at--Web Intelligence and Big Data and Mining Massive Datasets--in the context of a search engine. Basically, you view each document and query (short document) as a vector of tf-idf scores, then you can find the most similar ones using cosine similarity as a way to rank the search results. Inverted indexes allow us to pre-compute much of the tf-idf score.
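Ranking by cosine similarity then amounts to a normalized dot product over those score vectors. A sketch, using the dict-of-scores representation from the snippet above:

import math

def cosine_similarity(vec_a, vec_b):
    # vec_a and vec_b are {word: tf-idf score} dicts
    dot = sum(score * vec_b.get(word, 0.0) for word, score in vec_a.items())
    norm_a = math.sqrt(sum(s * s for s in vec_a.values()))
    norm_b = math.sqrt(sum(s * s for s in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0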

UPDATE:

scikit-learn has a tf-idf usage example at Clustering text documents using k-means.

Wednesday, October 8, 2014

Software Transactional Memory: Dining Philosophers

Software transactional memory (STM) is a concurrency alternative to lock-based synchronization. It is similar to database transactions except instead of modifying database records, you are modifying variables in shared memory. STM transactions are atomic (a transaction is all or nothing), consistent (all changes made from a transaction must preserve invariants), and isolated (concurrent transactions result in the same system state as transactions executed serially). Since they are in-memory they are not durable.

With STM you are separating identity from state with persistent data structures. While state is constantly changing, the current state for an identity is immutable. Changes made by one thread will not affect another with a reference to the same data structure.1

Without locks, multiple threads can also modify different parts of a data structure that would normally share the same lock. Optimistic concurrency control assumes multiple transactions can run in parallel and, when that assumption turns out to be wrong, retries them.2

Example Code

Motivated by the book Seven Concurrency Models In Seven Weeks, I have implemented the dining philosophers problem in Scala. The book uses Clojure, but ScalaSTM is inspired by the STMs in Haskell and Clojure and I am trying to get better with Scala, so this seemed like a good way to reinforce what I was reading. It turns out that a dining philosophers solution is in the ScalaSTM documentation, but nonetheless it was a worthwhile exercise.

Conclusion

Atomic variables are enough for a lot of problems and functional programming discourages mutable data anyway, but if it is simpler to use multiple mutable variables then STM might be the way to go. As always, you need to compare the performance of both.

1 http://clojure.org/state
2 https://nbronson.github.io/scala-stm/intro.html

Scala Asynchronous IO Echo Server

In previous posts Asynchronous Non-blocking I/O Java Echo Server and Fixing My Asynchronous Non-blocking IO Callback Hell With Monads, I compared using callbacks and futures/promises as concurrency alternatives to using a thread pool in a basic echo server. I have now decided to rewrite the server from the second post in Scala since it has better support for futures and promises than Java 8.

The idiomatic way to write an echo server in Scala would be to use the actor model, but that is a topic for another day. My implementation using futures and promises still turned out to be faster than the similar Java 8 implementation (in fact roughly as fast as the true asynchronous NIO implementation) and the code is more concise.

There is still the need to block on accept, read, and write, but I also still think that is unavoidable with the AsynchronousSocketChannel API.

Saturday, August 9, 2014

Vectorization: A Matrix Multiplication Example

It was in the Coursera Machine Learning class that I first learned about vectorization. When dealing with vectors and matrices, instead of writing loops and computing operations on scalars, you can instead perform the operations directly on the higher-dimensional data structures. It is a type of parallelism where one processor performs an operation on multiple data elements simultaneously, but it does not involve any concurrency.

Vectorization is made possible by array programming languages and libraries like Octave, R, and NumPy. It can be a compiler optimization or some kind of interface can be made available to the programmer so they can indicate the operations to vectorize. At a lower level, this is implemented with SIMD processor instructions or using, for example, 32-bit instructions to simulate vector computations of 16 or 8-bit types.

A simple example to demonstrate this, I think, is matrix multiplication. In Java (no vectorization possible without using the JNI) a simple (but partially optimized) implementation looks like this:

Multiplying 1000 x 1000 matrices containing random numbers takes about 6 seconds on my laptop. You can see how this will not work for machine learning calculations on matrices with millions of elements.

In Python, I can do the same multiplication using NumPy like this:
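A sketch of what that looks like; the matrix size and timing code here are my assumptions, not the original snippet:

import time
import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)

start = time.time()
c = a.dot(b)                 # the whole multiplication is one vectorized call
print(time.time() - start)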

It takes about 0.02 seconds. Obviously this isn't a proper performance comparison, but the difference of more than two orders of magnitude is still illustrative of the power of vectorization.

Saturday, July 26, 2014

Java 7 Fork/Join and Java 8 Streams: Merge Sort

Fork/Join

The fork/join framework from Java 7 has been on my list of things to check out for a while now. Basically, it provides an easy way to parallelize work that can be broken up into smaller parts. Work is distributed to threads in a pool, and idle threads can steal work from busy threads (work stealing). There are many parallel algorithms and I decided to try parallel merge sort. I have two processors, so I would expect to see some speedup over the canonical top-down merge sort.


My fork/join parallel implementation does not become faster than the non-parallel implementation until I try to sort over about 25 million integers. It is obvious fork/join comes with significant overhead, but I suspect that would be less noticeable if I had more processors to work with.

Streams

As I have been trying to become more adept at functional programming and am also a Java fan, I have been starting to explore what's new in Java 8 as well. This probably is not at all practical, but I had the idea to try to re-implement merge sort using streams. To do this I had to use the non-recursive bottom-up merge sort algorithm.

This is faster than the fork/join version up to about 3 million elements, then it becomes slower. It was not really clear to me how to properly parallelize this one. Converting it to a parallel stream before doing the merge seems to only slightly improve performance. Again, maybe it is difficult to really see the benefit with only two processors.

It is also worth noting that since all parallel streams use the same fork/join thread pool, they will be affected by each other's performance if used concurrently as explained in Java Parallel Streams Are Bad for Your Health.

Notes

For completeness, here is the merge method I used in all three merge sort implementations:

Friday, July 25, 2014

Demystifying Java Dumps

I have never taken dumps of an application to help with debugging. I can think of a couple of instances when it would have been helpful, but at the time I did not know about them. Now I hear more about thread dumps and heap dumps and feel like they should be in my toolkit.

In this post I will just try to touch on what the dumps will contain, why they are useful, and how to get started interpreting them. This is by no means a comprehensive analysis. There are several different JVM and OS-specific ways to trigger dumps (signals, environment variables, JVM command-line options), their content varies by JVM, and there are multiple tools for analyzing them.

I am using 64-bit Oracle SDK 1.8 on Windows 7 because that is what I already had available.

Thread Dumps

A thread dump is a snapshot of all threads in a process, including their stacks. It can help you diagnose problems like deadlocks. By taking several consecutive thread dumps you can also detect bottlenecks like multiple threads vying for the same lock.

To create my thread dump I simply pressed Ctrl+Break in the window my server was running in.

To analyze my thread dump I tried out the Thread Dump Analyzer (TDA). I provided three thread dumps about five seconds apart and it found my deadlock:

Program that intentionally deadlocked two threads

Heap Dumps

A heap dump is a snapshot of the memory of a Java process, including everything in the JVM heap space. It is useful for figuring out why you are running out of memory.

To create my heap dump I set the -XX:+HeapDumpOnOutOfMemoryError JVM option, but to create one interactively I could also have used the jmap tool: jmap.exe -dump:live,format=b,file=<file name> <pid>

To analyze my heap dump I used the Memory Analyzer Eclipse plugin which shows things like:
  • a histogram showing the number of instances per class
  • dominator tree - a list of the biggest objects and what they keep alive (shown below)
  • top consumers - the most expensive objects grouped by class and package
  • classes loaded by multiple class loaders
  • memory leak suspects
  • top components - components bigger than one percent of total heap
Program that intentionally caused an OutOfMemoryError by creating a large list




Saturday, June 21, 2014

Exposing the Circuit Breaker Pattern as a JMX MXBean

Circuit Breaker Pattern

I first read about the Circuit Breaker pattern in Release It as a way to prevent cascading failures. Martin Fowler has also written about them in CircuitBreaker. The basic idea is that you protect all your operations that are likely to fail or timeout (remote operations) with a circuit breaker that monitors the operation. The breaker is initially in a closed state, but after a failure threshold is reached it opens, or trips. When the breaker is open, no calls are made to the remote resource. This prevents you from having too many callers waiting on a remote resource that is not going to respond. Too many callers waiting and hogging resources could have caused the failure to cascade to their part of the system.

After a period of time the breaker can try the operation again and close, or reset, if successful. It would be nice to be able to view the state and manually open or close the breaker though. This is where JMX comes into the picture.
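Ignoring JMX for a moment, the state machine itself is small. Here is a language-agnostic sketch in Python; the thresholds, names, and half-open handling are illustrative assumptions, not the implementation from this post:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the breaker is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit is open")
            self.opened_at = None             # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0                     # success resets the failure count
        return result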

JMX

Java Management Extensions (JMX) is a Java technology for managing and monitoring resources represented by MBeans. MXBeans are a special type of MBean that is usable by any client. Normally JMX is used in the context of monitoring an application server and the JVM it is running in, but applications are free to provide their own and register them with the application server's MBean server.

Circuit Breaker MXBean

I felt that Release It glossed over how to monitor a circuit breaker. However in the Java world JMX seems like an obvious choice, and I had never written my own JMX MBean, so I wanted to see how it would work.

First you need an interface containing the things that will be exposed through JMX. The getters and setters map to attributes and other methods map to operations.



My implementation of the interface is pretty basic and not well tested, but I think it captures the basic functionality you would want. Notice it contains an execute method that was not in the interface. I did not think it made sense to expose execute through JMX.



Lastly I created a servlet that registers the MXBean and invokes the execute method with fake tasks.



Controlling the MXBean With JConsole

JConsole is a graphical monitoring tool for JMX resources. You can view and modify the MXBean's attributes and trigger the operations that are exposed. These changes are made without needing to restart the application or application server.


Update May 1, 2015

The e-book Migrating to Cloud-Native Application Architectures also covers circuit breakers and has code examples showing how to add them to a Spring Boot application by using the Netflix OSS Hystrix library. Interestingly, Hystrix also uses another fault-tolerance pattern from Release It called bulkheads by operating each circuit breaker in its own thread pool. This library collects metrics like traffic volume, request rate, latency, etc. and emits them as an event stream which can be aggregated and visualized by other Netflix projects. That's obviously better for a cloud application than what I did above.

Partly Cloudy Distributed Transactions

General Background

Distributed transactions are atomic transactions involving two or more resources, usually residing on separate machines. Each resource is transactional and there is also a transaction manager, like an application server, that manages the global transaction. The resources could be multiple relational databases, JMS queues, JCA resource adapters, or some combination of these.

The four ACID properties still apply to distributed transactions:

1. atomicity - a transaction is all or nothing
2. consistency - any transaction brings the database from a valid state to a valid state
3. isolation - concurrent transactions result in the same system state as transactions executed serially
4. durability - once a transaction has been committed, it will remain so

The terms "2PC" and "XA" seem to sometimes be used interchangeably with "distributed transaction" and I want to note the distinctions here. Two-phase commit (2PC) is a common algorithm for coordinating the participants of a distributed transaction. It ensures that even if part of the system crashes, the distributed transaction can still be committed or rolled back. The XA specification describes the interface between the global transaction manager and the local resource manager. All XA transactions are distributed transactions, but XA supports single-phase commit and two-phase commit.

To enable recovery from crashes and hardware failures, a transaction log containing the transaction history is written by the global transaction manager. When restarting it can use the log to replay in-doubt transactions and bring the system back to a consistent state. An example of an in-doubt transaction would be where the application server crashes after the first phase of the 2PC protocol (prepare) has completed, but before the second phase (commit) has completed. When the application server restarts the transaction can be completed based on the transaction log.

In The Cloud

Distributed transactions do not work well in the cloud for several reasons, some described here. My list is an attempt at summarizing that post with a mix of my own cloud experiences:

1. A node that was part of a transaction might be removed during down-scaling and never reappear.
2. Failures everywhere in the cloud are expected, so in-doubt transactions might be common because of network failures, network latency, or even the transactional resources being unavailable.
3. The application's file system is probably ephemeral so the transaction log needs to be stored in a database. As application instances are updated, if they are completely recreated, it will be difficult to associate a server with its transaction log.

An alternative to distributed transactions could be to make all operations idempotent. Then they could be retried at the application level without any problems. There are also other distributed transaction algorithms besides 2PC that potentially scale better.

Testing Transaction Recovery

It is not trivial to reliably create in-doubt distributed transactions because the transaction API UserTransaction only allows you to trigger the transaction beginning, the commit (2PC), and a rollback. What you really need is to stop the server part way through the commit, which means the hooks to do that will be vendor specific.

I found an easy-to-use tool called Byteman that runs as a -javaagent and can instrument a Java program, i.e. insert extra code into it at run time. If you know a class name and method name, say for the prepare method, you can have Byteman kill the server when the method exits like this.

Saturday, June 14, 2014

Ethical Hacking

At work I have been participating in a cyber security war games event where each week new challenges are posted. They focus on OWASP Top 10 web application security flaws. It has been fun because we get to actually exploit the flaws in an application hosted for the war games, instead of just reading about them or watching a presentation.

I think these types of vulnerabilities are something any web developer should be aware of, even if you do not deal with security on a daily basis, so when you are coding something that could be vulnerable you at least recognize it and can do more research, more testing, ask an expert, etc.

I have needed two tools to complete the challenges. First, I used Firebug to inspect HTML, JavaScript, and cookies. Second, I used the proxy server that is part of Burp Suite. The idea is that your browser sends requests to the proxy and the proxy forwards them on to the real destination, but before the proxy forwards them you have a chance to view/change the request.

Below are my notes plus more research I have done, mostly focusing on how to prevent them in the first place. After all, I am supposed to be learning how to write more secure code not how to be a hacker.

SQL Injection

A SQL Injection attack can occur when a web application accepts user input and uses it as part of a SQL query. If the input is not properly filtered, then the attacker can provide partial SQL instead of valid input and obtain information they do not have access to.

A common example is: ' OR '1'='1

The best way to prevent this type of attack is to use prepared statements combined with the principle of least privilege and white-list input validation.

Broken Authentication And Session Management

Broken authentication and session management can be exploited when, for example, user credentials are not encrypted (encoding is not the same as encryption) in an HTTP request, or passwords are not properly hashed in a database, or sessions do not time out. Any of these could lead to an attacker being able to access someone else's account.

There is not a single best practice here as there are many attack vectors, but OWASP provides detailed cheat sheets for authentication and session management that should be read in their entirety.

Cross-site Scripting (XSS)

An XSS attack can occur when a web application accepts user input and includes it in a page that will be seen by other users. If the input is not properly filtered, then the attacker can insert code that will send user credentials to them (like a form submitted to another website).

A common, demonstrative example is: <INPUT TYPE="BUTTON" ONCLICK="alert('XSS')"/>

The best way to prevent this type of attack is to escape untrusted data for the HTML, JavaScript, JSON, CSS, and URL contexts it is inserted into.

Insecure Direct Object References

Insecure direct object references can be exploited by an attacker who changes a parameter value that refers to a system object, to a value for a different system object they do not have access to. This might mean changing a parameter in a URL or using the proxy to change data in the HTTP request.

The best way to prevent this type of attack is to use indirect object references and have the application map from the indirect object reference to the actual key. If you must expose direct object references, then verify, for each object reference, that the user is authorized for that object.

Cross-site Request Forgery (CSRF)

A CSRF attack can occur when a user who is authenticated with a third site visits an attacker's site (or a site that was the victim of an XSS attack) and a hidden, forged HTTP request is submitted to the third site. Because the user is already authenticated with the third site, their browser will automatically send their session cookie along with the forged request, and it will be indistinguishable from a legitimate request.

CSRF seems to be particularly effective when the user falls victim to the attack just by loading a page. The XSS code might not even be visible to the user. An image whose source is such a URL would do this, or for an HTTP POST a hidden form will achieve the same thing:

<img src="http://example.com/app/transferFunds?amount=1500&destinationAccount=attackersAcct#" width="0" height="0" />

<form name="csrf" action="http://example.com/app/transferFunds" method="post">
<input type='hidden' name='amount' value='1500'>
<input type='hidden' name='destinationAccount' value='attackersAcct#'>
</form>
<script>document.csrf.submit();</script>

Preventing CSRF requires including an unpredictable token with each HTTP request. It must, at the very least, be unique per session. It is usually included in a hidden field so it will be sent in the body of the HTTP request, making it less exposed than in the URL itself. A CAPTCHA can also help determine whether a request came from a real user.

Insecure Cryptographic Storage

Like broken authentication and session management, insecure cryptographic storage is a broad category of application mistakes that can be exploited. Attackers can, for example, find encryption keys, gain access to a database that automatically decrypts data, or exploit a weak encryption algorithm.

To protect against this kind of vulnerability, there are several steps to take:

1. Sensitive data should be encrypted if it is stored long-term, and the key should be stored separately
2. Strong, standard encryption algorithms should be used along with proper key management
3. Passwords should be hashed and a salt should be used
4. Never store unnecessary data
5. Infrastructure credentials like passwords in a config file should be protected with file system permissions

Failure To Restrict URL Access

An application that fails to restrict URL access gives anonymous users access to pages that should be protected. The attacker might guess an admin URL or even change the CSS in their browser to show something that was hidden.

To prevent these types of attacks, the application needs to always verify that the user is authorized and not just hide links and buttons that they are unauthorized for. Authentication and authorization policies can be role-based to minimize the effort of maintaining them. They should be easily configurable and not hard-coded.

Unvalidated Redirects And Forwards

Unvalidated redirects and forwards can be exploited when an attacker links to a site's redirect page and specifies a malicious site as the destination, one that hosts malware or phishes for credentials.

If you cannot avoid using redirects and forwards, then check that the supplied destination parameter is valid and authorized for the user. A destination parameter can also be a value that is mapped to a URL on the server.

Thursday, May 8, 2014

Web Sockets: Random Number Streaming

I have recently been playing with web sockets because I wanted to get a better understanding of how to use them. A web socket is a full-duplex connection between a web browser and a web server over a single TCP connection. This is not to be confused with earlier techniques with similar aims like long-polling and Comet. Web sockets have their own RFC (RFC 6455); they are a standard, and all the major browsers have implemented it. This means web sockets are not tied to any particular language or server. Below I will show web sockets used in plain JavaScript, Node.js, and Java.

Node.js Server

Here is a Node.js server that sends a random number to everyone connected to it, every second, on the same channel. Obviously this is not that useful, but it could represent a stock price for example. I found the ws module to be easy to use for the web socket server implementation. Another module socket.io seems to be more popular but you need to use it in the client as well. For this learning exercise I wanted to use native web sockets in my client.


Java Server

Here is the same random number sender implemented in Java. I ran this on the recently released WebSphere Liberty Profile 8.5.5.2 which added a web socket feature (full disclosure I work closely with this product at IBM).


JavaScript Client

My client can talk to both servers. I probably went overboard, but I used the same MVC-style from my Client-side MVC with JavaScript blog. The important part is lines 9 through 12.


A screenshot of the client receiving random numbers from the Node.js server:

Monday, April 21, 2014

Memoization: Fibonacci Sequence

Memoization is an optimization technique that caches the results of computationally expensive function calls. I first learned of it in Stanford's Algorithms class on Coursera.1 It is more common in functional programming than object-oriented programming because the function being cached cannot have any side effects, which is, of course, often the case in functional programming. It is also commonly used by lazy data structures. The downside of memoization is that more memory will be used, but for some functions, as you will see below, it is well worth it.

A function that returns the nth Fibonacci number is the quintessential recursive function. I still remember that was the example used when I first learned of recursion. It is actually often implemented inefficiently, though, the standard recursive implementation being O(2^n).

Python

Python makes memoization easy because you can decorate any function with a memoization function.
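Along the lines the post describes, here is a sketch of a memoizing decorator applied to Fibonacci; the names are mine, not the original snippet:

import sys
from functools import wraps

def memoize(f):
    cache = {}
    @wraps(f)
    def wrapper(n):
        if n not in cache:
            cache[n] = f(n)
        return cache[n]
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

sys.setrecursionlimit(5000)   # the recursion is still deep, as noted below
print(fib(1000))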

Scala

Scala does not make it that simple. Hopefully it will be built into the language in the future. This naive implementation only caches the original function call. The recursive calls do not know to use the memoized wrapper.


This blog correctly points out this problem with memoizing a recursive function and has a great technical explanation of the solution. In summary, you have to use something called a Y combinator to correctly implement the memoized wrapper function in Scala. It is basically a higher-order function that returns a recursive version of a non-recursive function. I will defer to these other resources for a proper definition.


Notes

In Scala, for example, I could get up to n=45 in about 20 seconds with the standard un-memoized Fibonacci implementation. With the correctly memoized version I can run n=1000 in under 1 second because the memoized recursive implementation is O(n).

The recursive Fibonacci implementation is not tail-recursive (which I covered in a previous blog), so you pretty quickly will hit a stack overflow error. In my Python implementation I even had to explicitly set the recursion limit higher than the default.

Also note the use of currying (which I also covered in a previous blog) in the nonRecursiveFibonacci Scala function.


1 I would highly recommend this class. It is better than the algorithms class I paid for to get my CS degree.

Tuesday, April 1, 2014

Fixing My Asynchronous Non-blocking I/O Callback Hell With Monads

In my previous post Asynchronous Non-blocking I/O Java Echo Server, I looked at non-blocking I/O from the Java perspective. My echo server used the callback functionality known as a CompletionHandler in Java 7, but it left something to be desired. Mainly I had a callback inside a callback inside a callback. Well, now Java 8 is released and I set out to fix it inspired by Monadic futures in Java 8: How to organize your data and avoid callback hell.

Monads surround computations and can take actions on them. They help chain computations together because they take in an action and then return a new monad surrounding the computations that had the action executed on them.

I am still trying to fully understand monads myself, so I am not going to act like I already do and give detailed information here. I think it's easier to understand from examples, anyway, instead of reading about category theory. The above definition will make more sense after looking at my new server code and seeing the chaining in action.

My New Server Code

A promise is a type of monad common around asynchronous operations. I used the new monadic CompletableFuture class to wrap my asynchronous computations.

My Thoughts

1. I like this better than my Java 7 version in terms of readability, but it would still be a lot cleaner without checked exceptions like in Scala.
2. This is not really non-blocking. Even though it is done asynchronously, I block waiting for a Future to complete three times. There is not a good relationship between Future and CompletableFuture to help avoid this as far as I can tell.
3. This Java 8 version is slower than the Java 7 version. My theory is that, even though the blocking calls are handled asynchronously by the CompletableFuture class, explicitly blocking the thread with the get() instead of letting Java handle it with a callback is what makes the difference.

There is not much Java 8 code out there to look at, so what would make this better? Can I get rid of the blocking get() calls somehow? Why is it slower than the Java 7 version?

Notes

A future represents a result that does not yet exist (and might not ever exist). Futures are read-only. A promise is the container that completes a future. Promises are writable.

UPDATE:

I posted a Scala version of this at Scala Asynchronous IO Echo Server.

Generators: Fibonacci Sequence

A generator returns a sequence of values, but unlike a regular function the values are returned one at a time. This requires less memory when representing large (possibly infinite) sequences because the entire sequence is never stored in memory at once. The sequence is computed, one value at a time, as needed.

Python
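A sketch of a Fibonacci generator (not the original snippet from this post):

def fibonacci():
    # Yields the sequence one value at a time; nothing is stored beyond
    # the two most recent values.
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

import itertools
print(list(itertools.islice(fibonacci(), 10)))   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]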


Scala

Exploring Tail Recursion

A tail-recursive function does not require a new stack frame for each call. All the recursive calls will execute in a single frame. This is beneficial because you do not have to worry about recursing too deeply and getting a StackOverflowError, or about the overhead of many function calls.

What makes a function tail-recursive, and not just recursive, is that the call to itself is the last operation in the function. When this happens the compiler can generate bytecodes that are the same as if the function were written with a while-loop. These two examples from Programming in Scala are not only functionally equivalent, but would be optimized to have the same bytecode:


In Scala use the @tailrec annotation so you will get a compiler error if the function's recursion can not be optimized away.

Exploring Partial Functions, Partially Applied Functions, and Currying

I'm trying to learn more about functional programming. This is the third in a series of posts about functional programming concepts with examples in Scala (and in some cases other languages too).

Partial Functions

A partial function is a function that is only valid for a subset of its possible input values. Some functions are not valid for all inputs, like asymptotic functions or those only applicable to natural numbers.


Partially Applied Functions

A partially applied function is an expression in which you don’t supply all of the arguments needed by the function.


Currying

A curried function is applied to multiple argument lists. The results are chained together. In this simple example, the 2 would be applied to x + y resulting in 2 + y. Then the 3 is applied resulting in 5.

This technique can be used to make method calls look more like native-language support. Here is a non-trivial example from Programming in Scala:

Notes

Partial functions and partially applied functions are not related, except the names are almost the same. I was confused about this when I started to research this post.

The call to partially applied functions returns immediately. Curried functions return another function in the currying chain.

Monday, March 24, 2014

Exploring Immutability

In Java theory and practice: To mutate or not to mutate?, Brian Goetz says immutable objects simplify programming. One of the reasons is that they can be shared and cached without needing to be cloned, which means they are thread-safe if written correctly, which means you can more easily take advantage of multiple cores. In Scala you could write something like List(1, 2, 3).par.map(_ * 2) and the map function is applied to the list elements in parallel. That would not work if someone else was changing the contents of the list at the same time.

Thread-safety, parallelization, it all sounds great, but one of the first things I think of is that if you had large immutable collections would there not be a lot of copying going on? There would be if every time you added or deleted an element the entire collection was recreated. It turns out that does not need to happen, though, because the collection implementation can exploit the similar (immutable) structure between the old and new versions. This might mean sharing a sub-tree or sharing the tail of a list.1,2 Compilers can potentially do additional optimizations with immutable objects as well.3

Like anything else, immutable objects do not solve all problems and can be used incorrectly. At the end of the day I see it as another tool in the toolbox. To look at how (un)natural it is to work with immutable collections, below are some Scala examples.


1 http://en.wikipedia.org/wiki/Persistent_data_structure
2 http://pragprog.com/magazines/2012-01/scala-for-the-intrigued
3 http://www.drdobbs.com/architecture-and-design/optimizing-immutable-and-purity/228700592

Sunday, March 23, 2014

Closure Examples

A closure is a function plus a referencing environment that is remembered from when it was created. Below I created a simple example in three different languages for comparison. It adds two numbers where the sum is capped at a certain max value. The function adds x and y, and the referencing environment contains the max value.

Python
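A sketch of the capped adder described above; the function and variable names are my own:

def make_capped_adder(max_value):
    # The returned function closes over max_value from this scope.
    def add(x, y):
        return min(x + y, max_value)
    return add

add_up_to_ten = make_capped_adder(10)
print(add_up_to_ten(3, 4))   # 7
print(add_up_to_ten(8, 9))   # 10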

Scala

JavaScript


In Scala and Python closures are probably seen more often when a lambda uses values from outside its scope. In JavaScript closures are commonly used to make private-like variables and functions that do not pollute the global namespace.

Saturday, March 8, 2014

Asynchronous Non-blocking I/O Java Echo Server

After writing a Node.js blog a couple weeks ago I wanted to revisit non-blocking I/O. I was first introduced to the topic in a college class where we were using Java, but that was quite a while ago, so I wondered what a non-blocking I/O server looks like in Java these days (and of course Node.js handles a lot of these details for you, so it is probably not the best place to learn about it). Non-blocking I/O was added in Java 1.4 (NIO) and then in Java 7 asynchronous non-blocking I/O (NIO2) was added.

I set out to write a simple echo server that can at least handle simultaneous connections so as to not be completely trivial. I would use the NIO2 APIs so that it is completely asynchronous to be most like the Node.js TCP echo server.

My Server Code


This code is disappointingly difficult to read. I think this is what JavaScript folks call callback hell: three levels of callbacks. Although if you can get through all the boilerplate and exception handling code, you will see I really only had to write a few lines of "real" code which is pretty nice. It is more concise than a similar server written using the older NIO APIs too.

Differences Between Blocking I/O, Non-blocking I/O, and Asynchronous Non-blocking I/O

With blocking I/O when a thread does a read or write it is blocked until some data is read or the data is written. The canonical Java web server example spawns a new thread for each request so that the main thread will not be blocked from accepting new connections.

With non-blocking I/O you request a read and you get what is available (maybe nothing is available) and the thread can continue on. You request a write and whatever can be written is written and the thread continues on.1 In other words, a single thread can manage multiple connections, but it might have to call read and write multiple times to completely read and write for each request and response.

With asynchronous non-blocking I/O an I/O operation always returns immediately--before anything is read/written--and the thread continues on. The I/O operation is handled in the background and the callback you provide eventually handles the result. This is how I wrote my server shown above.

These real-world analogies are helpful to at least understand the differences between blocking and non-blocking.

Conclusion

Non-blocking I/O is good when you need to manage many long-lived concurrent connections like a chat server or a P2P network. A use case where blocking I/O is probably better is one where more data is transferred at once, but connections are short-lived, like a traditional web server.1

It also seems like web sockets would be implemented with non-blocking I/O. Can anyone point me to more information about that?

Update

I posted a Java 8 version here.


Resources

1 more info at http://java.dzone.com/articles/java-nio-vs-io

Friday, February 21, 2014

Client-side MVC with JavaScript

Continuing the JavaScript theme from my previous two posts, there was one more JavaScript idea I wanted to investigate. I have to admit, I have been confused by what I've seen in the MVC space recently. It did not match up with what I thought of when I thought of MVC. Now, I'm not a web developer, this is not a topic that concerns me on a regular basis, but I feel like any self-respecting developer should have an understanding of these sorts of things.

Reading the Model-view-controller Wikipedia article actually made a light bulb go on in my head. See, I was stuck in the past and was thinking of all three components living on the server, but as client-side technologies have matured (read AJAX) the components have shifted to the client. This makes a lot of sense as MVC does not prescribe the exact implementation, rather MVC is a design pattern.


Whether it's client-side or server-side, you still have a controller that manipulates the model and a model that updates the view. With client-side MVC all three components would be written in JavaScript. With server-side MVC the model is a JavaBean, the view is a JSP, and the controller is a servlet. The latter is how I originally learned MVC and it is shown in the image below:


So what I decided to do was create a pure JavaScript MVC example whose model talks to the Node.js app from my last blog. There appear to be many existing frameworks for this, but they all looked way overkill for such a simple example.

The HTML


It simply has a few forms to trigger the different CRUD operations.

The JavaScript


I think it seems strange I need to refer to so many HTML elements in my controller. I could initialize the controller with the IDs it needs. That would be better; the controller would be less coupled. I still wonder if I separated my components correctly though. Is there a better way?

It's Alive

I ran everything on my laptop and dropped these files into a "public" directory at the same location as my Node.js app. They are served as static files and this conveniently eliminates cross domain AJAX issues since both are at localhost.

The result of clicking "List All Wines":



Testing (in theory)

In my first blog I talked about how to get started testing JavaScript, so naturally I wanted to see if those principles could be applied here. It seems like the model and the view (with minor refactoring) would be easily testable. The AJAX requests to my Node.js app could be mocked as I demonstrated. I would say those components turned out really well. The controller would require more effort. I would have to pass the controller a mock view and mock model, but it's definitely possible. I think this conclusion gives merit to the design even though this was a simple example. I like the way it separated the unrelated parts of my code.

Tuesday, February 18, 2014

Node.js CRUD API

I have been wanting to check out Node for quite a while now. I am used to the more common OS thread-based concurrency model that we see in Java application servers. Node is instead based on an event-driven, non-blocking I/O model where all connections are handled in one thread.1 As long as the processing for each request is not CPU-intensive, it can handle thousands of concurrent requests.2

Another more obvious advantage of Node is that your application is written in JavaScript. Anyone who has ever written client-side JavaScript can immediately begin writing server-side code. You could also, for example, use the same input validation code in your client and server.3

Setup

Never having written a Node app before, I based my first one largely off this great blog Creating a REST API using Node.js, Express, and MongoDB by Christophe Coenraets. I hope to explore MongoDB more in another post. For now I will ignore it and just say that it is simple to install and Christophe's code to talk to it works as is.

I ran my app on Windows and I did get tripped up on installing the Node MongoDB driver with npm. I had to follow these steps and run "set npm_config_arch=x86" before running "npm install mongodb".

The Code

The server is pretty straightforward. Map each of your routes to a function that handles the request and listen on a specific port.


I updated the functions that talk to MongoDB to (I think) be more RESTful.4 The non-GET requests will respond with more meaningful HTTP response codes instead of JSON.


Node is able to handle multiple connections in one thread because of the use of callbacks. Any time you do something in your app that may take a long time, you tell Node what to do and provide a function to call when it is finished. It can handle other requests while the long-running operation (like reading a file) is completing.

Testing

Bringing in a testing framework would probably be overkill for this simple app, so cURL will suffice. I got it for Windows here. An example for each of the routes:

curl -i -X GET http://localhost:3000/wines
curl -i -X GET http://localhost:3000/wines/<id>
curl -i -X DELETE http://localhost:3000/wines/<id>
curl -i -X POST -H "Content-Type: application/json" -d @new-wine.json http://localhost:3000/wines
curl -i -X PUT -H "Content-Type: application/json" -d @new-wine.json http://localhost:3000/wines/<id>



Tuesday, February 4, 2014

Unit Testing Client-Side JavaScript

I'm far from a JavaScript expert, but I've done some web development using JavaScript before and one thing I have always wondered about was how to write automated unit tests. Of course you'll probably still want to do manual testing and functional testing with something like Selenium. It seems like unit tests would greatly improve the development experience though. The thing is, I had no idea what a JavaScript unit test would look like, so I chose to investigate it as the subject for my first blog.

The advantages of unit testing JavaScript are the same as unit testing in other programming languages: looking for regressions is easier. It also promotes better code design because you can refactor with confidence or do TDD and write testable code from the start. And because our execution environment is the browser, unit tests make it much easier to do cross-browser testing.

The basic requirements for being able to write unit tests are that your JavaScript is in separate files from your HTML and that you write granular, parameterized functions. Then you can call these functions from a test HTML file and run the tests by opening the HTML file in a browser. 

I'll quickly mention that I came across PhantomJS for running headless tests and an intriguing project called Karma for hooking your tests into your continuous integration process. I'm sure there are dozens of other great JavaScript frameworks too. Again, for this post I want to keep it simple and just go through some basic examples of how to get started writing unit tests.

Getting Started With The Code

This is the test HTML file. Originally I wanted to do this with pure JavaScript, but it quickly became apparent that it would be easier to use an existing testing framework and not try to reinvent the wheel. I chose QUnit. Many of their examples use jQuery so I will pull that in too. It also seems like the subject of mocking would come up pretty quickly testing a site with AJAX calls, and I would like to see how that works. I will use jQuery Mockjax for a mocking example. The purpose of this post is not to learn about or endorse any specific frameworks though. The examples below should be understandable without having used any of the frameworks before. I chose these three because they are light-weight and seem to be easy to get started with.

I include the js files for jQuery, QUnit, Mockjax (and json2 which it needs), and testable_js (my JavaScript code that I am trying to test). QUnit also needs a CSS file for displaying the results. Your JavaScript should only rely on the #qunit-fixture element. QUnit will reset elements inside the #qunit-fixture element after each test so that all tests can be atomic.

The JavaScript that I am testing is largely based on the QUnit Cookbook and this tutorial. I have some basic functions, a function that makes a HTTP request to Flickr, a function that makes a HTTP request to a not-yet-implemented page, a couple functions that modify the DOM, and a key logger function.

To test these I need to be able to test synchronous and asynchronous callbacks, simulate user actions (like pressing keys), keep the tests atomic (since the DOM will be modified more than once), and mock an HTTP request (to the not-yet-implemented page).

Adding Actual Tests

Here I define a module with two tests. A module is just a logical grouping. Each test modifies the DOM and I check that only one change was made, so as I mentioned earlier they have to be atomic. This is also why I suggested that parameterized functions are necessary.


Here I test my key logger by simulating a user pressing keys. This is obviously a contrived example, but I thought it was pretty cool nonetheless and is what convinced me to use jQuery.


Here I've mocked an AJAX call and have it return an error status code. The test is that my error handling callback is called. The test would fail after half a second if there was no response.


Here I've again mocked an AJAX call and have it return a success status code. The test is that my success handling callback is called. This is pretty cool and seemingly very useful too. You can set the response time, response body, headers, etc. too. Someone can then work on the front-end of a site before the back-end is finished.

Results

Look here to see my complete test HTML file. I included a few other tests and show how to do setup/teardown for a module, which is another nice QUnit feature. When I open tests.html in a web browser, this is the report I see:


It's still a big leap to go from these simple examples to testing a real site, but I learned that the barrier to entry for writing JavaScript unit tests is lower than expected and you can get going in a couple hours with existing open-source tools. I think I would try to include something like this in a future JavaScript project.