Saturday, December 26, 2020

Rust to WebAssembly and Fibonacci Numbers

WebAssembly (wasm) is a virtual assembly language: a binary instruction format that can be executed at near-native speed by JavaScript engines like V8, which means it can run in a browser or a Node.js app. It isn't intended as a JavaScript replacement; instead, it works alongside JavaScript for performance-critical pieces of code. You can make calls from JavaScript to WebAssembly and vice versa. It can be useful for games, VR, and AR, for example.

Rust is a fast and memory-efficient programming language with interesting type and thread-safety characteristics. I've never used it, but have wanted to check it out for a while. Along with Go and C, it can easily be compiled into WebAssembly. This post just touches the surface of Rust.

First, to learn how to use Rust in a JavaScript app, I followed Compiling from Rust to WebAssembly. The guide walks you through compiling Rust code, generating a JavaScript wrapper package, then using npm and webpack to run it. The end user only needs a modern browser to execute the code originally written in Rust, none the wiser.

Building on that, as a simple experiment to see how much faster Rust code compiled to WebAssembly can be versus vanilla JavaScript, I compared calculating the nth Fibonacci number in each. I used the inefficient recursive algorithm intentionally, wanting it to be slow.

The function in Rust is the textbook recursion, exported to JavaScript with wasm_bindgen; a minimal version looks like this:
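
use wasm_bindgen::prelude::*;

// Intentionally naive recursive Fibonacci; exponential time by design.
// The exported name here is illustrative and may differ from the original.
#[wasm_bindgen]
pub fn fibonacci(n: u32) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}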

And the JavaScript code that calls it, times it, and compares it against the same naive recursion written in plain JavaScript looks something like this:
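
import { fibonacci } from "wasm-fibonacci"; // package name generated by wasm-pack; yours may differ

// The same naive recursion in plain JavaScript for comparison.
function fibonacciJs(n) {
  return n < 2 ? n : fibonacciJs(n - 1) + fibonacciJs(n - 2);
}

const n = 35;

let start = performance.now();
fibonacci(n);
console.log(`wasm: ${(performance.now() - start) / 1000} s`);

start = performance.now();
fibonacciJs(n);
console.log(`js: ${(performance.now() - start) / 1000} s`);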

Surprisingly, although maybe it shouldn't be, the WebAssembly version of this simple function is well over an order of magnitude faster when executed in the browser. At n=35, where the two really start to diverge, I'm seeing about 0.06 seconds versus 3.5 seconds.

I don't often have to write CPU-intensive code like this in JavaScript, but with such a stark performance difference possible it's a good tool to have in the toolbox. 

Wednesday, November 4, 2020

GraphQL API Gateway Prototype

I've been meaning to check out GraphQL for a while. At work I'm seeing calls to REST APIs being made to get one small piece of data, most of the response discarded. Or, multiple HTTP requests, usually sequential, whose responses must be combined to do something useful with the retrieved data. Their granularity is wrong for the use case, but maybe not for someone else's. Can GraphQL help with this? It seems like it can. Model the business domain as a graph, like our mental models and object-oriented programming. Engines for many languages. A type system to enable good developer tools. It sounds great.

At the risk of adding another network hop, another moving part, another layer, it would be nice to be able to call something to get exactly the data I need without changing the existing APIs. Mobile devices or slow internet connections could benefit from the reduced number of round trips. This extra layer could abstract away the different (read: poor) uses of HTTP status codes and the different response body styles. What I think I really want is a GraphQL API gateway.

The API gateway is a single entry point for all clients (the backends for frontends variation of the pattern has a different gateway optimized for each type of client). Requests can be proxied straight through to a single microservice or fanned out to several microservices. Responses can be aggregated and/or modified. This is the overlap point with GraphQL and why I think they would go well together. The API gateway can also centralize several cross-cutting concerns like throttling, routing, circuit-breaking, input validation, authentication (authorization stays in the business logic), etc.

Apollo is the GraphQL implementation I went with to prototype this. From their tutorial I started with the rocket launch API and extended it with a made-up weather API to see how the two could be chained together. To the client it's seamless: both "services" are part of the same graph.

This is a lot of code to show in one shot, but I'll explain it below and then show some example GraphQL queries.

First, typeDefs defines the schema. You can query a list of rocket launches or a single launch by ID. dataSources specifies, of course, where the data is coming from, like a database or REST API. resolvers stitches these together. Notice how the Weather type takes a site from its parent, a Launch. This is how they are linked, or chained together.
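
A sketch of just the resolver wiring described above (the launch methods follow Apollo's tutorial; the weather data source and its getForecastBySite method are stand-ins for the made-up weather API):

const resolvers = {
  Query: {
    launches: (_, __, { dataSources }) => dataSources.launchAPI.getAllLaunches(),
    launch: (_, { id }, { dataSources }) =>
      dataSources.launchAPI.getLaunchById({ launchId: id }),
  },
  Launch: {
    // The parent Launch supplies the site that the weather lookup needs.
    weather: (launch, _, { dataSources }) =>
      dataSources.weatherAPI.getForecastBySite(launch.site),
  },
};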

With the Apollo server running you can try it out at http://localhost:4000/ in a browser. The GraphQL queries are on the left and the responses are on the right.


It's cool to see the different responses without having to write any code specifically to handle them. A natural extension to this prototype would be to add mutation resolvers so clients could also update the graph.

Finally, the ThoughtWorks Tech Radar cautions against trying to create a universal, canonical, centralized data model. I think the bounded context ideas from DDD would apply. In their zero trust architecture blurb they also mention that a network perimeter isn't a security boundary anymore. That makes me question thinking of the API gateway as a place to shift all those cross-cutting concerns to. Do users have to go through the gateway or can they hit microservices directly? They mention service mesh as a solution and that would seem to be an API gateway alternative, but they aren't mutually exclusive either.

Saturday, June 13, 2020

Column-oriented Database Basics

This is a short post on column-oriented databases. I'll barely scratch the surface, but among the types of NoSQL databases—document, key-value, column-oriented and graph—I've always thought column-oriented was the most difficult to wrap my head around. Hopefully we can get past that initial hurdle here, run a few queries, and they will seem like less of a mystery.

A relational database is optimized for retrieving rows of data. This works well for transactional applications. A column-oriented database is optimized for retrieving columns of data. This works well for analytical applications and some queries, like aggregations, become really fast because much less data needs to be read from disk to retrieve the whole column. There seems to be a lot of overlap between column-oriented databases and data warehouses.

A relational database would store 3 rows of data like this:

1:a,b,c;2:d,e,f;3:g,h,i

While a column-oriented database would store the same data like this:

a:1,d:2,g:3;b:1,e:2,h:3;c:1,f:2,i:3

For a low-cardinality column, compression algorithms work very well. Something like a:2,a:3 becomes a:2,3. In some ways this is like normalizing a relational database to reduce data duplication, but I don't think that enables the same level of compression or gives you the same data locality benefits.
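
To make the layouts concrete, here is a small Python sketch (not tied to any particular database engine):

rows = {1: ("a", "b", "c"), 2: ("d", "e", "f"), 3: ("g", "h", "i")}

# Row-oriented: each row's values are stored together.
row_store = [(row_id, values) for row_id, values in rows.items()]

# Column-oriented: each column's values are stored together, tagged with row ids.
column_store = [[(values[col], row_id) for row_id, values in rows.items()]
                for col in range(3)]

# A low-cardinality column compresses well: repeated values collapse to a
# single (value, row ids) entry, similar to run-length encoding.
def compress(column):
    compressed = {}
    for value, row_id in column:
        compressed.setdefault(value, []).append(row_id)
    return compressed

print(compress([("a", 2), ("a", 3), ("b", 4)]))  # {'a': [2, 3], 'b': [4]}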

Of course column-oriented databases aren't good for all workloads. They aren't optimized for queries that touch many fields. Writes can also be slow since they aren't just appending to the end of a file.

So where do they fit into a big data architecture? When thinking about this I remembered the article How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. It avoids mentioning specific products, but the way I interpret it is that the first generation was dedicated data warehouse products where you did ETL. The second generation was more ELT, maybe using Hadoop and Spark. The third generation, the eponymous data mesh, unifies batch and stream processing, perhaps adding Kafka to the above mix. Using the CQRS pattern, for example, a column-oriented database could fulfill some of the Q (query) operations as a sort of data warehouse.

For my first column-oriented database adventure, after that background research, I chose Amazon Redshift and I followed their Getting Started with Amazon Redshift guide. It only took about an hour to spin up a Redshift cluster, upload their sample data to an S3 bucket, copy that into Redshift, then run a few SQL queries.

The big surprise and takeaway was that this felt just like using a relational database. The data I uploaded was in text files (basically CSV files) and was used to create tables. The queries I ran were SQL queries that joined tables, selected different fields, and aggregated the results. The details of how the database works under the covers are interesting, but don't inform its usage. There isn't much mystery after all.

SELECT * FROM event;

SELECT firstname, lastname, total_quantity
FROM (SELECT buyerid, SUM(qtysold) total_quantity
      FROM sales
      GROUP BY buyerid
      ORDER BY total_quantity DESC
      LIMIT 10) Q, users
WHERE Q.buyerid = userid
ORDER BY Q.total_quantity DESC;

A couple of other column-oriented databases you may have heard of are Cassandra and HBase (strictly speaking, these are wide-column stores rather than analytics-focused columnar databases). NoSQL databases are built to scale horizontally, so you also need to consider how the CAP theorem applies to your situation when choosing among them; each makes different trade-offs in addition to offering unique features. The best choice will be highly data- and workload-specific.

Saturday, April 25, 2020

Dynamic Programming

Optimal substructure means that the optimal solution can be constructed from optimal solutions of sub-problems. Think recursion. When we combine optimal solutions to non-overlapping sub-problems the algorithmic technique is called divide and conquer. Think of a merge sort where each sub-sort can happen independently and they are combined at the end. When we combine optimal solutions to overlapping sub-problems the algorithmic technique we can use is dynamic programming. Because the sub-problems overlap, we can improve the running time by remembering the results of already computed sub-problems and not computing them again.

An easy way to visualize this is in calculating Fibonacci numbers. The standard recursive solution can be memoized to significantly improve time complexity at the cost of using more space to store sub-problem results. This is top-down dynamic programming—recursion and memoization. Top-down dynamic programming can be easier to program and anecdotally it is more commonly used.
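
In Python, for example, a memoized version can be as small as this:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each sub-problem is computed once; repeat calls are served from the cache.
    return n if n < 2 else fib(n - 1) + fib(n - 2)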

Bottom-up dynamic programming is instead iterative: we solve the smallest sub-problems first and build them into bigger ones. Without the overhead of recursive calls we don't necessarily increase space complexity, but we still get the decrease in time complexity. For calculating the nth Fibonacci number, a bottom-up version (here in Python) looks like this:
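
def fib(n):
    # Keep only the last two values instead of a full table of results.
    previous, current = 0, 1
    for _ in range(n):
        previous, current = current, previous + current
    return previous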

This is time complexity O(n) and space complexity O(1), a huge improvement on the naive recursive solution that's O(2^n) time and O(n) space. For details on big-O of recursive algorithms read this.

There are many other interesting applications of dynamic programming.

The knapsack problem, for example, has a top-down solution, but I think the bottom-up solution is especially appealing. Using a matrix to store sub-problem solutions, we can turn the O(2^n)-time recursive algorithm into one that takes O(nW) time and space, where n is the number of items and W is the knapsack capacity. A Python sketch:
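
def knapsack(values, weights, capacity):
    n = len(values)
    # table[i][w] = best value using the first i items with capacity w
    table = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for w in range(capacity + 1):
            table[i][w] = table[i - 1][w]  # skip item i
            if weights[i - 1] <= w:        # or take it, if it fits
                table[i][w] = max(table[i][w],
                                  table[i - 1][w - weights[i - 1]] + values[i - 1])
    return table[n][capacity]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220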

But wait, there's more. Each iteration only needs the previous row of the matrix, which means we don't need the full matrix at all. The algorithm can be further optimized to use only O(W) space:
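
def knapsack(values, weights, capacity):
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Iterate capacities right to left so each item is used at most once.
        for w in range(capacity, weight - 1, -1):
            best[w] = max(best[w], best[w - weight] + value)
    return best[capacity]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220, same as the matrix version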

Not, in my opinion, an obvious solution by any means, but a very elegant one.

Saturday, April 11, 2020

Quantum Teleportation

I recently checked out the IBM Quantum Experience and was really impressed with the content available for getting started with quantum computing. There are simulators for executing quantum circuits, runnable both locally and in the cloud, but the highlight is being able to submit jobs to run on a real quantum computer. To create these circuits we use Qiskit, IBM's Python framework for quantum programming.

For gaining a better understanding of Qiskit I recommend starting with the Coding with Qiskit video series. Following along with the videos and coding my first circuit then led down a rabbit hole of trying to understand what I was really building and how it could be useful.

Quantum computing uses phenomena from quantum mechanics like superposition and entanglement to perform computation. In classical computing we have bits which can either be on or off, 1 or 0. In quantum computing we have qubits which can be on or off at the same time, or somewhere in between. This is superposition. I like the coin analogy from Quantum computers will change the world (if they work) of this being a coin flip (classical) versus a spinning coin (quantum).

A qubit is represented as a vector of two complex numbers (amplitudes) with length 1, where the squared magnitudes of the amplitudes give the probabilities of measuring a 0 or a 1. Measurement destroys quantum state, collapsing the qubit to a classical bit, and it cannot be reversed. A quantum state also cannot be copied the way a classical bit's state can. Instead, quantum teleportation transfers the state of one qubit to another via a linkage known as entanglement: even when moved far apart, measurements of entangled particles remain correlated. One coin flip mirroring another, linked flip.

Superposition and entanglement let a quantum computer work on many possibilities at once, which can make some classically intractable problems tractable. Quantum algorithms exploit these properties to solve problems like factorization or search faster than classical algorithms. Qiskit has some amazing Jupyter notebooks showing how these algorithms can be used, and several of them looked really interesting to me.

My first circuit is much simpler, of course, but I feel good about getting it running. Based on the videos linked above, I teleport a 1 state from one qubit to another and then measure it to confirm. I simulate this locally, then in the cloud, and finally run it on one of IBM's quantum computers. The run on the real quantum computer includes some noise, a result of the imperfect nature of today's hardware.
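
A sketch of that kind of teleportation circuit in Qiskit (using the API as it looked around the time of this post; the register names and shot count are mine):

from qiskit import Aer, ClassicalRegister, QuantumCircuit, QuantumRegister, execute

qr = QuantumRegister(3, "q")       # q0: state to teleport, q1 and q2: entangled pair
crz = ClassicalRegister(1, "crz")  # classical bits for the Bell measurement
crx = ClassicalRegister(1, "crx")
out = ClassicalRegister(1, "out")  # measurement of the teleported state
circuit = QuantumCircuit(qr, crz, crx, out)

circuit.x(qr[0])                   # put q0 in the |1> state we want to teleport

circuit.h(qr[1])                   # entangle q1 and q2 into a Bell pair
circuit.cx(qr[1], qr[2])

circuit.cx(qr[0], qr[1])           # Bell measurement of q0 and q1
circuit.h(qr[0])
circuit.measure(qr[0], crz[0])
circuit.measure(qr[1], crx[0])

circuit.x(qr[2]).c_if(crx, 1)      # corrections on q2 based on the classical bits
circuit.z(qr[2]).c_if(crz, 1)

circuit.measure(qr[2], out[0])     # should read 1 (up to hardware noise)

backend = Aer.get_backend("qasm_simulator")
counts = execute(circuit, backend, shots=1024).result().get_counts()
print(counts)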



UPDATE:

TensorFlow Quantum is another library that could be used if you want to leverage Google's quantum computing resources.

Saturday, March 28, 2020

Geometric Brownian Motion

Geometric Brownian Motion is a stochastic process that can be used to model stock prices. It's related to random walks and Markov chains. This is a much different way to look at time series than what I explored in my Time Series Predictions post and given the recent market volatility it seems especially timely to take a closer look at it.

There are detailed explanations of the math and theory elsewhere; I've just written some code to see it in action. It generates 100 simulations for the S&P 500 ETF with the ticker SPY, modeling the past year. A sketch of that simulation, with placeholder parameters standing in for values estimated from SPY's actual history:
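
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)

s0 = 320.0     # starting price; a placeholder for SPY one year earlier
mu = 0.08      # annualized drift, placeholder estimate
sigma = 0.15   # annualized volatility, placeholder estimate
days = 252     # trading days in the modeled year
dt = 1.0 / days
simulations = 100

# GBM: S_t = S_0 * exp((mu - sigma^2 / 2) * t + sigma * W_t), simulated step by step.
paths = np.zeros((simulations, days + 1))
paths[:, 0] = s0
for t in range(1, days + 1):
    z = np.random.standard_normal(simulations)  # swap in a Laplace draw to fatten the tails
    paths[:, t] = paths[:, t - 1] * np.exp((mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z)

plt.plot(paths.T, linewidth=0.5)
plt.title("SPY: 100 GBM paths over one year")
plt.show()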

What's especially interesting today is that the current value of SPY is on the extreme lower end of what we see simulated. Even increasing 100 to 1000 or more, we are still on a rare path.


There are some known problems with the model that may help explain this. First, volatility isn't really constant in financial markets. Second, the randomness in GBM is normally distributed but we know that stock returns are not. They have fatter tails, or higher kurtosis. Also stock prices react to specific geopolitical events that definitely are not random, even opening a day at a different level than the previous day's closing.

Out of curiosity, and because the second issue is the easiest to tweak, I replaced the normal distribution with a Laplace distribution and saw a slightly wider dispersion of results (in both directions). In reality, though, the 52-week range is 218.26 to 339.08, so we still aren't capturing the extremes witnessed.


Any ideas on what's going on here? It must have something to do with the massive increase in volatility at the end of what was otherwise a calm year. Please comment.

Saturday, March 14, 2020

Time Series Predictions

Time series data is just a series of observations ordered in time. As simple as it sounds, there are important differences when analyzing time series data vs. cross-sectional data. This post will attempt to cover enough basics, from statistics and machine learning, to get to a point where we can forecast future observations.

First, some terminology. Data is autocorrelated when there are similarities between an observation and previous observations. Seasonality is when the similarities occur at regular intervals. Trend is a long-term upward or downward movement. And data is stationary when its statistical properties, like mean and variance, do not change over time.

I'll generate data with these characteristics to use for the rest of the post. A sketch of that setup, combining a trend, a roughly monthly seasonality, and autocorrelated noise:
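
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)
n = 365
index = pd.date_range("2019-01-01", periods=n, freq="D")

trend = np.linspace(0, 10, n)                               # long-term upward movement
seasonality = 5 * np.sin(2 * np.pi * np.arange(n) / 30)     # repeats roughly monthly
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.7 * noise[t - 1] + np.random.normal()      # autocorrelated noise

series = pd.Series(trend + seasonality + noise, index=index)
series.plot()
plt.show()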



Detecting stationarity

While time series data is usually not stationary, stationarity is important because most statistical models and tests assume it. The Augmented Dickey-Fuller (ADF) test is a parametric test that can be used to detect stationarity. The null hypothesis is that the data is not stationary (it has a unit root), so you are looking to reject it with a certain level of confidence.
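
With statsmodels, for example, using the series generated above:

from statsmodels.tsa.stattools import adfuller

statistic, p_value, *rest = adfuller(series)
# A small p-value (say, below 0.05) rejects the null hypothesis of non-stationarity.
print(statistic, p_value)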

There are other, non-parametric stationarity tests that make weaker distributional assumptions, but they are beyond the scope of this post.

Transformations

By applying different transformations to our data we can make non-stationary data stationary. One approach is to subtract the rolling mean, or a weighted rolling mean favoring more recent observations, from the data. Another approach is called differencing: subtract the observation from some time period ago, like a month or a week earlier, from each observation.
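
With pandas, continuing with the generated series, these transformations are one-liners:

detrended = series - series.rolling(window=30).mean()

# A weighted rolling mean that favors recent observations.
weighted = series - series.ewm(halflife=30).mean()

# Differencing: subtract the value from 30 days earlier.
differenced = series.diff(periods=30).dropna()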



Forecasting


Special care must be taken when splitting time series data into a training and a test set. The order must be preserved; the data cannot be shuffled. For cross-validation it is also important to evaluate the model only on future observations, so a variation of k-fold is needed.
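
scikit-learn's TimeSeriesSplit is one such variation: every fold trains on an initial stretch of the series and evaluates on the observations that immediately follow it.

from sklearn.model_selection import TimeSeriesSplit

for train_index, test_index in TimeSeriesSplit(n_splits=5).split(series):
    train, test = series.iloc[train_index], series.iloc[test_index]
    # fit on train, evaluate on the strictly later test observations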

SARIMA

Seasonal autoregressive integrated moving average (SARIMA) is a model that can be fitted to time series data via a Kalman filter. It accounts for seasonality and trend by differencing the data; however, it is a linear model, so an observation needs to be a linear combination of past observations. A log or square root transform, for example, might help make the time series linear.
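
A minimal fit with statsmodels' SARIMAX, using the final train/test fold from the split above (the orders below are placeholders rather than tuned values; the seasonal period matches the roughly 30-day cycle in the generated data):

import statsmodels.api as sm

model = sm.tsa.statespace.SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 30))
results = model.fit(disp=False)
forecast = results.forecast(steps=len(test))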



RNN

A recurrent neural network (RNN) with long short-term memory (LSTM) is an alternative to SARIMA for modeling time series data. At the cost of complexity, it can handle non-linear data or data that isn't normally distributed.
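
A minimal Keras sketch, framing the problem as predicting the next value from a sliding window of previous values (the window size and layer sizes here are arbitrary, and train comes from the split above):

import numpy as np
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

def make_windows(values, window=30):
    # Each sample is `window` consecutive observations; the target is the next one.
    x, y = [], []
    for i in range(len(values) - window):
        x.append(values[i:i + window])
        y.append(values[i + window])
    return np.array(x)[..., np.newaxis], np.array(y)

x_train, y_train = make_windows(train.values)

model = Sequential([
    LSTM(32, input_shape=(x_train.shape[1], 1)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=20, batch_size=16, verbose=0)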



I didn't put a lot of effort into tuning these models, or coming up with additional features, and they aren't perfect, but we can start to get a feel for how they work. The SARIMA model looks underfit. It did, however, nicely ignore the randomness in the data. The RNN model clearly overfits the data and more work would be needed to get a smoother curve.

This was my first attempt at working with SARIMAX and RNNs so any feedback is appreciated.