Saturday, June 21, 2014

Partly Cloudy Distributed Transactions

General Background

Distributed transactions are atomic transactions involving two or more resources, usually residing on separate machines. Each resource is transactional and there is also a transaction manager, like an application server, that manages the global transaction. The resources could be multiple relational databases, JMS queues, JCA resource adapters, or some combination of these.

The four ACID properties still apply to distributed transactions:

1. atomicity - a transaction is all or nothing
2. consistency - any transaction brings the database from a valid state to a valid state
3. isolation - concurrent transactions result in the same system state as transactions executed serially
4. durability - once a transaction has been committed, it will remain so

The terms "2PC" and "XA" seem to sometimes be used interchangeably with "distributed transaction" and I want to note the distinctions here. Two-phase commit (2PC) is a common algorithm for coordinating the participants of a distributed transaction. It ensures that even if part of the system crashes, the distributed transaction can still be committed or rolled back. The XA specification describes the interface between the global transaction manager and the local resource manager. All XA transactions are distributed transactions, but XA supports single-phase commit and two-phase commit.

To enable recovery from crashes and hardware failures, a transaction log containing the transaction history is written by the global transaction manager. When restarting it can use the log to replay in-doubt transactions and bring the system back to a consistent state. An example of an in-doubt transaction would be where the application server crashes after the first phase of the 2PC protocol (prepare) has completed, but before the second phase (commit) has completed. When the application server restarts the transaction can be completed based on the transaction log.

In The Cloud

Distributed transactions do not work well in the cloud for several reasons, some described here. My list is an attempt at summarizing that post with a mix of my own cloud experiences:

1. A node that was part of a transaction might be removed during down-scaling an never reappear.
2. Failures everywhere in the cloud are expected, so in-doubt transactions might be common because of network failures, network latency, or even the transactional resources being unavailable.
3. The application's file system is probably ephemeral so the transaction log needs to be stored in a database. As application instances are updated, if they are completely recreated, it will be difficult to associate a server with its transaction log.

An alternative to distributed transactions could be to make all operations idempotent. Then they could be retried at the application level without any problems. There are also other distributed transaction algorithms besides 2PC that potentially scale better.

Testing Transaction Recovery

It is not trivial to reliably create in-doubt distributed transactions because the transaction API UserTransaction only allows you to trigger the transaction beginning, the commit (2PC), and a rollback. What you really need is to stop the server part way through the commit, which means the hooks to do that will be vendor specific.

I found an easy to use tool called Byteman that runs as a -javaagent and can instrument, or insert extra code into a Java program at run time. If you know a class name and method name, say for the prepare method, you can have Byteman kill the server when the method exits like this.