The Train-Switch Pattern

I've often heard server maintenance compared to changing the tires on a truck going 80 miles an hour. You can't take down the datacenter to install a new build. You have to do it on the fly.

J2EE application servers like JBoss are supposed to be able to do this. However, in my experience, I've never seen it work reliably. We've had to fall back on using the load balancer to divert traffic away from one web server at a time in order to perform the upgrade. This is a costly and difficult chore. And if something goes wrong during the operation, you are without backup.

Since this is such a difficult problem, I've rewritten it -- in the true spirit of mathematics -- as one that is easier. The computer system is not one big monolith, but many discrete transactions, each with a beginning and an end. No longer are we working on one huge semi; we are working on several small train cars, all hurtling down the same track at incredible speed.

The load balancer solution diverts the traffic without waiting for (or causing) a lull. But the problem with the load balancer solution is that it operates on the socket boundary. Good socket optimizations -- like pooling, asynchronous communications, and multiplexing -- favor long-lived sockets. This means that the load balancer tends to take a long time to quiet the traffic to one machine. Sockets are also intrinsically hardware-oriented. A socket addresses an IP, which the router maps to a specific NIC. Any socket-based switching solution would place unnecessary constraints on the hardware required to run it. You can't, for example, scale it all the way down to one server. What we need is a switch on logical rather than physical boundaries.

Here's my solution:
I have started designing a train-switch into my solutions at the transaction boundary. As a request comes in (not a socket, but an individual request), an infrastructure component parses it and prepares it for the business logic. Then it looks in a business logic registry to obtain a pointer to the current handler. The request is forwarded to that handler, and the infrastructure listens for more requests.
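
To make that concrete, here is a minimal sketch of the dispatch side in Java. The names -- RequestHandler, HandlerRegistry, Dispatcher -- and the property-bag signature are my own illustration of the idea, not a finished implementation:

    import java.util.Map;
    import java.util.concurrent.atomic.AtomicReference;

    // A loose, property-bag contract between infrastructure and business logic
    // (see the fourth point below).
    interface RequestHandler {
        Map<String, Object> handle(Map<String, Object> request);
    }

    // The "switch": every new request goes through this single pointer.
    class HandlerRegistry {
        private final AtomicReference<RequestHandler> current = new AtomicReference<>();

        RequestHandler current() { return current.get(); }
        void replace(RequestHandler next) { current.set(next); }
    }

    class Dispatcher {
        private final HandlerRegistry registry;

        Dispatcher(HandlerRegistry registry) { this.registry = registry; }

        Map<String, Object> dispatch(Map<String, Object> request) {
            // Grab the handler once per request; a request already in flight
            // keeps the version it started with even if the pointer is swapped.
            RequestHandler handler = registry.current();
            return handler.handle(request);
        }
    }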

What makes this a train-switch is that the infrastructure also has a background thread waiting for updates. I could detect these updates as new DLLs or JARs in a special folder, but I prefer explicit notification. When the update is available, the infrastructure loads it and gets it ready. Then, in one quick synchronized action, it replaces the pointer in the registry. Now all new requests are directed toward the new version of the business logic.
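
Here is a sketch of what that updater might look like, again with illustrative names. It assumes the HandlerRegistry above, that the new build arrives as a JAR, and that the handler class name is known (say, from the notification itself). Loading it in its own class loader lets the old and new versions coexist:

    import java.net.URL;
    import java.net.URLClassLoader;
    import java.nio.file.Path;

    class HandlerUpdater implements Runnable {
        private final HandlerRegistry registry;
        private final Path newJar;          // delivered by explicit notification
        private final String handlerClass;  // name of the RequestHandler implementation

        HandlerUpdater(HandlerRegistry registry, Path newJar, String handlerClass) {
            this.registry = registry;
            this.newJar = newJar;
            this.handlerClass = handlerClass;
        }

        @Override
        public void run() {
            try {
                // Load the new business logic in its own class loader so the
                // old version keeps running until its last request finishes.
                URLClassLoader loader = new URLClassLoader(
                        new URL[] { newJar.toUri().toURL() },
                        getClass().getClassLoader());
                RequestHandler next = (RequestHandler) loader
                        .loadClass(handlerClass)
                        .getDeclaredConstructor()
                        .newInstance();

                // The train-switch itself: one atomic pointer swap.
                registry.replace(next);
            } catch (Exception e) {
                // If the new build fails to load, leave the old handler in place.
                e.printStackTrace();
            }
        }
    }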

What happens to the requests already in progress? They continue unharmed. The train-switch is only at the head of the track. Both business logic components can run in parallel. The switch is made while the system is up with absolutely no interruption.

There are a few things that you need to do to make this possible. First, avoid singletons. Singletons are convenient, but you can't have two different versions of a singleton running in parallel. Use dependency injection instead: rather than having an object fetch the singleton instance it needs, give that object a pointer to its service provider. That way you tell it explicitly which instance it depends upon, and two can run in parallel with no problem.
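
A hypothetical example of the difference, using the RequestHandler interface from the sketch above. The handler never calls anything like AuditLog.getInstance(); the caller hands it the instance it should use:

    import java.util.Map;

    interface AuditLog {
        void record(String message);
    }

    class OrderHandler implements RequestHandler {
        private final AuditLog audit;

        // The caller decides which AuditLog instance this handler uses, so the
        // old and new handler versions can each hold their own provider.
        OrderHandler(AuditLog audit) {
            this.audit = audit;
        }

        @Override
        public Map<String, Object> handle(Map<String, Object> request) {
            audit.record("order received: " + request.get("orderId"));
            return Map.of("status", "accepted");
        }
    }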

Second, business components must be self-contained, except at the boundaries. If a business component reaches outside of itself for its dependencies, then you can no longer upgrade those dependencies. There are only two boundaries that the request may cross: the business entry point and the data access layer. Define an interface for the business entry point, and use dependency injection to give the business logic access to the database connection pool.
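
Sketched in the same style, the business entry point is the RequestHandler interface, and the data access boundary is an injected connection pool (javax.sql.DataSource here, purely as an example):

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.Map;
    import javax.sql.DataSource;

    class InvoiceHandler implements RequestHandler {
        private final DataSource pool;  // the only dependency reached from outside

        InvoiceHandler(DataSource pool) {
            this.pool = pool;
        }

        @Override
        public Map<String, Object> handle(Map<String, Object> request) {
            try (Connection connection = pool.getConnection()) {
                // ...business logic against the shared pool goes here...
                return Map.of("status", "ok");
            } catch (SQLException e) {
                return Map.of("status", "error", "reason", String.valueOf(e.getMessage()));
            }
        }
    }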

Third, you need to have one entry point into the application. You could try to use the train-switch pattern to swap out socket listeners, but then you are back to the problems with the load-balancer solution. The socket boundary is bigger than the problem requires, so you will have old code running longer than necessary. Also, the listener switch cannot be performed atomically. If you take down the first listener and then bring up the second, there is a chance that a socket connection will be refused. If you try the opposite order, you will find the address already in use. The only way to achieve an atomic train-switch is to have one consistent listener that lives outside of the more volatile business logic.

Fourth, define a loose interface for the business logic. Since the listener cannot be replaced, its contract with the business layer cannot be changed. So define a contract that is flexible enough to handle changes to business needs. Property bags work well on this boundary; strict business method or document definitions do not. You should definitely have a strict contract between the client and the server, but the train-switch is not the place to enforce it.
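
A short, hypothetical usage of the classes from the earlier sketches shows why the property bag works at this boundary: the payload can grow new keys from one version of the business logic to the next without the listener's contract ever changing:

    import java.util.List;
    import java.util.Map;

    class Example {
        public static void main(String[] args) {
            HandlerRegistry registry = new HandlerRegistry();
            registry.replace(request -> Map.<String, Object>of("status", "ok"));  // version 1
            Dispatcher dispatcher = new Dispatcher(registry);

            // "lineItems" might be a key that only version 2 understands; the
            // listener and dispatcher pass it through without any change.
            Map<String, Object> request = Map.of(
                    "operation", "createInvoice",
                    "customerId", 42,
                    "lineItems", List.of("widget", "sprocket"));
            System.out.println(dispatcher.dispatch(request));
        }
    }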

I'm not sure why application servers like JBoss fail so miserably at reliable "hot-swapping". The container implements the listener and calls servlets on request -- not socket -- boundaries. And servlets define an extremely flexible interface. Perhaps they have problems with external dependencies. At any rate, I find it best to inject dependencies across interface boundaries and implement the train-switch pattern myself.

2 Responses to “The Train-Switch Pattern”

  1. earwicker Says:

    "I’ve often heard server maintenance compared to changing the tires on a truck going 80 miles an hour"

    They just say that cuz it sounds macho. It's really just a little bit of work on Neurath's Boat.

  2. Michael L Perry Says:

    Or is it the Ship of Theseus?
