Archive for September, 2009

Table as queue

Wednesday, September 30th, 2009

To play along at home, please download the script for SQL Server 2005. It may work on earlier versions, but I haven't tried it.

Yesterday I said that the data is the queue. It is a mistake to separate the processing of data from the data itself. I proposed an architecture in which there is no workflow orchestration, there are only autonomous worker processes. These processes pick up work based on attributes of the objects on which they act, not based on any external queue.

There are a small number of database patterns that make this possible. One that I often use is Table as Queue.

Set up a test environment
I created a test database called "Queues" to test this pattern. You can find the script attached. As it is now, this script is incorrect. We will correct it as we go along.

The database has two tables. The Order table holds orders that we want to process. If the order hasn't been processed, it has no confirmation number. If it has, it does. The Log table holds a log of all the steps the order processor has gone through. This will help us see how well we are doing.

There are three stored procedures that we'll use. The InsertLoop stored procedure generates some test data. Run this one first. The ProcessLoop stored procedure simulates the ERP process. You'll want to open two query windows and enter "EXEC ProcessLoop" in both of them. Finally, the ViewLog procedure lets us see how we did.
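If you'd rather not download the script, a schema along these lines is enough to follow the discussion. The column names here are my reconstruction; the attached script is authoritative.

```sql
CREATE TABLE [Order] (
    OrderId int IDENTITY PRIMARY KEY,
    OrderDetails varchar(max),
    ConfirmationNumber varchar(20) NULL  -- NULL means not yet processed
)

CREATE TABLE Log (
    LogId int IDENTITY PRIMARY KEY,
    OrderId int,
    Action varchar(20),
    SPID int,
    LogDate datetime
)
```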

Anatomy of a queue procedure
The ProcessLoop stored procedure begins by selecting one order to process. It returns only the ID of this order.

-- Get a work item
SET @orderId =
  (SELECT TOP 1
	OrderId
	FROM [Order]
	WHERE ConfirmationNumber IS NULL)

This ID has to be the primary key of the table. This is necessary for correct queuing behavior. If it is not the primary key, we will not get the correct locks later.

To process the order, other details about that record will be required. These should be fetched separately by the returned ID. Do not try to combine these steps.
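In other words, the claim and the fetch are two queries. OrderDetails here is a stand-in for whatever columns your processor actually needs.

```sql
-- Claim the work item: primary key only
SET @orderId =
  (SELECT TOP 1 OrderId
   FROM [Order]
   WHERE ConfirmationNumber IS NULL)

-- Fetch the details separately, by primary key
SELECT @orderDetails = OrderDetails
FROM [Order]
WHERE OrderId = @orderId
```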

Default isolation level
Right now, the ProcessLoop stored procedure uses the default transaction isolation level (read committed). It starts a transaction, gets an order from the queue, processes it, and updates the confirmation number.
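Stripped of logging and variable declarations, each iteration of ProcessLoop follows this shape. This is a simplified sketch; the attached script is authoritative.

```sql
BEGIN TRANSACTION

-- Get a work item
SET @orderId =
  (SELECT TOP 1 OrderId
   FROM [Order]
   WHERE ConfirmationNumber IS NULL)

-- Simulate calling the ERP
WAITFOR DELAY '00:00:01'

-- Record the result, which also removes the order from the queue
UPDATE [Order]
SET ConfirmationNumber = @confirmationNumber
WHERE OrderId = @orderId

COMMIT TRANSACTION
```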

Hit F5 on both of your ProcessLoop query windows. While this is running, run ViewLog. You'll see something like this:

OrderID Action SPID Date/Time
89 Processing 52 2009-09-29 12:51:18.163
89 Processing 53 2009-09-29 12:51:18.337
89 Updating 52 2009-09-29 12:51:19.163
89 Finished 52 2009-09-29 12:51:19.163
89 Updating 53 2009-09-29 12:51:19.337
89 Finished 53 2009-09-29 12:51:19.337
90 Processing 52 2009-09-29 12:51:19.663
90 Processing 53 2009-09-29 12:51:19.837
90 Updating 52 2009-09-29 12:51:20.663
90 Finished 52 2009-09-29 12:51:20.663
90 Updating 53 2009-09-29 12:51:20.837
90 Finished 53 2009-09-29 12:51:20.837

Both of the SPIDs are processing all of the orders. This is clearly not what we want. We want only one SPID to process each order.

Higher isolation level
The other process is able to sneak in between the SELECT and the UPDATE and perform an UPDATE of its own. We want to ensure that the row that the process UPDATEs is the same version it SELECTed. The way to do this is to raise the isolation level to REPEATABLE READ. As the name implies, it ensures that two SELECTs within the same transaction will read the same data. What it also means is that an UPDATE in the same transaction will update the same version of the row. No intermediate UPDATEs will be allowed.

Make the following change to the ProcessLoop procedure:

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ

Then start the two ProcessLoops again. Before long you will get the error "Transaction (Process ID 53) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction." The other process will still be running.

Kill the remaining process. Close the window so that the transactions that got left open are rolled back. Then open a new window and type "EXEC ProcessLoop" again to prepare for the next step.

If you look at the log, you'll see only one of the SPIDs represented. The logs from the other one were rolled back with its transaction.

What happened is that both SPIDs SELECTed the same record, and under REPEATABLE READ each held onto its shared lock. When the first tried to UPDATE, it had to wait for the other's shared lock to be released. When the second also tried to UPDATE, SQL Server recognized the deadlock and rolled one of the transactions back.

Update lock
What we want to do is lock the record so that the first one to SELECT it is the one to process it. We'll do this by escalating our read lock to an update lock.

-- Get a work item
SET @orderId =
  (SELECT TOP 1
	OrderId
	FROM [Order] WITH (UPDLOCK)
	WHERE ConfirmationNumber IS NULL)

Now run the two processes and see what happens.

OrderID Action SPID Date/Time
138 Processing 52 2009-09-29 14:05:26.173
138 Updating 52 2009-09-29 14:05:27.173
138 Finished 52 2009-09-29 14:05:27.173
139 Processing 53 2009-09-29 14:05:27.173
139 Updating 53 2009-09-29 14:05:28.173
139 Finished 53 2009-09-29 14:05:28.173
140 Processing 52 2009-09-29 14:05:28.173
140 Updating 52 2009-09-29 14:05:29.173
140 Finished 52 2009-09-29 14:05:29.377

At first glance this looks good. Each order is processed by just one SPID. But look closely at the times. The two processes are taking turns. Each one enters SELECT, but then waits for the other one to commit its transaction before SELECT returns.

Since we've requested an update lock, the higher transaction isolation level makes no difference. We can turn it back down to the default.

Read past
We want each process to skip over locked rows. We do this by specifying the READPAST query hint.

-- Get a work item
SET @orderId =
  (SELECT TOP 1
	OrderId
	FROM [Order] WITH (UPDLOCK, READPAST)
	WHERE ConfirmationNumber IS NULL)

With this hint in place, we get some good behavior.

OrderID Action SPID Date/Time
583 Processing 53 2009-09-29 14:37:21.740
584 Processing 52 2009-09-29 14:37:22.020
583 Updating 53 2009-09-29 14:37:22.740
583 Finished 53 2009-09-29 14:37:22.740
584 Updating 52 2009-09-29 14:37:23.020
584 Finished 52 2009-09-29 14:37:23.020

Now the two processes are running concurrently. Each order is handled by only one process. There are no deadlocks.

With this pattern, the data itself can be used as a queue. But you have to remember the details:

  • SELECT TOP 1 just the primary key of the queue table.
  • Use both the UPDLOCK and READPAST query hints on the SELECT.
  • Fetch additional information by ID in a separate query.
  • After operating on the data, update the record to exclude it from the original WHERE clause.
  • Perform both the SELECT and the UPDATE within the same transaction.
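Putting all five rules together, the skeleton of a queue procedure looks like this. It is a sketch; adapt the column names to your own queue table.

```sql
BEGIN TRANSACTION

-- 1. SELECT TOP 1 just the primary key,
-- 2. with both the UPDLOCK and READPAST hints
SET @orderId =
  (SELECT TOP 1 OrderId
   FROM [Order] WITH (UPDLOCK, READPAST)
   WHERE ConfirmationNumber IS NULL)

IF @orderId IS NOT NULL
BEGIN
    -- 3. Fetch additional information by ID in a separate query
    SELECT @orderDetails = OrderDetails
    FROM [Order]
    WHERE OrderId = @orderId

    -- ... perform the work ...

    -- 4. Update the record so it no longer matches the WHERE clause
    UPDATE [Order]
    SET ConfirmationNumber = @confirmationNumber
    WHERE OrderId = @orderId
END

-- 5. The SELECT and the UPDATE commit in the same transaction
COMMIT TRANSACTION
```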

The data is the queue

Tuesday, September 29th, 2009

Our eCommerce system uses BizTalk to process orders. The BizTalk orchestration keeps track of the status of the order. Queues keep track of communications between BizTalk and external systems, like the ERP. The database is updated at all points in the process so the web site reflects the current order status.

This system is fragile.

Problem 1: Diagnostics are scattered
When something goes wrong, there are different ways of observing the system. The customer observes it from the web. They call support to report a problem. Support looks at the system in the Business Activity Monitor (BAM). They call ops. Ops observes the system through Health and Activity Tracking (HAT). If an activity is waiting on the receipt of a message, they will then inspect the appropriate queue. Each of these places holds part of the story.

Problem 2: Things get out of sync
In order to fix a problem, ops needs to take action at some level. They might cancel an orchestration. They might delete a message from a queue. They might manually enter an order into the ERP. When they do, the other parts of the system don't know about the change.

Problem 3: Blocking
One process is querying by status; another is updating it. One process is working on a request; another needs to skip it and go to the next. The workflow problem seems to attract blocking behaviors.

Problem 4: Distributed transactions
When the orchestration moves from one step to another, it needs to update the order status. Before it can wait on the receive queue, it must push to the send queue. It would be disastrous to perform one of these operations without also performing the other. So the BizTalk message box, the order database, and the outgoing and incoming message queues must be accessed within the same transaction. This brings the Distributed Transaction Coordinator into the picture, increasing fragility and exacerbating blocking behaviors.

Here's my solution
There is no need to manage workflow separate from data. This is the fundamental flaw in all workflow systems: BizTalk, WF, MSMQ, Service Broker, just to name a few. They all have their own data storage separate from the subject of the workflow.

If we relied upon the Order table instead of using BizTalk and external queues, we would have a more consistent system. We could diagnose the entire workflow from the database. Having one shared system of record, services would never get out of sync. Certain patterns could be used to eliminate unnecessary blocking. And all transactions would be local.

Broken down in this way, the system no longer has an overarching orchestration. You cannot open a designer and see the workflow. Instead, it has a number of autonomous processes that each move a unit of work further along the pipeline. Each process does its work in the same database.

There is one cardinal rule that makes this possible. The action that a process performs must also remove it from its queue.

A process runs a query to get a work item. It then performs some action on that work item. The outcome of that action is going to be a single database command. That command, whatever it is, must render the work item ineligible for the query.

It is tempting to perform two database commands in the same transaction: one to record the results of the action and another to update the work item status. Avoid this pattern: it allows the recorded result and the work item status to drift out of sync. Instead, design one database command that both records the result and disqualifies the work item from the query.

Confirmation number
Let's look at one unit of the order processing orchestration. Take the part of the system that submits the order to the ERP system. This process sends an order to a web service, and receives a confirmation number. So what if we designed this process to use the following query?

SELECT TOP 1 OrderId, OrderDetails FROM [Order] WHERE ConfirmationNumber IS NULL

And the command that records the result of the action was:

UPDATE [Order] SET ConfirmationNumber = ? WHERE OrderId = ?

This single UPDATE statement disqualifies the order from the SELECT statement. That means that the process will move the unit of work forward, but it can never be out of sync.

Unfortunately, this SELECT ... UPDATE pattern is prone to blocking. We'll look at some patterns that satisfy this cardinal rule without blocking in future posts.

Survey Sez

Monday, September 28th, 2009

We just did a review session in Q.E.D. Hour. For our review, we divided the group into two teams and played a game of Survey Sez.

I searched in vain for a PowerPoint version of the game that met my requirements. Unfortunately, while the subculture of PowerPoint game developers has produced several versions of the popular TV show, none of them worked the way I wanted them to. So I wrote my own using a tool better suited to the problem: WPF.

Survey Sez iz best played using a wireless number pad. You press the numbers 1-9 to reveal answers. And you press Enter and Backspace to navigate through the questions. The other keys on the number pad will eventually control other game options, like incorrect answers, bonus rounds, and game winning animation.

I posted the source code as-iz on CodePlex. Future versions will look better and include more game features. Eventually, I may even add some social networking aspect, like Twitter integration. After all, every system evolves until it becomes a Twitter client.

If you would like to contribute, please post a comment. We'll talk about what you can work on.

Homework assignment: Fix the Socket API

Friday, September 25th, 2009

Your task is to fix the .NET Socket API. This is as unhelpful as an API gets. There are several rules that a user of a Socket must follow. Create an API that proves that the caller is following all of the rules.

Here is the relevant part of the existing unhelpful API:

public class Socket : IDisposable
{
    public Socket(SocketInformation socketInformation);
    public Socket(AddressFamily addressFamily, SocketType socketType, ProtocolType protocolType);

    public Socket Accept();
    public void Bind(EndPoint localEP);
    public void Connect(EndPoint remoteEP);
    public void Connect(string host, int port);
    public void Dispose();
    public void Listen(int backlog);
    public int Receive(byte[] buffer);
    public int Send(byte[] buffer);
}

Here are the rules as documented in MSDN:

  • You cannot use a Socket returned from Accept to accept any additional connections from the connection queue.
  • You cannot call Accept, Bind, Connect, Listen, Receive, or Send if the socket has been closed.
  • You must call Bind before you can call the Listen method.
  • You must call Listen before calling Accept.
  • Connect(string host, int port) can only be called if addressFamily is either InterNetwork or InterNetworkV6.
  • Connect cannot be called if Listen has been called.
  • You have to either call Connect or Accept before Sending and Receiving data.
  • If the socket has been previously disconnected, then you cannot use Connect to restore the connection.

You may use as the basis of your proof the concepts that were discussed in previous Q.E.D. articles, including:

Please post responses in the comments, email them to mperry@adventuresinsoftware.com, or post links on your own blog. I will share my solution next week.

Relaxed locking on thread-safe cache

Friday, September 18th, 2009

In a prior post, I showed how callbacks can be used to prove prerequisites and thread-safety. The example was a cache. But the solution left the cache too constrained: while it was correct, it was too coarsely locked. All users of the class will line up behind one another, whether they are after the same data or not.

To resolve this problem, let's move the lock. Instead of locking on the cache while fetching the value, let's lock on the key. Two clients accessing the cache with the same key will line up, but clients with different keys will not.

But we can't lock on the key itself. We need to lock on something with identity. The identity of the key is not important. A client can create a new instance of the key having the same value, and that should qualify as the same key. So we have to look elsewhere.

Fortunately, the value of the key is used to find an object in the dictionary. That object does have identity. So we move the lock there.

public class CachedObject<TValue>
{
    private TValue _value;
    private bool _fetched = false;
    public DateTime DateCached { get; private set; }

    public CachedObject(DateTime dateCached)
    {
        DateCached = dateCached;
    }

    public TValue Value
    {
        get { return _value; }
        set { _value = value; _fetched = true; }
    }

    public bool Fetched
    {
        get { return _fetched; }
    }
}

public class Cache<TKey, TValue>
{
    private TimeSpan _expiry;
    private Dictionary<TKey, CachedObject<TValue>> _store =
        new Dictionary<TKey, CachedObject<TValue>>();

    public Cache(TimeSpan expiry)
    {
        _expiry = expiry;
    }

    public TValue Get(TKey key, Func<TKey, TValue> fetch)
    {
        CachedObject<TValue> found;

        lock (_store)
        {
            // If the object is not in the cache, or it is expired,
            // fetch it and add it to the cache.
            DateTime now = DateTime.Now;
            if (_store.TryGetValue(key, out found) &&
                found.DateCached + _expiry > now)
            {
                return found.Value;
            }

            found = new CachedObject<TValue>(now);
            _store[key] = found;
        }

        lock (found)
        {
            if (!found.Fetched)
                found.Value = fetch(key);
            return found.Value;
        }
    }
}

The Get method locks first on the cache. This protects the data structure, and ensures that we get just one instance of CachedObject. Once it has that, it locks on the found CachedObject so that clients after the same key line up.

Inside the lock, it checks to see whether the value has already been fetched. If not, this thread is the first in line, so it fetches the value. But if it has already been fetched, then the thread was not the first in line and can just use the fetched object.

The above is not a formal proof, and has many logical holes. Hopefully we'll have time to shore it up, and possibly find bugs in the code along the way. This is the kind of code that definitely needs to be proven. Sure, unit testing can give us some confidence, but no amount of unit testing can verify that this code is completely thread safe.

Continuous testing does not require static analysis

Thursday, September 17th, 2009

Ben Rady is the author of Infinitest, a continuous testing tool. This is an Eclipse or IntelliJ plug-in for Java that runs unit tests as you edit code. The key differentiator of Infinitest over other continuous testing plug-ins is that it runs only the unit tests that need to be run.

Ben describes how he accomplishes this feat. He uses static analysis of both the test and the code under test to determine whether tests need to run. Unfortunately, he's doing more work than he needs to. Continuous testing does not require static analysis.

Update Controls
I moderate an open source project called Update Controls for .NET. This project does data binding for WPF, Winforms, and Silverlight. It has absolutely nothing to do with a continuous testing plug-in for Java.

Or does it?

Update Controls automatically discovers the data that your properties depend upon. When that data changes, those properties are updated. You might think that it's doing some sort of static analysis to discover dependencies, but it's much simpler than that.

Say you have a dependent property "A". As the property getter for "A" is executing, Update Controls is watching. When another property called "B" is referenced, Update Controls says "Aha! A depends upon B." It's that simple.

Why does this work?
Most code is deterministic. If you call it providing the same inputs, it will produce the same outputs. Furthermore, it will do so in exactly the same way. So execute some code and see what set of properties it reads. Execute it again, and it will still read the same set of properties.

Now, change a property not in the set that was read. Does the behavior of the code change?

No! The code only observed one set of properties. It doesn't care what happens to any other properties. It will still read that same set, and it will still produce the same results.

OK, this time change a property that is in the set. Does the behavior of the code change?

You bet it does. In fact, it may change drastically. It may even read an entirely different set of properties next time. So all bets are off. Once you change a property upon which it depends, you have to dump the entire set and run the code again. Then you can gather up a brand new set.
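To make that concrete, here is a drastically simplified sketch of the technique. This is not Update Controls' actual implementation, just the shape of it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of dynamic dependency discovery.
public static class DependencyTracker
{
    [ThreadStatic]
    private static HashSet<string> _reads;

    // Every observable property getter calls this.
    public static void OnRead(string propertyName)
    {
        if (_reads != null)
            _reads.Add(propertyName);
    }

    // Run a computation and record the set of properties it read.
    public static ICollection<string> Capture(Action computation)
    {
        _reads = new HashSet<string>();
        computation();
        ICollection<string> dependencies = _reads;
        _reads = null;
        return dependencies;
    }
}
```

When any property in the captured set changes, throw the set away and run the computation again to gather a new one. Apply the same bookkeeping to the lines of code covered by a test, and you have everything a continuous test runner needs.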

Back to Infinitest
So, Ben, my advice to you. Drop the static analysis. Instead pick up a code coverage tool. Look at all of the lines of code covered by each test. If any of those lines changes, you have to rerun the test. If not, you don't. Simple as that.

Thread safety through callbacks

Wednesday, September 16th, 2009

While callbacks are good at proving prerequisites, they are also good at proving thread safety. Take, for example, this cache:

public class CachedObject<TValue>
{
    public TValue Value { get; private set; }
    public DateTime DateCached { get; private set; }

    public CachedObject(TValue value, DateTime dateCached)
    {
        Value = value;
        DateCached = dateCached;
    }
}

public class Cache<TKey, TValue>
{
    private TimeSpan _expiry;
    private Dictionary<TKey, CachedObject<TValue>> _store =
        new Dictionary<TKey, CachedObject<TValue>>();

    public Cache(TimeSpan expiry)
    {
        _expiry = expiry;
    }

    public TValue Get(TKey key)
    {
        CachedObject<TValue> found;

        // If the object is in the cache, check the date.
        if (_store.TryGetValue(key, out found) &&
            found.DateCached + _expiry > DateTime.Now)
        {
            return found.Value;
        }
        else
        {
            return default(TValue);
        }
    }

    public void Put(TKey key, TValue value)
    {
        _store.Add(key, new CachedObject<TValue>(value, DateTime.Now));
    }
}

The caller has to use it properly for it to work. They have to call Get before Put (prerequisite), and they have to synchronize around the two calls. If they don't, they run the risk of creating a race condition.

A race condition occurs when the outcome of an operation depends upon other operations happening in parallel. If one thread gets finished first, there is one result. If another thread finishes first, there is another. In this case, two threads could try to Get the same value. Finding none, they can both race to fetch the value and put it in the cache.

To prove that a race condition cannot occur, we can use a callback.

public class Cache<TKey, TValue>
{
    private TimeSpan _expiry;
    private Dictionary<TKey, CachedObject<TValue>> _store =
        new Dictionary<TKey, CachedObject<TValue>>();

    public Cache(TimeSpan expiry)
    {
        _expiry = expiry;
    }

    public TValue Get(TKey key, Func<TKey, TValue> fetch)
    {
        lock (_store)
        {
            CachedObject<TValue> found;

            // If the object is not in the cache, or it is expired,
            // fetch it and add it to the cache.
            DateTime now = DateTime.Now;
            if (!_store.TryGetValue(key, out found) ||
                found.DateCached + _expiry <= now)
            {
                found = new CachedObject<TValue>(fetch(key), now);
                _store[key] = found;
            }
            return found.Value;
        }
    }
}

This version of the cache takes a callback to fetch the value if it is not found. Synchronization is now built into the Get method. The caller doesn't need to synchronize around it. More importantly, we can prove the correctness of the code without seeing the caller. There is no way to get it wrong.
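A caller uses it like this. LoadOrderFromDatabase is a hypothetical data-access method, not part of the cache.

```csharp
Cache<int, Order> orders = new Cache<int, Order>(TimeSpan.FromMinutes(5));

// Fetches from the database only on a miss; otherwise returns the cached value.
Order order = orders.Get(orderId, id => LoadOrderFromDatabase(id));
```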

This cache still has problems. The lock is too broad. In preventing a race condition between two threads getting the same value, we've inadvertently blocked two threads getting different values. We'll fix that next.

Fixing an unhelpful API

Tuesday, September 15th, 2009

I started Q.E.D. Hour at work to discuss ways that we can use mathematical proof to have more confidence in our code. In last week's session, we walked through some improvements to a particular piece of code. Here is how it started:

public void DoWorkWithOrderService()
{
    using (ITransaction currentTransaction = new AcidTransaction(_container))
    {
        OrderService.Transaction = currentTransaction;
        OrderService.DoWork();
        currentTransaction.Commit();
    }
}

It is only possible to prove the correctness of this code based on the caller (seen here), and not the order service itself. If the caller forgets to initialize the transaction property, if the caller forgets to create the transaction in a using statement, or if the caller forgets to create a new instance per thread, then bad things can happen.

The order service is an unhelpful API. When you approach it, you have to use caution. If you use it in the wrong way, bad things will happen. If you're lucky, it will throw an exception. If you are unlucky, you won't catch errors until you release to production.

An unhelpful API must be approached with fear.

Parameters are prerequisites
The first step in improving this unhelpful API is to ensure that a transaction exists before a service method is called. We can do this by turning the property into a parameter.

Before:

public class Service : IService
{
    public ITransaction Transaction
    {
        get;
        set;
    }

    public void DoWork() { }
}

After:

public class ServiceFactory : IServiceFactory
{
    public IService CreateService(ITransaction transaction)
    {
        return new Service(transaction);
    }
}

public class Service : IService
{
    private ITransaction _transaction;

    public Service(ITransaction transaction)
    {
        _transaction = transaction;
    }

    public void DoWork() { }
}

You cannot get a service from the factory without providing a transaction. Therefore, you cannot make any service calls before providing a transaction. Q.E.D.

Callbacks are prerequisites
We have proven that a transaction is provided before a service method is called, but we haven't yet proven that the transaction gets disposed. We can take one more step toward making this a helpful, provable API using a callback.

public class ServiceFactory : IServiceFactory
{
    private ITransactionFactory _transactionFactory;

    public void CallService(Action<IService> action)
    {
        using (ITransaction transaction =
            _transactionFactory.CreateTransaction())
        {
            action(new Service(transaction));
        }
    }
}

A transaction factory is injected into the service factory (via constructor not shown here), and that is used to create a transaction as needed. That creation happens within a using statement in the service factory, not in the calling code. There is no way for the caller to forget to do this. Thus we have proven that the transaction gets disposed.
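The caller's side collapses to a single line; the transaction's lifetime is entirely the factory's responsibility.

```csharp
serviceFactory.CallService(service => service.DoWork());
```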

In these two steps, we have transformed an unhelpful API into a helpful one. The unhelpful API offered many options, most of them wrong. The helpful API offers only the options that are correct. It guides the caller in its proper use. It cannot be used incorrectly.

An unhelpful API cannot be proven. It must be approached with fear. A helpful API can be proven. It can be approached with confidence.

No public TypeConverter class?

Monday, September 14th, 2009

I just saw this error message:

'IEnumerable' type does not have a public TypeConverter class.

It referred to the following XAML:

<ListBox Grid.Row="0" ItemsSource="Contracts">
    <ListBox.ItemTemplate>
        <DataTemplate>
            <TextBlock Text="{Binding Name}"/>
        </DataTemplate>
    </ListBox.ItemTemplate>
</ListBox>

Do you see the problem? I forgot the {Binding} markup extension. Sometimes the solution does not fit the error message.
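For the record, the corrected attribute reads:

```xml
<ListBox Grid.Row="0" ItemsSource="{Binding Contracts}">
```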

Proof based on imperative programming

Monday, September 14th, 2009

Our Q.E.D. homework for last week was to prove some code to be correct. In particular, prove that:

  • The service has a transaction when it is called.
  • The transaction is disposed.
  • The code is thread safe.

The code in question was:

public Order GetOrder(int orderId)
{
   return InvokeGetOrder(() => OrderServiceProvider.GetOrderWithItems(orderId));
}

public Order InvokeGetOrder(Func<Order> function)
{
    Order order;
    using (IBusinessTransactionContext currentBusinessTransactionContext =
        new AcidBusinessTransactionContext(_container, false))
    {
        OrderServiceProvider.BusinessTransactionContext =
            currentBusinessTransactionContext;

        order = function();
        OrderServiceProvider.BusinessTransactionContext.Commit();
    }

    return order;
}

The proof of the first two points is pretty straightforward. Because the transaction is assigned prior to calling the service function, we know the service function has a transaction. And because the transaction is created in a using statement, we know that it will be disposed.

The third point is harder to prove. If two threads call GetOrder on the same object, they will both try to assign a transaction to the same OrderServiceProvider. We have to prove that this won't happen. For that, we look at the Castle configuration:

<component
    id="...OrderService"
    service="...OrderService, ..."
    type="...OrderService, ..."
    lifestyle="transient">
        <parameters>
            <orderRepository>
                ${...OrderRepository}
            </orderRepository>
        </parameters>
</component>

Because the OrderService is declared to be transient, we are guaranteed to get a new instance each time the component is resolved. Each web request resolves the service, so we know that each web request gets a unique object. The request is thread safe. Q.E.D.

Perils of imperative proofs
While these proofs are correct, there is a significant problem. We proved correctness based on the code that calls OrderService, not based on OrderService itself. In fact, there is nothing about the OrderService API that compels the caller to use it correctly.

For example, the service requires that we set the BusinessTransactionContext property before we call a method. If we forget one little assignment statement, then the service does not have a transaction. This problem can be caught no earlier than integration testing.

If, on the other hand, we forget to create our transaction in a using statement, we have a more subtle problem. We may see that database connections leak over time. Or we may see blocking because failed transactions remain open instead of being rolled back. This problem won't be caught in integration testing, and may be very difficult to diagnose when it finally is identified.

And what if we forget to configure our components to be transient? The only time this causes a measurable problem is when we have concurrency greater than 1. This doesn't happen during a typical integration test. Nor does this usually happen during manual testing. The first time a web application sees concurrency is usually during stress testing, or maybe even production.

This led into a discussion of prerequisites. There are ways that we can rewrite the OrderService API so that it cannot be used incorrectly. Some examples are to require a parameter and to use the constructor. We'll apply these techniques to the OrderService in the next post.