Archive for November, 2006

The Work Queue Pattern

Thursday, November 30th, 2006

In each of the Enterprise systems I've worked on, we've had background processes to massage data. They could be report aggregators, media renderers, or message processors. The features that they have in common are:

  • Work items may not be lost.
  • Work must not be interrupted.
  • Work items are independent of one-another.
  • Work items may not be repeated.

The pattern that best implements these features is a work queue. A work queue pairs a durable queue of work items with a set of redundant processors that compete to service it. The durability of the queue ensures that work items are not lost. Processors are allowed to work in parallel since work items are independent. The redundancy ensures that work continues even if a processor goes down. And proper use of the queue ensures that no two processors get the same work item.

This problem could be solved with JMS or MSMQ, but in my experience these solutions imposed undesirable requirements. JMS marries your code to J2EE, while MSMQ requires a Microsoft server environment. I prefer to keep my options open. Fortunately, it is not difficult to implement a work queue directly in the database of your choice.

Here's my solution
First you have to define a work queue table. In addition to columns that describe the work to be done, this table must also have a column that identifies the processor that is performing the work. A foreign key to a processor table works well. The processor table has a primary key (the processor ID) and the name of the machine running the process.
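
As a rough sketch, here is one way those two tables might be created from Java over JDBC. The table and column names are my own assumptions (Payload stands in for whatever columns describe your work), and the DDL assumes SQL Server syntax:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class WorkQueueSchema {
    // Creates the processor table and the work queue table that references it.
    public static void create(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            statement.executeUpdate(
                "CREATE TABLE Processor (" +
                "  ProcessorID INT IDENTITY PRIMARY KEY," +    // the processor ID
                "  MachineName VARCHAR(128) NOT NULL)");        // machine running the process
            statement.executeUpdate(
                "CREATE TABLE WorkQueue (" +
                "  WorkItemID INT IDENTITY PRIMARY KEY," +
                "  Payload VARCHAR(MAX) NOT NULL," +            // describes the work to be done
                "  ProcessorID INT NULL" +                      // NULL until a processor claims the item
                "    REFERENCES Processor (ProcessorID))");
        }
    }
}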

When a processor is ready to perform some work, it must query the work queue table twice. First it looks for any available work item, and then it ensures that no other processor has claimed that work item. This two-step query is the most effective way I've found to detect and resolve collisions; let me explain in more detail.

The first query is performed with the transaction isolation level set to "read uncommitted". This allows the query to see any work queue items that are currently being claimed. The result of this query is the work item ID that the processor will try to claim for itself.

DECLARE @id INT;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SET @id = (SELECT TOP 1 WorkItemID FROM WorkQueue WHERE ProcessorID IS NULL);

The second query is performed with the standard transaction isolation level of "read committed". It both locks the record and verifies that the work has not yet been claimed.

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN TRANSACTION;
SELECT ProcessorID FROM WorkQueue WITH (UPDLOCK) WHERE WorkItemID=@id;

If the row is found and the processor ID comes back NULL, then you know that the work has not yet been claimed. You may now update the ProcessorID and commit the transaction.

However, if the row is not found, then the work has already been completed and deleted from the queue. If the processor ID is not NULL, then a collision has occurred. Another processor is already busy working on that item. So you just commit your transaction and try again.
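
Putting the whole claim sequence together, here is a sketch of how a processor might run it from Java over JDBC. It assumes the table and column names used above and the SQL Server UPDLOCK hint; it illustrates the pattern and is not code lifted from a production system:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class WorkItemClaimer {
    // Tries to claim one unclaimed work item for the given processor.
    // Returns the claimed WorkItemID, or -1 if there was nothing to claim
    // or another processor won the race (in which case the caller simply retries).
    public static int tryClaim(Connection connection, int processorId) throws SQLException {
        // Step 1: a dirty read, outside any transaction, to find a candidate item.
        connection.setAutoCommit(true);
        connection.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
        int workItemId;
        try (PreparedStatement find = connection.prepareStatement(
                "SELECT TOP 1 WorkItemID FROM WorkQueue WHERE ProcessorID IS NULL")) {
            try (ResultSet rs = find.executeQuery()) {
                if (!rs.next()) {
                    return -1;                              // the queue is empty
                }
                workItemId = rs.getInt(1);
            }
        }

        // Step 2: back to read committed, inside a transaction; lock the row and verify the claim.
        connection.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
        connection.setAutoCommit(false);
        try (PreparedStatement check = connection.prepareStatement(
                "SELECT ProcessorID FROM WorkQueue WITH (UPDLOCK) WHERE WorkItemID = ?")) {
            check.setInt(1, workItemId);
            try (ResultSet rs = check.executeQuery()) {
                if (!rs.next()) {
                    connection.commit();                    // already completed and deleted
                    return -1;
                }
                rs.getInt(1);
                if (!rs.wasNull()) {
                    connection.commit();                    // collision: another processor claimed it
                    return -1;
                }
            }
        }

        // The item is ours: record the claim and commit.
        try (PreparedStatement claim = connection.prepareStatement(
                "UPDATE WorkQueue SET ProcessorID = ? WHERE WorkItemID = ?")) {
            claim.setInt(1, processorId);
            claim.setInt(2, workItemId);
            claim.executeUpdate();
        }
        connection.commit();
        return workItemId;
    }
}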

It took several days of trial-and-error to get the pattern right the first time. But now that I have it, it is easy to create a work queue without the overhead of JMS or MSMQ.

Episode 2: Design Patterns

Tuesday, November 28th, 2006

Listen to the show

Charles Martin returns (http://webcudgel.com) and Raymond Jensen joins in (http://jensensoftware.com).

  1. Erich Gamma, et al. Design Patterns. a.k.a. Gang of Four: http://en.wikipedia.org/wiki/Design_Patterns
  2. Martin Fowler. Patterns of Enterprise Application Architecture: http://en.wikipedia.org/wiki/Martin_Fowler
  3. Stelting and Maassen. Applied Java Patterns.
  4. Steven John Metsker. Design Patterns in C#.
  5. Ron Jacobs. ARCast. Patterns and Anti-Patterns for SOA. http://arcast.net
  6. http://www.codeproject.com/
  7. Microsoft Patterns and Practices. http://msdn.microsoft.com/practices/
  8. http://www.patternshare.org/

Indexes in the File System

Sunday, November 26th, 2006

The Windows file system has become a jumble of mixed concerns. What started out as a way of organizing information hierarchically has mutated into something that second-guesses users at every turn.

Case in point. I finished building my wife's new machine and started copying photos from mine. I left the process running for a while, then went back to my machine to check on its progress. I found a message on the screen asking if I would like to overwrite the file "Thumbs.db". As Alan Cooper says, the computer had gone stupid. Instead of doing something useful, like continuing to copy the files that it could, it stopped and waited for someone who wasn't there to answer a dumb question.

This, of course, happened because Windows creates an index file whenever you switch to thumbnail view. Her machine had the folder open in thumbnail view so that we could verify that the pictures were indeed being copied. So her machine created an index file, and my machine tried to push a new one to it.

The root of the problem is that the OS is using my file system for something other than my files. My intent is to store pictures in this folder, but the OS throws something else in for its own purpose. Who does this folder belong to, anyway?

There are plenty of other examples of this. When you edit a Word document, the program creates a temporary tilde file in the same folder. This is used as a workspace while the program runs, and as a restore point if the program crashes. I once made the mistake of editing a Word document directly on a floppy disk (yes, it was that long ago) and removing the disk while editing. The program crashed soon afterward and the disk was left in an unusable state. I intended to put the disk back in prior to saving, but Word was using it for its own purposes.

Some source code control systems also exhibit this behavior. Visual Source Safe puts index files in all of your working folders. Subversion creates an entire subdirectory called .svn in each working folder. This makes it impossible to just check out a snapshot of the project, say for an automated build, and get a pristine copy. Or to check out of one branch in order to check into another. And what if, for some strange reason, a folder named .svn was actually part of my project?

Here's my solution
Just use your own database for information about a user's files. Leave the file system alone. That belongs to the user.

Or better yet, put the user's documents into your database, too. The file system is just a confusing jumble of metaphors for the average user. The less they have to see it, the better their experience will be.

But in any case, don't put indexes in the file system.

The Single-Queue Dispatcher Pattern

Saturday, November 25th, 2006

Today is Black Friday in the States: the day after Thanksgiving when the Christmas shopping season officially begins. I usually try to avoid stores on this day, but my wife needs a new PC, so we went to Fry's.

There was quite a crowd there as we were picking out components, but the experience was incredible. They had extra staff helping customers on the floor, and even more manning the registers. The line moved quickly and smoothly due to their excellent planning.

The system that Fry's uses is a single queue. One person stands at the head of the queue and dispatches customers to multiple registers. Unlike a grocery store, with its one queue per register, this system doesn't force you to choose your server before getting in line. Have you ever gotten behind the person with three price-checks, only to watch people who queued up after you get served first in the next line? That won't happen at Fry's.

The single-queue solution is so simple and elegant that it is my favorite parallel-service software pattern. It is applicable any time you have units of work concurrently flowing into a system, when those units of work are independent. Examples I've seen include HTTP requests, financial transactions, and remote procedure calls.

This pattern has three components. First is the queue, which holds units of work. Second is the dispatcher, which draws units of work from the queue. Third is the pool, which holds the processors to which the dispatcher assigns work.

The queue is a thread-safe container. It may be implemented in memory if it is OK for work items to be lost (e.g. HTTP requests, which will be retried). Or it may be implemented in a database if work items cannot be lost (e.g. financial transactions).

The dispatcher is a single thread in a tight loop. It blocks on the queue until a unit of work is available, then it blocks on the pool until a processor can be allocated. More sophisticated dispatchers may even grow and shrink the pool based on demand.

The pool is another thread-safe collection, this one of processors. A processor carries with it the resources it needs to do the work, such as threads or database connections. Allocation of these resources can be expensive, so processors are returned to the pool with their resources intact. Usually the pool is implemented as a stack, so that a small number of processors will be kept as busy as possible. Processors that sink to the bottom of the pool will go stale, and will be the first to be reclaimed by a sophisticated dispatcher.
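
Here is a minimal sketch of the three pieces in Java. It assumes an in-memory queue (so work items can be lost on a crash) and a fixed set of processors; all of the names are my own, and a real implementation would keep a worker thread alive inside each processor rather than borrowing a fresh one per item:

import java.util.List;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleQueueDispatcher<T> {
    // The queue of units of work. In-memory, so items are lost on a crash;
    // back it with a database table if work items must survive.
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();

    // The pool of processors, used as a stack so the busiest few stay warm.
    private final BlockingDeque<Processor<T>> pool;

    public interface Processor<U> {
        void process(U unitOfWork);
    }

    public SingleQueueDispatcher(List<Processor<T>> processors) {
        this.pool = new LinkedBlockingDeque<>(processors);
    }

    // Producers call this to enqueue a unit of work.
    public void submit(T unitOfWork) throws InterruptedException {
        queue.put(unitOfWork);
    }

    // The dispatcher: a single thread in a tight loop.
    public void runDispatcher() throws InterruptedException {
        while (true) {
            T unitOfWork = queue.take();                // block until work is available
            Processor<T> processor = pool.takeFirst();  // block until a processor is free
            new Thread(() -> {
                try {
                    processor.process(unitOfWork);
                } finally {
                    pool.addFirst(processor);           // return it to the top of the stack
                }
            }).start();
        }
    }
}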

This pattern is supported by Berkeley sockets. The accept() function blocks on a queue within the TCP/IP stack. The ideal use of this function is for a dispatcher thread to call it in a tight loop and allocate the sockets it returns to threads in a pool.
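
A sketch of that ideal use in Java; the port number and the empty handler are placeholders of my own:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AcceptDispatcher {
    public static void main(String[] args) throws IOException {
        ExecutorService workers = Executors.newFixedThreadPool(8);  // the pool of processors
        try (ServerSocket listener = new ServerSocket(8080)) {      // the queue lives in the TCP/IP stack
            while (true) {
                Socket socket = listener.accept();                  // the dispatcher blocks here
                workers.submit(() -> handle(socket));               // hand the connection to a pooled thread
            }
        }
    }

    private static void handle(Socket socket) {
        // Hypothetical request handling goes here.
        try { socket.close(); } catch (IOException ignored) { }
    }
}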

This pattern is implemented in .NET's ThreadPool. The QueueUserWorkItem method places a unit of work into a managed queue, and the .NET Framework dispatches that work to a pooled thread for processing.

There are countless other implementations of this pattern, from JMS to MSMQ. But even if one of these solutions is more than you need, you can implement the pattern yourself. It is simple, efficient, and effective.

Episode 1: User Interface Design

Tuesday, November 21st, 2006

Download the show

Michael is joined by special guest Charles Martin, a web designer who blogs at http://www.webcudgel.com. We discuss user interaction design as described in Alan Cooper's About Face 2.0: The Essentials of Interaction Design. We talk about a poor user interface that we both worked on in the past, as well as a good one in the Hamachi VPN. We round out the conversation with a comparison of Windows vs. Mac.

You are Invited to a Talk Cast

Thursday, November 16th, 2006

Please sign up for a TalkShoe account, then come join me for a talk cast. We will start on Monday, November 20, at 9 PM Central (GMT-6). I'll host a 20-30 minute show every Monday, and post the high-quality audio for listening on-line or downloading for later.

This week, we will discuss user interface design. We will draw from Alan Cooper's book and personal experience. Come participate.

Data Aggregation for Reporting

Thursday, November 16th, 2006

A good way to design an enterprise system is to separate reporting from transaction processing. This lets the two processes run independently without interfering with one another. It also allows the database designer to optimize each component for its particular use.

When these two concerns are separated, the data must be moved from the transaction processing side of the business into the reporting side. In the process, data should be aggregated to keep storage requirements low and to make reports run quickly. There are two ways of accomplishing this: real-time and agent-driven.

A real-time aggregation solution updates the aggregate records as transactions are processed. For example, a book sale transaction might update a record that keeps track of the number of books sold by author. The author's record is either inserted or updated during the book sale transaction.
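
For illustration, assuming a made-up AuthorSales aggregate table, the real-time update might run inside the sale transaction like this:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class RealTimeRollup {
    // Bumps the per-author aggregate as part of the book sale transaction.
    public static void recordSale(Connection connection, int authorId, int copies) throws SQLException {
        try (PreparedStatement update = connection.prepareStatement(
                "UPDATE AuthorSales SET CopiesSold = CopiesSold + ? WHERE AuthorID = ?")) {
            update.setInt(1, copies);
            update.setInt(2, authorId);
            if (update.executeUpdate() == 0) {
                // No aggregate row for this author yet, so insert one.
                try (PreparedStatement insert = connection.prepareStatement(
                        "INSERT INTO AuthorSales (AuthorID, CopiesSold) VALUES (?, ?)")) {
                    insert.setInt(1, authorId);
                    insert.setInt(2, copies);
                    insert.executeUpdate();
                }
            }
        }
    }
}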

An agent-based aggregation solution, on the other hand, calculates an aggregate in the background after many transactions. For example, a query might run once a day to insert a count of books sold by author for that day.
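
The agent-based variant might instead run once a day from a background job, again with table and column names of my own invention:

import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class NightlyRollupAgent {
    // Inserts one aggregate row per author for the given day's sales.
    public static void rollUp(Connection connection, Date day) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(
                "INSERT INTO DailyAuthorSales (SaleDate, AuthorID, CopiesSold) " +
                "SELECT ?, AuthorID, SUM(Copies) FROM BookSale WHERE SaleDate = ? GROUP BY AuthorID")) {
            statement.setDate(1, day);
            statement.setDate(2, day);
            statement.executeUpdate();
        }
    }
}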

There are advantages and disadvantages to each technique. Real-time updates provide the most accurate and useful reports, but they can interfere with the main business of the system. Agent-based aggregates don't block current transactions, but they can generate large batches of data that can be hard to move. Which method is right for a given situation?

Here's my solution
The best rule of thumb that I've found is to use the granularity as an indicator. Some aggregates are extremely coarse-grained, pulling all of the data into a small number of categories. Others are finer-grained, rolling up to a large number of very specific line items. Coarse-grained aggregates should be updated by a background agent, while fine-grained aggregates should be summed in real time.

Coarse-grained aggregates create sums for a small set of keys. Usually, this key set is fixed, as in the categories of books or the locations of book stores. If the key set is small, then a real-time rollup would experience a lot of contention. The same small number of rows would be updated many times by many concurrent threads. However, a delayed agent rollup would calculate each of those rows once, in a single process. That makes the agent-based approach ideal.

Fine-grained aggregates create sums for a large set of keys. This key set grows over time because it is based on the data itself. It might include individual books, authors, or customers. If the key set is large, a background agent would produce a large set of rows all at once. Large row sets cause problems with replication and blocking. But a real-time update would insert or modify one row at a time, leaving plenty of room for replication or blocking operations to interweave. Furthermore, if a user ID is part of the key set, collisions are extremely unlikely. User activity tends to be serialized, since one person can't usually do two things at once with the same application.

Separating reporting from transaction processing is an important part of an enterprise system. Consider the size of the aggregate key set when choosing how to go about moving the data.

Share the Authority

Monday, November 13th, 2006

In the typical database application, identity is created by an auto-increment key. Every insert to a table allocates a new number that uniquely identifies that new row. This number is used within the business logic of the application to return to that row in the future, and as foreign keys in other tables. It's a solution that works quite well.

That is, until it stops working.

One problem with this approach is that it fails to scale out. The one database server that allocates IDs becomes the one and only authority on identity. In order to create a new identity, every web server, app server, and integrated database server must defer to the one keeper of the auto-increment key. That one server becomes the bottleneck for inserting rows into the table.

Database vendors have created ingenious solutions to this problem. It is possible with today's technology to connect a cluster of homogeneous servers and have them act as one big database. However, this solution is costly to build, tricky to configure, and difficult to maintain.

The problem (which we see time and again) is that our model has written a check that our technology can't cash. When we design our systems relationally, we assume that we can always go to one big table and that it will contain all of the rows that we need. If a customer is in my system, she can be found in the one and only customer table. Since there is no other table that holds customers, the monotonically increasing ID is all I need to find her.

That assumption builds in a bottleneck that we later try to engineer out. In creating a cluster, we acknowledge that it can't really be one big table, but we still have to make it look like one. The relational model demands it.

But if we take a step back, we discover that we don't really need all of the promises that the relational model makes. Instead, we can live with a smaller set of promises that the technology can actually deliver.

Here's my solution
Distribute data across different databases. These databases agree in schema, but differ in data. Identify rows not just by their auto-incremented ID numbers, but by a compound key that also includes a database ID. Yes, go ahead and store that database ID in a separate column -- make that the first column -- of every table.
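
For example, the identity that flows through the business logic could be a small value object carrying both halves of the key. The names here are mine, purely for illustration:

// A compound identity: the database that owns the row, plus the row's auto-increment ID.
public final class GlobalId {
    private final int databaseId;   // which database the row lives in
    private final long rowId;       // the auto-increment key within that database

    public GlobalId(int databaseId, long rowId) {
        this.databaseId = databaseId;
        this.rowId = rowId;
    }

    public int getDatabaseId() { return databaseId; }
    public long getRowId() { return rowId; }

    @Override
    public String toString() {
        return databaseId + ":" + rowId;
    }
}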

Use a connection factory in your applications that recognizes this database ID. Based on that ID, create a connection to the appropriate database. Give your operations team control over that configuration so that they can control the allocation of IDs to databases. Give them tools to move data from one database to another in order to split or merge databases as demand requires.
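
A minimal sketch of such a factory, assuming the operations team supplies the mapping from database ID to connection string; the class and method names are my own, and it reuses the GlobalId sketch above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ShardConnectionFactory {
    // Maintained by operations: database ID -> JDBC connection string.
    // Several IDs may point at the same physical server until a split is needed.
    private final Map<Integer, String> connectionStrings = new ConcurrentHashMap<>();

    public void configure(int databaseId, String jdbcUrl) {
        connectionStrings.put(databaseId, jdbcUrl);
    }

    // Opens a connection to whichever database owns the given identity.
    public Connection connect(GlobalId id) throws SQLException {
        String url = connectionStrings.get(id.getDatabaseId());
        if (url == null) {
            throw new SQLException("No database configured for ID " + id.getDatabaseId());
        }
        return DriverManager.getConnection(url);
    }
}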

Joining across databases is often impossible and never recommended. So keep all detail rows in the same database as their master. Choose your pivot points wisely to avoid the need for cross-database joining. Database allocation should be coarse, such as keeping all records for one client or account in the same database. But when you need finer grained control, choose a relationship that you exploit only through presentation-tier linking or middle-tier business logic, never through data-tier joins. You can serve a link on a web page that leads to another database, but don't mix data from two databases on one page.

And always allow for many database IDs to refer to one physical database server. This will be necessary when underutilized databases are merged. It will also give your operations team the ability to plan ahead for future splits should they become necessary.

One server or cluster may be all you ever need, but by sharing the authority you give yourself options. The operations team may thank you in years to come.

Electronic Voting Software

Wednesday, November 8th, 2006

I used a Diebold terminal to vote in yesterday's election. All of the security issues aside (many of which have been covered elsewhere), I could not help but critique their software as I was casting my ballot. I found that the system does some things well, but has a long way to go.

One of Alan Cooper's fundamental rules of user interaction design is that computerizing a real world metaphor makes it worse. You lose some of the features of the real world, and you take on none of the benefits of computerization. Case in point: the ballot presented on the Diebold machine was made to look exactly like a paper ballot. It had boxes for the offices organized in three columns. Candidates' names were listed within the box. The boxes flowed down one column and jumped to the top of the next.

This ballot design was originally created for the mechanical punch system that electronic voting replaces. Columns work well in that system because the pages open down the center and guide your stylus to the punch holes.

But columns do not work as well on a computer screen. The resolution of a display is much lower than that of paper, so text is naturally harder to read. To compensate, the voting terminal used a very large font. This caused text that would easily fit on a column of a paper ballot to sit uncomfortably within the similarly sized column of the on-screen ballot.

On the confirmation screen, the same layout was presented in a scrollable view. Because the flow was top-to-bottom then left-to-right, I found myself scrolling through the first column, then jumping back to the top to scroll the second column. Scrolling is bad enough with a mouse, but dragging your finger down a touch screen is terribly uncomfortable.

A better metaphor for a computer screen is to take the entire width. Then logically group the offices so that an entire group fits on one page. Put tabs across the top of the screen to let you jump to any group, and put indicators on the tabs of groups that are still incomplete. There are advantages to columns on a paper ballot, but those advantages do not translate to the computer screen. I know that a committee determines the layout of the ballot, and that they have many legal and political constraints within which to work, but I didn't see any allowance given to the medium.

On the plus side, the check boxes had good affordance, meaning that it was obvious that you could tap them. They looked like they wanted to be selected, and they gave satisfying visual feedback when activated. A punch hole on a paper ballot is much smaller, although equally satisfying when pierced.

Eventually, all voting will be electronic. It's just going to take us some time to learn what paper metaphors to leave behind as we move to the digital medium.

Unused Icons on the Desktop

Tuesday, November 7th, 2006

As a geek, I like to retain control over my desktop (my computer desktop, that is; my physical desktop is a mess). So when Windows pops up and tells me that I have unused icons on my desktop, I usually ignore it.

Today, however, I decided to give it a chance. It identified two icons that I have never actually used (one of which was Network Neighborhood) and asked if I'd like to delete them. I said yes. Then instead of actually deleting them, it moved them into a new folder called "Unused Desktop Shortcuts" which it created ... on my desktop.

Windows weaseled out of a potentially useful feature for fear that it might make a mistake. It replaced two unused icons with one. The recycle bin is there for a reason. If I make a mistake, I can go there to undo it. Please don't add an interruption, three clicks, and extraneous thought to my day only to stop halfway.