Archive for July, 2006

Source code inconsistency

Thursday, July 27th, 2006

Earlier this week, I talked about an error message that appeared in Microsoft Visual Studio as a result of VSS integration. While the error message itself offered a few lessons for the software adventurer, I would now like to examine the core of the problem: VSS is not ACID.

If you are not familiar with the acronym ACID, see the Wikipedia article on the subject. Basically, the idea is that you get to work on isolated snapshots of the data, with no interference from others who may be working concurrently. When you want to commit your work, you synchronize with others.

This is a fundamental flaw in the way that VSS works. Other source code control systems are ACID, like CVS, Subversion, and StarTeam, just to name three. Let's walk through a scenario to see why this is important.

Suppose you are working on a bug or enhancement and you have some files checked out. Now you need to change code in a file that you haven't checked out to complete your work. You open VSS and check out the new file. Or, if you are using VSS integration, you just start typing and Visual Studio automatically checks out the file for you.

But wait, someone else has modified the file and checked it in. Their change was part of a larger modification. They have also changed other files that it depends upon, including some that you currently have checked out. As a result, when VSS checks out this one file, thereby getting the changed version, it is incompatible with the other source files in your project. You will spend an hour or two getting your project to compile again, which may mean merging with the changes you've already completed.

The only way to continue working without interruption is to start making your changes from the version of the file that is already on your system. This ensures that you have a consistent set of source (the C in ACID), and keeps you isolated from other people's changes (the I) until you are ready.

There is a way to do this in VSS, but it is tedious. First, turn off Visual Studio integration. Then, when you need to check out a file, go to the VSS file history window. Click on the version that you already have (you may have to click Diff to find it, or better yet create a label before starting your work). Perform the check out from history. When you are ready to check in, VSS will help you perform the merge, if necessary.

Better yet, invest in an ACID version control system. CVS and Subversion are free, but they take a little setting up. StarTeam and others cost money, but may offer the usability and features that you need. Whatever you choose, you will recoup your investment by avoiding interruptions due to inconsistency.

Visual Studio in a Zombie State!

Tuesday, July 25th, 2006

Never let a programmer write error messages. In our own software, we have one error message written by a former employee which reads, "Problem shutting down server! A cold blooded murder will now be performed." This, of course, indicates that the server is terminating the background thread.

I'm tragically amused when I see such error messages in software. Take, for example, Visual Studio. Open a project that is under source control. Make sure you have an older version of the project file than the tip. Now go into the project properties dialog and make a change. Visual Studio will check out the project file for you to edit, and in the process it will get the tip (a serious problem that I will blog about later). At this point, when you press OK or Apply, the following message will appear:

"Cannot access data for the desired configuration since it is in a zombie state."

Being a programmer, I think I can understand why my configuration might be in a "zombie state". I can also figure out that all I have to do is cancel the dialog and try again. This time, the project file will be checked out before I begin.

But while this error message may speak to the programmer who wrote it, it does not mean much to me.

Never underestimate the power of a log

Friday, July 14th, 2006

In the past, I haven't given logs their proper due. I can't count the number of times that I've been asked to diagnose a production problem, asked for log files, and received something completely worthless. Usually, I would dig through the code until I found a likely path that would produce the file observed. When this happens, I'm reading the log file forensically, rather than diagnostically.

Several times, the log file contained the events that occurred, but not the variables that were in scope at the time. On one famous occasion, we found several logs of "Module not found" in the datacenter. If it had only said which module it was looking for, we could have easily fixed the problem. I finally changed the code and deployed a patch to production just to capture the module name.

When capturing variables in a log, it is common to build a message through string concatenation. This leads to logs that are hard to search without using regular expressions. It would be better to separate the event text from the values, and use consistent naming for values. The latter helps you find all events that correspond to a particular piece of data.
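A minimal sketch of the difference, in Java (the event name and value key here are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LogFormat {
    // Concatenation buries the value in free text; you need a regular
    // expression just to find every occurrence of this event.
    public static String concatenated(String module) {
        return "Could not find module " + module + " while loading.";
    }

    // Separating a fixed event name from named values makes the log
    // searchable by event, and by value name across different events.
    public static String structured(String module) {
        Map<String, Object> values = new LinkedHashMap<>();
        values.put("moduleName", module);
        return "ModuleNotFound " + values;
    }
}
```

Now a plain-text search for "ModuleNotFound" finds every instance of the event, and a search for "moduleName" finds every event that touches that piece of data.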

Another common problem I ran into was logging only when an exception was caught. This left reams of code that were never logged. During normal operation, no exception is thrown, but not all errors manifest as exceptions. When diagnosing a business logic problem, the only clues we get are the incorrect outputs.

On the opposite end of the spectrum, I sometimes add log statements in sections of code that are exercised so frequently that the resulting file is unreadable. Even if the log contains the right information, it's useless unless you can find it.

Finally, effective log filtering requires discoverability. Operations must know what logs are available before they can turn them off or on. Too often, I've had to review code to discover the exact incantation to turn on a desired log. I've tried documenting all of the log options, but that document tends to change quickly.

Here's my solution

I tried the Patterns and Practices Logging Application Block, but it does little to solve the problems mentioned. Filtering is specified based on type, which is not granular enough in practice. If you turn off one log that appears too frequently, you risk turning off another that you need. And setting priority is not enough either, because it's hard to decide at code time what the severity of each log will be at run time.

In Java applications, I've used Log4J. This works better, since you can filter based on class. But even that is not as granular as I would prefer. In addition, there is no support for capturing variables without concatenating a message.

I have created a logging infrastructure for our latest product that solves all of these problems. First, I borrowed the info provider idea from Patterns and Practices. An info provider captures a collection of name/value pairs and displays them in a consistent format. But I put my own little twist on the idea: I have an Add(name, value) method that returns this. That allows me to dot-chain info right in line.
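The chainable Add idea looks roughly like this, sketched in Java rather than the original C# (all names are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// An info provider captures a collection of name/value pairs and
// displays them in a consistent format.
public class Info {
    private final Map<String, Object> values = new LinkedHashMap<>();

    public Info add(String name, Object value) {
        values.put(name, value);
        return this;  // the twist: returning this enables dot-chaining
    }

    @Override
    public String toString() {
        return values.toString();
    }
}
```

A log call can then read `log(moduleNotFound, new Info().add("moduleName", name).add("version", version))` with no string building at the call site.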

Next, I allow log filtering down to the individual instance. A log instance corresponds to a line of code. No instance can be reused. But I do not actually log file names and line numbers, since these change from one build to the next. If we filtered based on actual line of code, the filter files would not be valid upon update.

Then, I collect all log instances in one place. That way, when the log line appears in business code, it does not contain any instrumentation details like message or severity. I like to separate the decisions of instrumentation from the decisions of business, even though these two worlds must intersect to some extent.

Capturing all log instances in one place also improves discoverability. The software itself can provide a list of all log options. This list is automatically updated as you release new software versions, and can be used programmatically to present a log filtering UI.

And one final feature of my logging infrastructure is causality. You can add logging with the using statement. Every log statement that occurs within the brackets is attributed to the outer one. With this structure, you can create hierarchical log files that follow an entire algorithm from beginning to end.
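The original uses C#'s using statement; Java's try-with-resources is the closest analogue. A minimal sketch of the causality idea (all names hypothetical):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Opening a scope attributes every log written inside it to the outer
// statement; nesting depth becomes indentation in the log file.
public class CausalLog {
    public interface Scope extends AutoCloseable {
        @Override void close();  // no checked exception to catch
    }

    private static final Deque<String> scopes = new ArrayDeque<>();
    public static final List<String> lines = new ArrayList<>();

    public static Scope scope(String name) {
        lines.add("  ".repeat(scopes.size()) + name);
        scopes.push(name);
        return scopes::pop;  // closing the scope pops it off the stack
    }

    public static void write(String message) {
        lines.add("  ".repeat(scopes.size()) + message);
    }
}
```

Writing `try (CausalLog.Scope s = CausalLog.scope("ProcessMessage")) { ... }` indents everything logged inside the braces under the ProcessMessage entry, producing a hierarchical trace of the algorithm.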

I wish I could post the code, but alas it belongs to my employer. Perhaps I'll recreate the system in an open source domain. Post a comment if you are interested in helping.

WordPress got it right

Wednesday, July 12th, 2006

I have just moved the Adventures in Software blog to its own domain. In this process, I had to install Apache, MySQL, and PHP. Each of those steps involved much arcana and Googling for answers before I found success. The community that has formed around these tools is great, but they are no substitute for straightforward information from the product vendor.

But when I got to the WordPress install step, I was pleasantly surprised. All the information I needed was on one page. All of the configuration was in one file. The authors of WordPress have really thought through the setup experience. They got it right.

Once installed, I thought that I was in for a lot of copy and paste. Here again I was pleasantly surprised. The forward-thinking authors even provided a mechanism for porting my posts into my WordPress database. All I did was click a link, enter my user ID and password, and watch in awe as it did all the work right there in my browser.

Kudos to Ryan Boren, Matt Mullenweg, and all of the other people who make this possible. I just hope that my software can be as thoughtful as yours.

SOA Tight Coupling

Friday, July 7th, 2006

Service Oriented Architecture is a name for modeling an automated process as a set of services. This approach pushes back against the usual IT pattern of defining a central database that various programs interface with. SOA solves some problems, but I have found that it doesn't go far enough.

The problem with the monolithic central database pattern of traditional IT is one of dependency. All of the applications that interface with the central database come to depend upon its schema. And in order for them to communicate with each other, their schema dependencies must overlap. This makes it difficult to modify one application without breaking the others. As a result, the applications not only become too dependent upon the database, they also become too dependent upon each other.

The SOA solution to this is to integrate applications through contracts instead of through databases. Each application, or service as it is now called, defines a set of requests that it can process, and the schema for those requests. Services house their own databases, so that internal schema can change without affecting any other services.

This is an improvement, but I have found that it is not enough. In practice, SOA still leads to tight coupling. This happens in two ways.

First, a service -- a logical concept -- becomes tightly coupled to machines -- a physical concept. The service must be installed on a server or set of servers, and its database must be housed within that farm. Clients of the service are configured to send requests to that machine or farm. It is difficult to move a service to new hardware, or to start a second instance of a service within the organization.

Second, service contracts get tightly coupled to their implementations. Most contracts in practice are not negotiated among a group of service providers and clients. Instead, each contract is dictated by the service provider. In some implementation technologies, such as WSDL, a client must contact the service to discover the contract. And even when the contract was negotiated, as was ISO-8583 for financial exchange, it is often implemented inconsistently from one service provider to the next. The end result is that the client must know the particular service provider it is using instead of a general purpose contract.

Here's my solution
At the day job, we have a number of third-party integration solutions where we play the role of service provider. We reduce coupling by moving the integration point away from the network. Instead, we publish client-side APIs that talk to our servers. We use language constructs such as interfaces to define contracts. Language-based interfaces tend to be less fragile than protocol-based contracts because we have compiler validation.

We have to write a client implementation for each technology that is expected to use the service (e.g. both Java and .NET). But this is more than offset by the time saved by the consumer of the service. It is much easier for one party to ensure interoperability on both sides of the protocol. Just look at how easily a .NET web service client talks to a .NET web service, as compared to bridging .NET and Java.

If more than one service provider implements the interface, then the client can quickly swap one service for another. In fact, this switch can be made at run time, where some requests are sent to one provider and others to another provider.
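The idea can be sketched with a Java interface standing in for the contract (the provider names and methods here are hypothetical):

```java
import java.util.Map;

public class ContractSketch {
    // The language-level contract: the compiler validates every call site.
    public interface PaymentService {
        String authorize(String account, long amountCents);
    }

    // Each provider ships a client-side implementation that talks to its
    // own servers; clients depend only on the interface.
    public static class ProviderA implements PaymentService {
        public String authorize(String account, long amountCents) {
            return "A approved " + account + " for " + amountCents;
        }
    }

    public static class ProviderB implements PaymentService {
        public String authorize(String account, long amountCents) {
            return "B approved " + account + " for " + amountCents;
        }
    }

    // Swapping providers, even per request at run time, is just a lookup.
    public static String route(String providerKey) {
        Map<String, PaymentService> providers = Map.of(
                "a", new ProviderA(),
                "b", new ProviderB());
        return providers.get(providerKey).authorize("acct-1", 500);
    }
}
```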

This is only a partial solution, however. Our service is still tightly coupled to our datacenter. It would not be easy to move or split the data. But if we needed to, at least those concerns will be neatly hidden away behind a client-side interface.

Prefer asynchronous messaging

Wednesday, July 5th, 2006

We are in the middle of a rewrite of one of our main products. The current generation of the product has a large number of significant problems that manifest as support incidents and low customer satisfaction. One of the core faults behind these problems is the messaging infrastructure.

We developed a SOAP infrastructure before .NET was released, and before the ink on web service standards was dry. Our infrastructure was based on remote procedure calls (RPCs). This was great for development and worked fine in the lab. We could define an interface, create a proxy, and call it as if it were a local object.

When a real procedure is called, the thread transfers execution to a subroutine. When the subroutine terminates, the thread returns to the caller. Simple.

RPCs attempt to simulate the same model across machine boundaries. To do so, they block the client thread while the message is delivered to the server. The server processes the message and returns the result to the client. Only then is the client's thread resumed.

This works well, except when it doesn't. In a real procedure call, it is inconceivable that the thread will not make it into the subroutine. In an RPC, however, many factors conspire to interfere with the message. You do not have guaranteed delivery. Therefore, your RPC can't truly look just like a real procedure call. It has to be prepared for failures such as dropped connections and timeouts.

Furthermore, this approach does not scale. While the message is in transit and work is being performed on the server, the client's thread is blocked. And when problems occur, clients retry, and that forces the server to do the same work all over again. These retries escalate the problem, and cause an avalanche of activity during which the server is too busy to get any work done.

Here's my solution
Face it, the RPC model is broken. Don't use it. Don't allow your clients to expect an immediate answer to their queries. Instead, use asynchronous messaging. An asynchronous message does not return anything, so the caller doesn't have to wait for it to be processed. Furthermore, asynchronous messages can guarantee durable delivery, depending on the implementation you choose.

If you can, use a message queue. MSMQ now supports HTTP transport, so it is finally ready for the Internet. In the Java space, JMS is your tool of choice. But if your clients are too lightweight for a message queue, you still have options.

Web services are based on HTTP, which is intrinsically synchronous. However, they can be used asynchronously. .NET goes part of the way for you. It adds Begin and End pairs to your web service proxies. This at least keeps your threads free, but it offloads the waiting to the framework. Just look at what it does behind the scenes and you'll see that this isn't a full solution. Your requests are still serialized: if you begin methods A, B, and C, you will always receive the results in exactly that order, even if A is slow and C is fast. The methods still queue up and get submitted synchronously.

What you can do is to break your methods up. Have one method for delivery, and another for retrieving results. If the results are not ready, don't wait for them. Return immediately. That way, you can deliver messages in any order, and periodically check for their results without blocking. It sounds like more work on the server, but it is actually more scalable.
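Split into two methods, the server side of this pattern might look like the following sketch (names are hypothetical, and a real server would queue the work rather than completing it inline):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class AsyncFacade {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // Delivery returns a ticket immediately; the caller never blocks on
    // processing. The work is completed inline here only to keep the
    // sketch self-contained.
    public String deliver(String request) {
        String ticket = UUID.randomUUID().toString();
        results.put(ticket, "processed: " + request);
        return ticket;
    }

    // Retrieval returns the result if it is ready, or null so the caller
    // can return immediately and check again later.
    public String poll(String ticket) {
        return results.get(ticket);
    }
}
```

The client can now deliver messages in any order and poll for results on its own schedule, which is exactly what a message queue does for you under the covers.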

This is how message queues work under the covers. This is also the approach we are taking in our rewrite, since we can't afford a message queue on the client. It's more difficult to write the code this way, but tests are already showing that it is a more stable solution. If you can use a real message queue, then you will have the best of both worlds.

TDD Test Drive

Monday, July 3rd, 2006

I admit to being a bit of a skeptic when it comes to agile programming methodologies. Back before XP was an operating system, I read Kent Beck's book with a highlighter and felt-tip pen in hand. Most of what I found there ran counter to my own experience.

Still, I found some good ideas that I have since incorporated into my daily routine. I do the simplest thing that works (though I have a high standard for what "works"). I check in unit tests and make them part of the build process. I refine my code through refactoring. And I have even engaged in pair programming with some success.

So when I saw Jean Paul Boodhoo on DNR.TV demonstrating Test Driven Development, I thought I would give it a try. His presentation did nothing to convince me that TDD was a good thing. In fact, if I had stopped there, I would have said it was an excuse to do sloppy work. However, my own experience with it has shown that it can be useful.

TDD as presented by JPB is a cycle of "red, green, refactor". First you write a failing test. Then you make it pass by whatever means necessary. Then you refactor your code to get rid of the ugliness that you had to add to make the test pass. I experimented with this cycle in my own work, and found that it quickly ground to a halt. But then I added another step to the cycle and found myself in a better place.

My experiment was to build the message pump for our current project using TDD. The message pump is the background thread that pulls messages from a queue, sends them to a web service, receives messages from the web service, and dispatches those messages to business logic. It has to handle RAS/WAN failover scenarios, and use the phone line judiciously. The challenge here was to ensure that the code balanced these disparate concerns under various configurations.

I thought this would be a better test for TDD than JPB's Model/View/Presenter example. JPB tested only the presenter, which in his case was simply a pass-through from the model to the view. Not a very challenging piece of code. The message pump, however, has to communicate with four different peers: the queue, the phone, the web service, and the business logic. By necessity of requirements, it's a much more complex piece of code.

To isolate the message pump from the four peers that it touches (some of which were being developed in parallel), I used NMock2 to mock their interfaces. I used dependency injection to provide these interfaces to my production code. Some of the interfaces were at least partially defined, while others were initially empty.
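The shape of that isolation, sketched in Java with hand-rolled fakes standing in for NMock2 (the interface and class names here are hypothetical):

```java
public class PumpSketch {
    // Two of the pump's four peers, reduced to single-method interfaces.
    public interface MessageQueue { String dequeue(); }
    public interface WebService { String send(String message); }

    public static class MessagePump {
        private final MessageQueue queue;
        private final WebService service;

        // Dependency injection: the peers arrive through the constructor,
        // so a test can substitute any implementation it likes.
        public MessagePump(MessageQueue queue, WebService service) {
            this.queue = queue;
            this.service = service;
        }

        // One pump cycle: pull a message and forward it to the service.
        public String pumpOne() {
            String message = queue.dequeue();
            return message == null ? null : service.send(message);
        }
    }
}
```

A test can then construct the pump entirely from lambdas, e.g. `new PumpSketch.MessagePump(() -> "m1", m -> "sent:" + m)`, and assert on the interactions without touching a real queue or network.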

I created the first unit test to do the simplest thing possible: start and stop the thread. It failed, I made it pass, and I found I did not need to refactor. I was off to a good start.

On the second test, I configured the pump for a WAN environment (no phone concerns) and queued a synchronous message. It failed, I made it pass, then I refactored. So far, TDD was working as advertised.

I added asynchronous messages next, and discovered that I should have done them first. After all, they are simpler than synchronous messages. In addition, I found that the interaction between these two concerns was causing my code to smell. I refactored for about half a day to correct this, but felt that I should have foreseen the problem.

When I finished the WAN test suite, I started the RAS test suite. I wrote the first test, saw it fail, and then set to work on making it pass. Here is where I hit a wall. The code that I had written for WAN was not well organized to support RAS. According to the TDD philosophy, I needed to refactor it to make it ready. Unfortunately, I didn't know what I needed to refactor toward. I spent the day chasing that wild goose.

Here's my solution
I solved the problem by going back to the whiteboard. This is how I work naturally, so I figured out how to work it into TDD. I drew the structure of the code I had written so far, then I added the new code. I applied the strategy pattern and defined a ConnectionStrategy base class, with WAN and Dialup concrete classes. Then I drew a flowchart (yes, I still use them) for the algorithm that used the strategy. Once I had planned all of my changes, I coded them.
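A rough Java sketch of that strategy (the real code was .NET, and the details here are hypothetical):

```java
public class ConnectionSketch {
    // The base class from the whiteboard: the pump algorithm calls these
    // without knowing which connection it has.
    public static abstract class ConnectionStrategy {
        public abstract boolean connect();
        public abstract void disconnect();
    }

    public static class WanStrategy extends ConnectionStrategy {
        public boolean connect() { return true; }  // always-on link
        public void disconnect() { }               // nothing to release
    }

    public static class DialupStrategy extends ConnectionStrategy {
        public boolean lineOpen;
        public boolean connect() { lineOpen = true; return true; }
        public void disconnect() { lineOpen = false; }  // free the line
    }

    // One algorithm, two connection behaviors.
    public static String sendWith(ConnectionStrategy connection, String message) {
        if (!connection.connect()) return null;
        try {
            return "sent " + message;
        } finally {
            connection.disconnect();  // use the phone line judiciously
        }
    }
}
```

The pump's flowchart stays the same for WAN and RAS; only the strategy passed in changes.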

I found that the whiteboard allowed me to explore my ideas from the top down, as I have always done. However, by refining my design for the specific purpose of RAS vs. WAN, my whiteboard design remained focused and grounded. I wasn't designing the whole system ahead of time. I was designing just enough to pass the next test. I used the whiteboard as a guide to writing code, and then I reflected minor changes made in the code back to the whiteboard.

So I added a step to the TDD cycle: Red, Redraw, Green, Refactor. With this approach, I get the best of both worlds. Top-down design and agility.