Archive for March, 2007

Dirt Simple XML Parser

Friday, March 30th, 2007

I know there are already far too many ways to parse XML. Softartisans has an entire page devoted to choosing which XML parser is best for different situations. But still I find myself facing the same trade-off each time I need to consume an XML document: do I do something simple, or do I do something efficient? DOM is easy, but very costly. SAX is the most efficient, but very difficult to use. Other technologies fall between those two on the spectrum.

I've finally solved that problem once and for all. I created an XML library that makes SAX dirt simple to use. You express an XML document declaratively, putting in hooks to handle elements that interest you. This declaration turns into a DefaultHandler for SAX to invoke. The source code is available in two parts: mallardsoft.xml depends upon mallardsoft.util.

To build a parser, create a new NestedHandler and initialize it with a new Document. Using in-line dot-chaining, set up sub elements of the Document with new Element objects. Here's a quick example:

final StringBuffer firstName = new StringBuffer();
final StringBuffer lastName = new StringBuffer();

DefaultHandler handler = new NestedHandler(new Document()
  .one("people", new Element()
    .zeroToMany("person", new Element()
      .init(firstName)
      .init(lastName)
      .optionalAttribute("firstName", firstName)
      .requiredAttribute("lastName", lastName)
      .end(new InvokeHandler() {
        public void handle() throws SAXException {
          newPerson(firstName.toString(), lastName.toString());
        }
      })
    )
  )
);

Pass this to a SAX parser and it will extract all of the person elements, grab their first and last names, and call the newPerson() method for each one. You can nest elements as deeply as you need to. And if you need a recursive declaration, use an ElementSpec object. Give it a try with a more complex XML file and see how well it scales up.
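For readers who haven't wired a DefaultHandler into SAX before, here is a minimal sketch of the "pass this to a SAX parser" step, using only the JDK. Because the mallardsoft library isn't reproduced here, it substitutes a hand-written handler for the NestedHandler declaration above. The class name PeopleSaxDemo, the parsePeople method, and the sample XML are all made up for illustration. It also reads the attributes at the start of each person element rather than at the end, which is fine here only because nothing nested is involved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of mallardsoft.xml.
class PeopleSaxDemo {

    /** Parses the XML and returns "first last" for each person element. */
    static List<String> parsePeople(String xml) throws Exception {
        final List<String> people = new ArrayList<>();

        // Hand-coded equivalent of the declarative handler:
        // firstName is optional, lastName is required.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs)
                    throws SAXException {
                if (!"person".equals(qName)) {
                    return;
                }
                String last = attrs.getValue("lastName");
                if (last == null) {
                    throw new SAXException("person is missing lastName");
                }
                String first = attrs.getValue("firstName");
                people.add(((first == null ? "" : first) + " " + last).trim());
            }
        };

        // Any DefaultHandler, including a NestedHandler, is passed the same way.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return people;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<people>"
                + "<person firstName=\"Ada\" lastName=\"Lovelace\"/>"
                + "<person lastName=\"Turing\"/>"
                + "</people>";
        System.out.println(parsePeople(xml)); // prints [Ada Lovelace, Turing]
    }
}
```

Even for two attributes, the hand-written version needs its own element-name check and its own required-attribute check, which is the bookkeeping the declarative form is meant to absorb.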

Update
I've changed the names to better reflect true XML nomenclature. The proper names are not "root" and "node", but "document" and "element".

AiS 19: Engineering Processes

Tuesday, March 27th, 2007

Carl Honore, author of In Praise of Slowness. Less is More
http://ted.com/tedtalks/tedtalksplayer.cfm?key=c_honore

The goal of any engineering process is to get a group of people on a complex project to work together. But the process should be as simple as possible. The process should never itself become the goal.

The process should facilitate communication. Put gates into the process in those places where information is handed off from one individual or team to another. The process should also promote progress. Give people ownership, responsibility, and the right to say "no". This empowers them and gives them pride in their work.

The key features of the Handmark Server Process are:

  1. Define use cases to extract requirements.
  2. Create an implementation plan to decide what needs to be done to implement the use cases.
  3. Track changes in order to build a deployment plan.
  4. Aggregate the changes to produce a build.
  5. Review the build in operations prior to deployment to ensure that it is re-runnable.

One Exception per Release

Monday, March 26th, 2007

At Handmark, we are on a weekly release cycle with the OnDemand product. Each week, we fix bugs, provide new features, or generally improve the product in some small way. In addition, we eliminate a bit of log noise each cycle.

We use Log4J to capture interesting events. Some of these events are more interesting than others. The severity level of each event is supposed to determine just how interesting it is, but in practice there is little correlation. Severity is decided at code time, when the level of interest is unknown. Sometimes we catch and log an exception as an error, but it actually comes up quite often and doesn't cause much trouble. We don't want to turn the filter up, in case we miss a real error, so we live with "spammy" logs.

For each release, we identify one of these spammers and eliminate it. If there is truly a problem, we'll fix it. If it's just a benign occurrence, we'll lower the severity of the log event to INFO. Either way, we've improved the system by removing noise from the log that impedes our ability to diagnose real problems.

Here's my solution
I have a bash script that helps identify problems in the logs. It is useful for both diagnostics and for finding log spam. I call it "gather":

#! /bin/bash

egrep -A1 'Exception|ERROR|WARN|Caused by:' | tr -d 0-9 | sort | uniq -c | sort -nr

To use this script, I get into the log folder and execute a command like this:

cat server.log.2007-03-26-* | /home/mperry/gather | less

This lists the unique errors, warnings, and exceptions in decreasing order of frequency of occurrence. Things near the top of the list are usually problems that need fixing, or spam that needs cleaning.

AiS 18: Project Management

Wednesday, March 21st, 2007

Berkun, Scott. The Art of Project Management

Project Management Institute

We invited Handmark project manager Allen Davis to join the panel in this week's discussion. What's the most important skill that a project manager has? Is it planning? Giving PowerPoint presentations? Working with Microsoft Project? No, it's facilitating communication. The project manager usually does not have the authority to make a project team do what they need to. Force doesn't work. So he has to rely on persuasion to keep the project moving. He has to communicate with development, infrastructure, business analysis, and management teams, and to help those teams to communicate with each other.

What's the most important thing that a project manager brings to the team? A schedule? A methodology? Donuts? No, it's a vision. Sometimes he has to give that vision out a little at a time. Sometimes he has to share more of it with some team members than with others. But the project vision more than anything keeps the team members motivated to work together and to give their best efforts. And it is a standard, better than any Gantt chart, against which progress can be measured.

Meetup with Cali Lewis

Sunday, March 18th, 2007

We met Cali and Neal from Geek Brief. You can catch the interview with my daughter Kaela on Episode #145.

Start a Source Code Repository

Friday, March 16th, 2007

Most of us have side projects going on at home. These help us to hone our skills and learn new technologies, as well as to explore other domains that catch our interest. Sometimes our hours spent at night and on the weekend are an effort to get out of our day job. Software professionals tend to be software hobbyists.

Whether your side project is your cubicle escape strategy or just for fun, you should create your own source code repository. Obvious benefits include a backup of your work, a log of changes, access to past revisions, and portability between machines. Just running VSS on your desktop machine can satisfy these needs. But I recommend that you take it one step further. Put your source code repository out on the Internet so that your friends can share your code.

I use SVN Repository to host all of my side projects (I have three going right now). Their small business plan runs me $7.00 a month and gives me all the services I need. Sure, I run a small business, but this plan is good for individuals too.

If you want to have show-and-tell in the office, you can quickly add an account for each of your friends. They will be able to browse the code from the web, or download it using an SVN client like Tortoise SVN or Subclipse. One of my friends was so intrigued by the show-and-tell that he offered to contribute to the project. Hosting the repository online made this possible, where a local VSS database would have been difficult to share.

AiS 17: Open Source Software

Monday, March 12th, 2007

The panel continues the buy vs. build discussion with relation to open source software.

AiS 16: Buy vs. Build

Wednesday, March 7th, 2007

Should you invest your money in off-the-shelf software, or write it all yourself? The full panel weighs in.

Is Gemalto’s NIM Secure?

Sunday, March 4th, 2007

In my two most recent security posts, I talked about USB key solutions to Internet security. A comment from Schlum led me to contact Gemalto about a USB product that they promise will make your on-line experience secure. Their YouTube video doesn't give enough information to make any sort of recommendation, so I sent them an email. I asked what differentiates their product from the TrueCrypt/Portable Firefox key that I carry. Here is their reply:

Hi,

Thank you for your email.
The TrueCrypt seems to be a portable Password wallet and is primarily aimed at user convenience and not security.

The NIM is for now a issuer deployed (by your bank, or stock trading portal or any issuer who needs to make sure that the users are securely logging into their portals and can mutually authenticate them).

There can be multiple issuers who could use the same device avoiding a necklace of devices effect.

The key difference between a password wallet and the NIM is that the NIM is not prone to man-in-the--middle or phishing attacks

I hope that they make a deal with my bank soon, so I can truly evaluate this product. But until then, I don't have any specific information to go by. So I can only make the following general precautionary statements.

Do not use your TrueCrypt or any other USB key on an untrusted computer. Any USB key that does not have its own processor is no more secure than a floppy disk. Any program on the host computer can read and write data on the USB key. If you mount an encrypted drive, then it becomes clearly visible to any program on that machine. And if you have to enter a password, like you do for TrueCrypt, then a key logger can steal the password. Just like Gemalto says, the TrueCrypt/Portable Firefox USB key is for convenience, not for security on unknown systems.

Any device that lacks a processor to perform encryption and decryption must share the key with the host. If the host is compromised, then the key can be stolen. Similarly, any device that lacks a keypad must collect the password from the host. If that host is compromised, the password can be lifted.

Just remember when using any USB solution: the computer sits between you and your key. A man-in-the-middle attack doesn't have to come from outside the system.

Special Characters in MySQL Replication

Thursday, March 1st, 2007

MySQL uses command-based replication. That means that it sends the actual insert, update, or delete command down to the replicated database, instead of sending the data itself. Most database systems ship the actual transaction log, which ensures that the databases stay in sync. But MySQL's choice to use command-based replication causes a few problems.

One of the most insidious problems has to do with characters in text fields. In order to put special characters in a text field, you have to escape them. You can't just put an apostrophe in a text field, because the text itself is set off by apostrophes, or single-quotes. So you have to put a backslash in front of it. That tells the database to take the apostrophe as a literal.

This works fine for the master database. And in fact, JDBC and other database drivers will do this for you when you use prepared statements. However, this fails with MySQL's command-based replication.

In order to replicate the data, MySQL turns the data back into an insert, update, or delete command. In so doing, it fails to properly escape the text. That means that text containing an apostrophe that you properly escaped and inserted into the master will break replication. When the improperly escaped command is sent to the replicated database, it fails, and backs up all commands that come later.

A colleague suggested that we double-escape the text. That is, he preceded each apostrophe with three backslashes, not one. The first two become one backslash, and the third escapes the quote. Sure enough, the replicated command was properly escaped. Unfortunately, this led to inconsistent data. The data on the master contained a backslash-quote while the data on the replicated database had just a quote. Not a valid solution.
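The round trip is easier to follow with the characters written out. The sketch below is a simplified model, not MySQL's actual parser. EscapeDemo, escape, and unescape are hypothetical names, and they handle only backslashes and apostrophes. The replica step assumes the behavior described above, where the stored text is sent on without being escaped again.

```java
// Hypothetical model of the double-escape experiment.
class EscapeDemo {

    /** Prefixes every backslash and apostrophe with a backslash. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '\\' || c == '\'') {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    /**
     * Simplified model of how a quoted literal is read:
     * a backslash makes the following character literal.
     */
    static String unescape(String literal) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length()) {
                c = literal.charAt(++i);
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "O'Brien";

        // Double-escaping yields three backslashes before the apostrophe.
        String sent = escape(escape(original));
        // The master reads the literal once and stores the result.
        String onMaster = unescape(sent);
        // Assumption: the stored text is re-sent as-is, so the replica
        // reads it a second time.
        String onReplica = unescape(onMaster);

        System.out.println(sent);      // O\\\'Brien
        System.out.println(onMaster);  // O\'Brien
        System.out.println(onReplica); // O'Brien
    }
}
```

The last two lines of output differ, which is the inconsistency: the master keeps a stray backslash that the replica does not have.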

The only solution is to avoid apostrophes and backslashes in the data altogether. They must be converted into different characters or dropped from the text. Typical solutions are to convert apostrophes to ticks and backslashes to forward slashes.
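A conversion like that might look like the following sketch. The class and method names are made up for illustration, and it assumes "tick" means the backtick character.

```java
// Hypothetical helper; the name and the choice of backtick are assumptions.
class ReplicationSafeText {

    /** Replaces apostrophes with backticks and backslashes with forward slashes. */
    static String sanitize(String text) {
        return text.replace('\'', '`').replace('\\', '/');
    }

    public static void main(String[] args) {
        // The Java literal below is the text: O'Brien's C:\temp
        System.out.println(sanitize("O'Brien's C:\\temp")); // O`Brien`s C:/temp
    }
}
```

The conversion is lossy. Once the text is stored, there is no way to tell a converted apostrophe from a backtick that was in the original, so it only suits data where that distinction doesn't matter.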

This is not just a problem with text fields. Binary data is treated as text when converted into a command for replication. If that binary happens to contain a byte that reads as an apostrophe, then it will have the same problem. The best solution here is to Base64 encode the binary prior to inserting it, then decode it on the way out. The Base64 alphabet contains no apostrophes or backslashes, and is therefore safe for replication.
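As a rough sketch of that encode-on-the-way-in, decode-on-the-way-out step, the following uses java.util.Base64, which assumes Java 8 or later. On older JDKs a third-party codec would be needed. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper for storing binary data as Base64 text.
class Base64Column {

    /** Encodes raw bytes as Base64 text before they are inserted. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Decodes the stored Base64 text back into the original bytes. */
    static byte[] decode(String stored) {
        return Base64.getDecoder().decode(stored);
    }

    public static void main(String[] args) {
        // Sample bytes that include an apostrophe (0x27) and a backslash (0x5c).
        byte[] raw = {0x27, 0x5c, 0x00, (byte) 0xff};

        String stored = encode(raw);
        System.out.println(stored);

        // The encoded text never contains either troublesome character.
        System.out.println(stored.indexOf('\'') < 0 && stored.indexOf('\\') < 0); // true

        // Decoding recovers the original bytes exactly.
        System.out.println(Arrays.equals(raw, decode(stored))); // true
    }
}
```

The trade-off is size: Base64 text is about a third larger than the bytes it represents.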