Archive for March, 2010

Map IDs during data import

Thursday, March 25th, 2010

It’s always best to develop against realistic data. No data is more realistic than production data. If you can legally and ethically import production data to development, you should definitely do so.

We’re developing a new rebate program for pharmaceuticals. Rather than creating a dummy product catalog, we would rather import the actual catalog from production.

The problem is the auto-incrementing IDs. We’re not just pulling in a flat structure; we’re maintaining relationships in a tree structure. When we insert rows into the dev database, these rows get different IDs than the production rows. As a result, the foreign key relationships get broken.

One solution to this problem is to drop all of the data and do an identity insert. Temporarily turn off the auto-incrementing behavior of the primary key so that you can specify exactly what it should be. This brings all of the production IDs into dev, so the foreign key relationships still line up.

The problem with that solution is that you have to drop all of the existing dev data in order to import. We have other structures that are loosely related to the product catalog. Dropping the product catalog would require us to drop some of the tables used by these other systems. All of the test data we loaded for those systems would be lost.

Here’s my solution
Instead of enforcing that the development IDs are equal to the production IDs, we can instead record a map. By joining to this map, we can insert the development IDs into foreign key columns, preserving the relationships from production. Here’s how we do it.

Step 1: Restore a backup of production data into a new database on the development server. You might also use replication for this, if you already have a subscription. Be sure that you get only the subset that you are legally and ethically permitted to use. In our case, the product catalog is what we need, but we do not restore any tables containing patient information.

Step 2: Identify the natural key. The ID that you are mapping is the surrogate key. The natural key is the set of columns that tell you that this is conceptually the same row. In our case, we are using the drug class name and the product group name.

Step 3: Copy rows that do not already exist based on the natural key. Use INSERT … SELECT … WHERE NOT EXISTS for this operation. For example:

INSERT INTO [Dev].dbo.[DrugClass]
           ([Name])
SELECT [Name]
  FROM [Prod].dbo.[DrugClass] s
  WHERE NOT EXISTS (SELECT * FROM [Dev].dbo.[DrugClass] t WHERE t.Name = s.Name)
GO

Step 4: Create a temp table to map surrogate keys. Populate this table by joining the natural keys.

IF OBJECT_ID('tempdb..#DrugClassMap') IS NOT NULL
    DROP TABLE #DrugClassMap
GO

SELECT s.DrugClassId as SourceDrugClassId, t.DrugClassId as TargetDrugClassId
  INTO #DrugClassMap
  FROM [Prod].dbo.[DrugClass] s
  JOIN [Dev].dbo.[DrugClass] t ON s.Name = t.Name
GO

Step 5: Use the temp table to import related tables. Join to the source table by source ID, and insert by target ID.

INSERT INTO [Dev].dbo.[ProductGroup]
           ([DrugClassId]
           ,[Name])
SELECT m.[TargetDrugClassId]
      ,s.[Name]
  FROM [Prod].dbo.[ProductGroup] s
  JOIN #DrugClassMap m ON m.SourceDrugClassId = s.DrugClassId
  WHERE NOT EXISTS (SELECT * FROM [Dev].dbo.[ProductGroup] t WHERE t.Name = s.Name AND t.DrugClassId = m.TargetDrugClassId)
GO

Now you can run the script anytime you need to import data. You won’t destroy existing data in the process.

Pass through view model

Sunday, March 21st, 2010

I will be speaking at the Dallas XAML user group on May 4. I’ll guide the group through the creation of a simple MVVM application to demonstrate why you want to use the pattern. This is a hands-on format, so bring your laptop. We will be using Visual Studio 2008 and WPF, so no special downloads will be required.

In preparing for this presentation, I’ve researched much of the MVVM advice currently available from a number of sources.

In almost every example, the presenter copies data from their Model into the View Model. Some don’t even have a Model at all! I, on the other hand, create pass-through View Models. My View Models don’t store any data. So either I’m doing it wrong, or my idea of MVVM is different from everybody else’s. Before I present the pattern to a room full of people with laptops, I’d like to know which.

Disconnected View Model
The typical example of a View Model keeps a copy of its data. The properties work against local fields.

public class PersonViewModel : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    private string _firstName;
    private string _lastName;

    public string FirstName
    {
        get
        {
            return _firstName;
        }
        set
        {
            _firstName = value;
            FirePropertyChanged("FirstName");
            FirePropertyChanged("FullName");
        }
    }

    public string LastName
    {
        get
        {
            return _lastName;
        }
        set
        {
            _lastName = value;
            FirePropertyChanged("LastName");
            FirePropertyChanged("FullName");
        }
    }

    public string FullName
    {
        get
        {
            return string.Format("{1}, {0}", _firstName, _lastName);
        }
    }

    private void FirePropertyChanged(string propertyName)
    {
        if (PropertyChanged != null)
            PropertyChanged(this, new PropertyChangedEventArgs(propertyName));
    }
}

Before you display the view, you have to copy the data into the view model. At some point later, you copy the data back out of the view model.

public class PersonViewModel
{
    //...

    public static PersonViewModel FromPerson(Person person)
    {
        return new PersonViewModel()
        {
            FirstName = person.FirstName,
            LastName = person.LastName
        };
    }

    public void ToPerson(Person person)
    {
        person.FirstName = _firstName;
        person.LastName = _lastName;
    }
}

Some of the experts just skip the model class altogether and go straight from their view model to the database or service.

Pass Through View Model

I prefer the pass through style. The view model stores no data.

public class PersonViewModel : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    private Person _person;

    public PersonViewModel(Person person)
    {
        _person = person;
    }

    public string FirstName
    {
        get
        {
            return _person.FirstName;
        }
        set
        {
            _person.FirstName = value;
            FirePropertyChanged("FirstName");
            FirePropertyChanged("FullName");
        }
    }

    public string LastName
    {
        get
        {
            return _person.LastName;
        }
        set
        {
            _person.LastName = value;
            FirePropertyChanged("LastName");
            FirePropertyChanged("FullName");
        }
    }

    public string FullName
    {
        get
        {
            return string.Format("{1}, {0}", _person.FirstName, _person.LastName);
        }
    }

    private void FirePropertyChanged(string propertyName)
    {
        if (PropertyChanged != null)
            PropertyChanged(this, new PropertyChangedEventArgs(propertyName));
    }
}

Instead of storing a copy of the Person data, this view model uses the storage that is in the actual Person model. There is no back-and-forth copying of data.
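
Here is a minimal sketch of how a view might be wired up to the pass-through view model. The PersonWindow view and the LoadPerson call are hypothetical names for illustration only; the point is that the view model wraps the model rather than copying it.

// Hypothetical wiring code; PersonWindow and LoadPerson are illustrative names.
Person person = LoadPerson();

PersonWindow view = new PersonWindow();
view.DataContext = new PersonViewModel(person); // wrap the model, don't copy it
view.Show();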

The advantages of pass through

I prefer the pass through style because it limits the number of moving parts in the application; in other words, it minimizes the “degrees of freedom”. Fewer moving parts means:

  1. Less code
  2. Fewer tests
  3. Fewer things that could break
  4. Simpler communication

If data exists in two places, then it will diverge. Data in one place will be more up-to-date than data in the other place. You have to write code to manage the divergence. That code needs to be tested. That code could break.

Many of the MVVM frameworks that the experts use have a mechanism for sending messages from one view model to another. This is necessary because each view model has its own copy of the data. But if each view model passes through to the same model object, then message passing is no longer necessary. The data model itself becomes the means of communication between view models.
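
As a sketch of that idea, two view models constructed over the same Person instance (assuming Person is a plain class with settable FirstName and LastName properties) communicate simply by reading and writing the shared model:

Person person = new Person { FirstName = "John", LastName = "Doe" };

// Two independent view models wrap the same model object.
PersonViewModel editor = new PersonViewModel(person);
PersonViewModel summary = new PersonViewModel(person);

// An edit made through one view model is immediately visible through the other;
// no message passing is required because both pass through to the same Person.
editor.FirstName = "Jane";
string fullName = summary.FullName; // "Doe, Jane"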

The disadvantages of pass through

There is a trade-off between the disconnected and pass through styles. The pass through style does have some disadvantages:

  1. Harder to support Cancel
  2. Requires change notification from model
  3. Difficult to represent complex transformations

Some applications call for the modal dialog OK/Cancel metaphor. It is easier to implement this metaphor if you copy the data into a disconnected view model. If the user presses cancel, you just don’t copy the data back to the model.
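
For example, the FromPerson and ToPerson methods shown earlier make the OK/Cancel flow a one-liner in each direction (PersonDialog is a hypothetical modal view):

// Copy the model into a disconnected view model before showing the dialog.
PersonViewModel viewModel = PersonViewModel.FromPerson(person);

PersonDialog dialog = new PersonDialog { DataContext = viewModel }; // hypothetical modal view
if (dialog.ShowDialog() == true)
    viewModel.ToPerson(person);   // OK: copy the edits back to the model
// Cancel: do nothing; the model was never modified.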

While view models can use the data model to share data, they need to be notified of changes that other view models are making. That means that your view models not only implement INotifyPropertyChanged, they also subscribe to it. This can lead to a lot of bookkeeping code. Fortunately, there is a solution to this, but you have to use a third-party library.
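
As a rough sketch of that bookkeeping, suppose the Person model itself implemented INotifyPropertyChanged (it does not in the example above); the view model’s constructor would then subscribe and re-raise notifications for the view:

public PersonViewModel(Person person)
{
    _person = person;

    // Assumes Person implements INotifyPropertyChanged, which the example model above does not.
    _person.PropertyChanged += (sender, e) =>
    {
        // Re-raise the model's notification so bindings on this view model refresh.
        FirePropertyChanged(e.PropertyName);
        FirePropertyChanged("FullName");
    };
}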

Some transformations cannot be easily represented as a pass through. Maybe the objects in the data model need to be combined or regrouped in different ways. This regrouping cannot be easily done with pass through properties. A disconnected view model can make large scale transformations during the copying process.

None of the experts that I’ve studied make a distinction between disconnected view models and pass through view models. This is unfortunate, because the differences are significant. Where the advantages of a pass through view model outweigh the disadvantages, I find it to be an excellent choice. But before you can make that choice, you have to know that there is a choice to be made.

A better Ignite Dallas video

Thursday, March 11th, 2010

Here’s a high-quality video of last weekend’s talk, with the slides included.

Expression of Intent

Monday, March 8th, 2010

This weekend I worked on the Factual language parser. I started with a brute-force recursive descent parser, but ended up with something more elegant. The final product expresses the intent better than the original code did. It is easier to maintain, and easier to extend. You can follow along with the changesets on Codeplex. Here’s how I did it.

Step 1: Make it work
The goal was to create a parser for a known language. It was not to create a framework for any language that you could throw at it. My constraints were well defined:

  • Parse the Factual language
  • Produce an Abstract Syntax Tree
  • Work within an open source license

These constraints defined my solution. Since I am working on an open source project, I did not want to take a dependency upon a proprietary parser generator. And since I knew the language that I was parsing, I was able to choose parser patterns that fit that language.

It would be foolish to take on this project without knowing the theory of compiler design. I have seen too many ad-hoc parsers created by people who did not do their research. The mistakes that they made could have easily been avoided. “Make it work” does not mean throwing together the simplest thing that would work. It means think it through, but go straight at the problem. Persistence is no substitute for understanding.

Knowing the theory, I knew that my language as defined was LL(4). It is unambiguous with a lookahead of only 4 tokens. I also knew that with a slight modification, I could transform it into an LL(1) grammar. This kind of grammar is easily parsable using recursive descent.

I started with this unit test:

[TestMethod]
public void WhenNamespaceHasNoDot_NamespaceIsRecognized()
{
    Parser parser = new Parser(new StringReader("namespace GameModel;"));
    Namespace result = parser.Parse();
    Pred.Assert(result.Identifier, Is.EqualTo("GameModel"));
    Pred.Assert(result.LineNumber, Is.EqualTo(1));
}

I did not write this code, as some proponents of TDD would recommend:

public Namespace Parse()
{
    return new Namespace("GameModel", 1);
}

Instead, I wrote this:

public Namespace Parse()
{
    Consume();
    Namespace namespaceNode = MatchNamespace();
    return namespaceNode;
}

private Namespace MatchNamespace()
{
    Token namespaceToken = Expect(Symbol.Namespace, "Add a 'namespace' declaration.");

    if (Lookahead.Symbol != Symbol.Identifier)
        throw new FactualException("Provide a dotted identifier for the namespace.", Lookahead.LineNumber);
    string namespaceIdentifier = MatchDottedIdentifier();

    Expect(Symbol.Semicolon, "Terminate the namespace declaration with a semicolon.");

    return new Namespace(namespaceIdentifier, namespaceToken.LineNumber);
}

That first method would have made the test pass, but it represents none of the understanding of the problem space. I find it better to capture that understanding in code as soon as possible, rather than writing a foolish method that I will throw away later.

After getting it working, I had a suite of unit tests. Each subsequent step was performed while keeping that same test suite passing.

Step 2: Clean up the working code

It bothered me that MatchNamespace knew that MatchDottedIdentifier expected an Identifier. So I did a quick refactoring to move that knowledge closer to the right place.

private Namespace MatchNamespace()
{
    ...

    if (!StartOfDottedIdentifier())
        throw new FactualException("Provide a dotted identifier for the namespace.", Lookahead.LineNumber);
    string namespaceIdentifier = MatchDottedIdentifier();

    ...
}

private bool StartOfDottedIdentifier()
{
    return Lookahead.Symbol == Symbol.Identifier;
}

private string MatchDottedIdentifier()
{
    StringBuilder result = new StringBuilder();

    Token identifier = Expect(Symbol.Identifier, "Begin with an identifier.");
    result.Append(identifier.Value);

    ...
}

Step 3: Find the patterns

After cleaning the code in a similar way in several places, I noticed that there were Start and Match pairs for almost every rule. I compared this with my knowledge of parsers, and found that it is a valid generalization.

In an LL(1) grammar, you can calculate the First and Follow sets of every production. The First set is the set of tokens that could appear at the beginning of the production. The Follow set is the set that could appear after the production has been reduced. I recognized that Start represented the First set. For example, the First set of a dotted identifier is just the identifier token, which is exactly what StartOfDottedIdentifier checks. The Follow set is useful for generating the First set (First includes Follow if the production could be empty), but it was not strictly required for my grammar.

Having discovered that pattern, I extracted a common interface (actually an abstract base class).

public abstract class Rule<T>
{
    public abstract bool Start(Symbol symbol);
    public abstract T Match(TokenStream tokenStream);
}

I implemented this interface with one derived class per production.
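
For instance, the rule for a terminal token might look roughly like this. This is only a sketch: it assumes the TokenStream exposes a Lookahead token and a Consume() method, and the class name is illustrative.

public class TerminalRule : Rule<Token>
{
    private Symbol _symbol;

    public TerminalRule(Symbol symbol)
    {
        _symbol = symbol;
    }

    public override bool Start(Symbol symbol)
    {
        // A terminal can only start with its own symbol.
        return symbol == _symbol;
    }

    public override Token Match(TokenStream tokenStream)
    {
        // Take the current token and advance the stream.
        Token token = tokenStream.Lookahead;
        tokenStream.Consume();
        return token;
    }
}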

Step 4: Find the common intent

Several of the rules had a common structure. They each called a sequence of other rules in turn. Each time, they would check the Start, and then call Match. This commonality was not accidental. Each of these rules intended to represent a sequence. They only differed in which sequence, and what was done with it. So I factored out that intent.

public class RuleSequence3<T1, T2, T3, T> : Rule<T>
{
    private Rule<T1> _rule1;
    private Rule<T2> _rule2;
    private string _error2;
    private Rule<T3> _rule3;
    private string _error3;
    private Func<T1, T2, T3, T> _reduce;

    public RuleSequence3(Rule<T1> rule1, Rule<T2> rule2, string error2, Rule<T3> rule3, string error3, Func<T1, T2, T3, T> reduce)
    {
        _rule1 = rule1;
        _rule2 = rule2;
        _error2 = error2;
        _rule3 = rule3;
        _error3 = error3;
        _reduce = reduce;
    }

    public override bool Start(Symbol symbol)
    {
        return _rule1.Start(symbol);
    }

    public override T Match(TokenStream tokenStream)
    {
        T1 value1 = _rule1.Match(tokenStream);

        if (!_rule2.Start(tokenStream.Lookahead.Symbol))
            throw new FactualException(_error2, tokenStream.Lookahead.LineNumber);
        T2 value2 = _rule2.Match(tokenStream);

        if (!_rule3.Start(tokenStream.Lookahead.Symbol))
            throw new FactualException(_error3, tokenStream.Lookahead.LineNumber);
        T3 value3 = _rule3.Match(tokenStream);

        return _reduce(value1, value2, value3);
    }
}

Step 5: Express the intent

Now that rules can be built from the outside rather than coded from within, I had the opportunity to create a fluent interface. The code that uses this interface expresses the grammar of the language, rather than the steps required to parse it. The resulting code reads like a specification rather than an implementation.

// dotted_identifier -> identifier ("." identifier)*
// namespace_declaration -> "namespace" dotted_identifier ";"
var dottedIdentifier = Separated(
    Terminal(Symbol.Identifier),
    Symbol.Dot,
    identifier => new StringBuilder().Append(identifier.Value),
    (stringBuilder, identifier) => stringBuilder.Append(".").Append(identifier.Value));
var namespaceRule = Sequence(
    Terminal(Symbol.Namespace),
    dottedIdentifier, "Provide a dotted identifier for the namespace.",
    Terminal(Symbol.Semicolon), "Terminate the namespace declaration with a semicolon.",
    (namespaceToken, identifier, ignored) => new Namespace(identifier.ToString(), namespaceToken.LineNumber));
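
The Terminal and Sequence helpers above are thin factory methods over the rule classes. Here’s a sketch of what they might look like (the actual Separated helper is omitted, and Terminal builds on the terminal rule sketched earlier):

private static Rule<Token> Terminal(Symbol symbol)
{
    return new TerminalRule(symbol);
}

private static Rule<T> Sequence<T1, T2, T3, T>(
    Rule<T1> rule1,
    Rule<T2> rule2, string error2,
    Rule<T3> rule3, string error3,
    Func<T1, T2, T3, T> reduce)
{
    return new RuleSequence3<T1, T2, T3, T>(rule1, rule2, error2, rule3, error3, reduce);
}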

I find that this code is much easier to maintain and extend. The engineering of getting the parser right has been done once. Additional productions can take advantage of that engineering, so that simple mistakes are avoided. Reuse was not the goal, but in the end I had created a reusable parser generator.

I might have been able to go straight there, but only at the risk of over-engineering the solution. By refactoring my way from a working system toward an expression of intent, I was able to bring in just the theory that this particular problem requires. I know the limitations of my parser, and I can live within them.

Ignite video online

Sunday, March 7th, 2010

How do we collaborate through software?

  • The CAP Theorem – Eric Brewer
  • Eventual Consistency – Werner Vogels
  • Event Sourcing – Greg Young and Udi Dahan
  • Partial Order

Download the slides