Archive for the ‘Historic Modeling’ Category

Mutable fields in Correspondence

Tuesday, March 29th, 2011

image Correspondence is a client-side NoSQL database that synchronizes with other clients through a shared server. If you’ve experimented with the bits, there is a breaking change in the latest changeset. A new build is coming soon, as well as a closed beta of the synchronization server.

Correspondence is built on the concept of historical modeling. This is a set of theory describing historical facts with partial order. The facts have predecessor/successor relationships that define both the order in which they can consistently be applied, and the mechanism for graph traversal.

One of the core tenets of historical modeling is that facts are immutable. This makes them easy to synchronize. Fields also determine the fact’s identity. Two facts having equal fields are actually the same fact.

In many applications, it is desirable to have mutable fields. In historical modeling, you have to use the Entity Property pattern to simulate mutability. This pattern builds on the strengths of historical modeling, allowing participants to unambiguously synchronize state changes. It also has the added benefit of detecting conflicts, and allowing users and applications to resolve them. Unfortunately, this pattern can be cumbersome to apply.

The mutable keyword

A Correspondence model is defined using a language called “Factual”. The upcoming release adds three keywords to the Factual modeling language. These three keywords break a fact into sections:

  • key - Required. Contains all key fields.
  • mutable - Optional. Contains all mutable fields.
  • query - Optional. Contains all queries and predicates.

The key and query sections contain members that have always been in the language. The mutable section adds new functionality.

Fields in the mutable section are mutable. The compiler generates the Entity Property pattern on your behalf. These fields are not part of the key, since they are not actually stored within the fact itself. They are stored in successor facts that track changes.

You write this:

fact Person {
key:
    unique;

mutable:
    string name;
}

The Factual compiler generates this:

fact Person {
key:
    unique;

query:
    PersonName* nameCandidates {
        PersonName n : n.person = this
            where n.isCurrent
    }
}

fact PersonName {
key:
    Person person;
    string value;
    PersonName* prior;

query:
    bool isCurrent {
        not exists PersonName next : next.prior = this
    }
}

Disputable properties

The Entity Property pattern detects conflicts and allows the application to resolve them. To aide the application in this task, Correspondence defines Disputable<T>. This template can be converted to and from the raw type T, but it also exposes two additional properties.

  • InConflict – A boolean that is true if a conflict has been detected.
  • Candidates – A collection of possible values.

Using Disputable<T>, the view model can apply an alternate style when a conflict has been detected, and allow the user to choose from among the candidates to resolve it.

public class PersonViewModel
{
    private Person _person;

    public PersonViewModel(Person person)
    {
        _person = person;
    }

    public string Name
    {
        get { return _person.Name; }
        set { _person.Name = value; }
    }

    public bool ShowConflictStyle
    {
        get { return _person.Name.InConflict; }
    }

    public IEnumerable<string> CandidateNamesItemsSource
    {
        get { return _person.Name.Candidates; }
    }

    public string CandidateNamesSelectedItem
    {
        get { return _person.Name; }
        set { _person.Name = value; }
    }
}

The Factual modeling language specification has been updated to reflect the new mutable keyword. If you’ve created Correspondence models in the past, you will be required to add key and query section headers once you upgrade. Then you will be able to take advantage of mutability.

:s/historic/historical/g

Saturday, October 2nd, 2010

I have moved http://historicmodeling.com to http://historicalmodeling.com. The difference is subtle, but important. A historical fact is one that has occurred in the past. A historic fact is one that has altered the course of history. All past events are historical, but not all are historic.

By calling the set of patterns “Historic Modeling”, I was misusing the word. “Historical Modeling” is more precise.

In a few weeks, historicmodeling.com will go the way of javatuple.com. You will see the old site replaced with a smirking college co-ed linking to dealers of history books.

Historical Trees

Friday, October 1st, 2010

In developing Correspondence for the Windows Phone 7, I’ve uncovered a need for a new data structure. I’m calling this data structure a Historical Tree, since it is designed specifically for historical modeling.

Correspondence allows for interchangeable storage strategies. The storage strategies currently implemented are:

  • SSCE (.NET)
  • SQL Server (.NET)
  • SQL Azure (.NET)
  • Postgres (Java)
  • Memory (both)

Most of these are built on a relational database, with the obvious exception of the memory storage strategy. Memory storage was originally intended for unit testing, but can sometimes be useful for small clients. It is not optimized for large data sets.

Windows Phone 7 does not have an out-of-the-box relational database. The storage strategy that you see in the videos is a modified memory strategy. It loads all of the facts from isolated storage into memory on startup. This is incredibly inefficient, and not the way that I’m going to ship. Instead, I’m going to rewrite the isolated storage storage strategy to use a historical tree.

Operations

The operations that a Correspondence storage strategy must support are:

  • Insert a fact having data and references to predecessors.
  • Given a fact reference, load its data and predecessor references.
  • Given a hash code, load a fact reference.
  • Given a predecessor reference, find all successor references.

Update and delete are not valid operations on a Correspondence store. When a fact is created, all of its predecessors are known. Successors are all of the facts that reference a predecessor.

Append only

imageimage This set of operations suggests an append-only data structure.

You can add new facts to the end of the structure, but you cannot insert or remove them from the middle. Furthermore, all predecessor references are fixed and known at the time of insert. These will necessarily be references to facts earlier in the structure.

The only part of the data structure that changes is the list of successors. When a new fact is appended, it has to insert itself into the lists of each of its predecessors. To support this, a fact will participate in one linked list per predecessor.

Each fact has a pointer to the head of its linked list of successors. This is the fact most recently appended that references this fact as a predecessor.

Then, for each predecessor, the fact identifies the role that the predecessor plays, the reference to the predecessor, and the reference to the next successor of that predecessor. To traverse the list of successors, first follow the head. Then find the matching predecessor reference and follow its next pointer. This will visit successors from most recent to least recent.

Finally, the fact has a payload of data. This includes the type, the version, and values of all fields. The data appears last because it is not used while traversing the tree.

The only field of a record that can change is the head pointer (highlighted in red). All other fields are immutable. To protect against corruption, the head is redundant. Each record holds two pointers and an A/B switch. The switch determines which of the two pointers is active. During modification, the new pointer is written to the inactive area, and then the switch is flipped.

Insertion

To insert into this structure, a new record is appended to the file. Initially, the head pointer is null. The next pointers are copies of the head pointers of each predecessor.

Then for each predecessor, the new record’s position is stored as their head. This operation is done using the A/B switch so that the new pointer does not overwrite the active one.

If a fact references the same predecessor twice (possibly in two different roles), then both of the next pointers contain a copy of the previous head. Traversal will follow only the first of them, so the successor will appear only once in the linked list.

Traversal

References to all predecessors are immediately available from any record. References to successors, on the other hand, follow the linked list from the head.

When following the linked list, the predecessor sub-records are scanned for the one matching the desired fact. The next pointer of the first match is followed. If that pointer is null, then traversal is complete.

Hash code

The historical tree supports all but one required operation: hash code to fact reference. This operation is instead supported by a parallel index data structure. A red-black tree seems the best fit for this operation.

When a fact is inserted, its hash code is calculated. That hash code is inserted into a red-black tree, with a reference back to the fact’s position in the historical tree.

Facts are written to the historical tree before being added to the red-black index tree. If the index becomes corrupted, then in the worst case it can be rebuilt from the historical tree. In the best case, the corruption can be repaired by re-adding facts from the end of the historical tree.

Complexity

The complexity of an insert or traversal is not affected by the total size of the data set. Insertion complexity is proportional to the number of predecessors, which is fixed and based on the design of the model. Traversal of predecessors is similarly proportional to the number of predecessors. A well-designed model will limit the number of predecessors.

Traversal of successors is proportional to the number of successors times the average number of predecessors that each predecessor has. It requires walking a linked list, which is O(n). But at each node, the list of predecessors is scanned to find the next pointer. Because of this, we multiply the average number of predecessors. Based on the design of the model, this number is assumed to be limited.

Insertion and traversal of the index is O(log(n)), where n is the number of facts in the store. Indexing becomes slower over time, but not rapidly so. Still, it is recommended that the store be turned over periodically, both to keep the index performant and to release resources on the device.

Turning over the store

Correspondence supports the idea of turning over a fact store. Two stores are kept at any time. One is read-only and the other is read-write. The read-write store collects new facts, and takes a copy of all relevant facts. The read-only store is frozen. It is consulted during queries, and its results merged with the results from the read-write store.

Over time, all of the relevant facts are copied to the read-write store, and the read-only store provides no additional value. After this point, the read-only store is archived and the read-write store becomes read-only. A new empty read-write store is created.

On a mobile device, the old store is destroyed rather than archived. All of its facts have been pushed to a server. It is assumed that any facts that the user still cares about were already copied to the next store.

I am currently working on push notifications to the phone. But when that is finished, I will implement historical trees in isolated storage. This data structure is specifically designed for Correspondence, so I expect it to perform extremely well. In fact, I’m considering porting it back to the desktop as an alternative to SSCE.

Historical tree example.png

A better Ignite Dallas video

Thursday, March 11th, 2010

Here’s a high quality video of last weekend with slides included.

Ignite video online

Sunday, March 7th, 2010

How do we collaborate through software?

  • The CAP Theorem – Eric Brewer
  • Eventual Consistency – Werner Vogels
  • Event Sourcing – Greg Young and Udi Dahan
  • Partial Order

Download the slides

How do we collaborate through software: Take 2

Thursday, February 25th, 2010

It is official. I have been invited to speak at Ignite Dallas.

In my prior recording of the talk, I neglected to mention the names of some folks who have been influential in collaborative application design. This new recording remedies that oversight.

I also modified the second half of the talk to tell the partial ordering story a little better. It still needs the slides in order to really get it, but fortunately those are coming along.

Give it a listen. Hope to see you next Wednesday!

How do we collaborate through software?

Saturday, February 20th, 2010

I am preparing for the upcoming Ignite Dallas event. 5 minutes, 20 slides that auto advance every 15 seconds. The final speaker list has not yet been determined, but I’m hopeful that I’ll have a chance to present.

I will be talking about the current thinking in collaborative applications.

  • The CAP Theorem
  • Eventual Consistency
  • Event Sourcing
  • Partial Ordering

The challenge is to condense all of this learning into 5 minutes, make it accessible, and make it entertaining. You can help.

Please take five minutes (actually 4:45) to listen to my rough draft. Leave comments on what I can improve, what I didn’t make clear, and anything I could leave out.

Then order your tickets to the event and see some really great performances. Or submit one of your own.

Topic specific sites

Saturday, October 10th, 2009

I now have three sites for topic-specific content.

I have seeded these sites with articles from Adventures in Software. But from now on, articles specific to those topics will go on those sites. Please subscribe to those RSS feeds as well as this one.

A new example of historic modeling

Wednesday, October 7th, 2009

I wrote up an example historic model that illustrates very succinctly some of the major points of the theory. It is a simple work item tracker.

So far, the example model can assign developers to projects. The next step is to allow developers to move work items between folders to track progress.

Currently, the model is only in text form. Soon, I will render it in source code using Correspondence. Eventually, it will integrate with TFS and HP Quality Center to act as a front end and perhaps a conduit between those systems.

If you would like to see a simple and reliable way of writing distributed smart clients, please follow along.

Append only databases, domain events, and historic models

Tuesday, September 8th, 2009

People are starting to talk about the concept of append-only models or domain events:

It's not a new idea, but it has seen increased awareness recently. The idea is that changes (a.k.a. events, or what I call "facts") become a part of the domain model. Instead of performing CRUD operations on records, a system keeps a history of these changes. The changes themselves are modeled as first-class entities, not as side-effects.

The advantages of this approach are many.

  • Auditability - The transaction history is the primary source of record.
  • Synchronization - Merge conflicts are easily detected.
  • Isolation - Publishers and subscribers know nothing about each other.

But there are some details that I haven't heard others mention, yet. These details are going to determine whether people adopt the tools and practices that will emerge from this idea.

Partial order
Many of the implementations of this idea impose a full order on the changes. Each transaction knows the time at which it is committed. The transactions then line up temporally to flow through the system. Before you can process one transaction, you must first have processed the transactions that came before.

This is a pattern I call the "Transaction Pipeline". It is highly restrictive. The only way to know the true result of a transaction is to know all transactions that have come before. This restriction takes what could be a decentralized model and gives it a central authority. The consequence is a loss of local authority.

In Greg Young's video, he draws a system in which changes flow from a client through a server and back again. The client captures the user's intent, then queues up those changes to be sent to the server in serial. The server examines each change, resolves merge conflicts, and updates a central database. The client runs all of its queries against that central database. The local client has no authority. Even though it knows that the user has requested a change, it cannot promise that change until it sees the results in the central database.

If the transactions were not so strictly ordered, the local client could make better decisions about the results of a change. Rather than ordering changes strictly by time, it could impose a partial order on facts. Facts that are related to one another must occur in the proper order. But unrelated facts are allowed to occur in any order. This allows a local client to make decisions based on a subset of the facts. It knows that it has enough information to make the decision, even if it doesn't yet have all of the information.

Parallel storage
Another common implementation detail is that objects are stored in parallel with changes. As the changes come in, they are stored in a queue. As they are processed, their effects are performed on relational data. The relational model can then be queried for current state.

A perfect example is the architecture of NServiceBus. The message bus is added as a mechanism for disparate components to indirectly communicate with one another. Once the component gets the message, it processes the command in a database. This is a natural extension of the way we build systems today.

Unfortunately, this kind of architecture makes it difficult to query the transaction stream itself. Before we can see the effect of a command, it must be executed against relational storage. A local client cannot easily combine the relational snapshot on the server with the outgoing queue of changes that it wishes to impose.

An alternative is to forego relational storage for a mechanism that can query the changes themselves. To find the current state of an entity, it is only necessary to locate all known changes to that entity. If those changes are organized in the right way, that query is as simple and performant as going to the relational database. But the advantage is that we can query through the message queue, not around it.

Disconnected operation
A smart client is often defined as one that can detect whether it is connected or disconnected, and change modes appropriately. These kinds of systems keep a local database as an off-line insurance policy against the eventuality of being disconnected. When the connection is reestablished, the client sends to the server all of the changes that have occurred since the last synchronization.

The problem with this model is that the client needs to switch modes. In "connected" mode, it acts like a dumb terminal. In "disconnected" mode, it acts like a rich client. This yields twice the code to write, test, and maintain. The developer has to decide which features will be available in disconnected mode, and degrade the experience graciously. On top of that is the complexity of switching modes, and the errors that might arise from inconsistencies between the two implementations.

It would be better for a smart client to operate in only one mode. The changes that the user requests are simultaneously committed to a local database and placed in a queue (preferably as one data structure instead of two). The only difference between connected and disconnected operation is whether that queue is currently being serviced. Either way, the application is blissfully unaware. The users changes are captured nonetheless. The developer only has to write the code once, so there is never a decision whether a feature will be supported while off-line. All features are available at all times.

The devil is in the details
Append-only or domain-event models are a powerful idea. But the fact that you only insert and never update or delete is only the beginning. It's great that events are modeled as part of the domain, but if they are just in addition to the domain then it hasn't been taken far enough. For this idea to gain the wide adoption that it deserves, the patterns and tools must impose partial order, forego parallel storage, and support disconnected operation.

For more information on a pattern and a tool that supports these concepts, please see Historic Modeling and Correspondence.