Historical Trees

In developing Correspondence for the Windows Phone 7, I’ve uncovered a need for a new data structure. I’m calling this data structure a Historical Tree, since it is designed specifically for historical modeling.

Correspondence allows for interchangeable storage strategies. The storage strategies currently implemented are:

  • SSCE (.NET)
  • SQL Server (.NET)
  • SQL Azure (.NET)
  • Postgres (Java)
  • Memory (both)

Most of these are built on a relational database, with the obvious exception of the memory storage strategy. Memory storage was originally intended for unit testing, but can sometimes be useful for small clients. It is not optimized for large data sets.

Windows Phone 7 does not have an out-of-the-box relational database. The storage strategy that you see in the videos is a modified memory strategy. It loads all of the facts from isolated storage into memory on startup. This is incredibly inefficient, and not the way that I’m going to ship. Instead, I’m going to rewrite the isolated storage storage strategy to use a historical tree.

Operations

The operations that a Correspondence storage strategy must support are:

  • Insert a fact having data and references to predecessors.
  • Given a fact reference, load its data and predecessor references.
  • Given a hash code, load a fact reference.
  • Given a predecessor reference, find all successor references.

Update and delete are not valid operations on a Correspondence store. When a fact is created, all of its predecessors are known. Successors are all of the facts that reference a predecessor.

Append only

imageimage This set of operations suggests an append-only data structure.

You can add new facts to the end of the structure, but you cannot insert or remove them from the middle. Furthermore, all predecessor references are fixed and known at the time of insert. These will necessarily be references to facts earlier in the structure.

The only part of the data structure that changes is the list of successors. When a new fact is appended, it has to insert itself into the lists of each of its predecessors. To support this, a fact will participate in one linked list per predecessor.

Each fact has a pointer to the head of its linked list of successors. This is the fact most recently appended that references this fact as a predecessor.

Then, for each predecessor, the fact identifies the role that the predecessor plays, the reference to the predecessor, and the reference to the next successor of that predecessor. To traverse the list of successors, first follow the head. Then find the matching predecessor reference and follow its next pointer. This will visit successors from most recent to least recent.

Finally, the fact has a payload of data. This includes the type, the version, and values of all fields. The data appears last because it is not used while traversing the tree.

The only field of a record that can change is the head pointer (highlighted in red). All other fields are immutable. To protect against corruption, the head is redundant. Each record holds two pointers and an A/B switch. The switch determines which of the two pointers is active. During modification, the new pointer is written to the inactive area, and then the switch is flipped.

Insertion

To insert into this structure, a new record is appended to the file. Initially, the head pointer is null. The next pointers are copies of the head pointers of each predecessor.

Then for each predecessor, the new record’s position is stored as their head. This operation is done using the A/B switch so that the new pointer does not overwrite the active one.

If a fact references the same predecessor twice (possibly in two different roles), then both of the next pointers contain a copy of the previous head. Traversal will follow only the first of them, so the successor will appear only once in the linked list.

Traversal

References to all predecessors are immediately available from any record. References to successors, on the other hand, follow the linked list from the head.

When following the linked list, the predecessor sub-records are scanned for the one matching the desired fact. The next pointer of the first match is followed. If that pointer is null, then traversal is complete.

Hash code

The historical tree supports all but one required operation: hash code to fact reference. This operation is instead supported by a parallel index data structure. A red-black tree seems the best fit for this operation.

When a fact is inserted, its hash code is calculated. That hash code is inserted into a red-black tree, with a reference back to the fact’s position in the historical tree.

Facts are written to the historical tree before being added to the red-black index tree. If the index becomes corrupted, then in the worst case it can be rebuilt from the historical tree. In the best case, the corruption can be repaired by re-adding facts from the end of the historical tree.

Complexity

The complexity of an insert or traversal is not affected by the total size of the data set. Insertion complexity is proportional to the number of predecessors, which is fixed and based on the design of the model. Traversal of predecessors is similarly proportional to the number of predecessors. A well-designed model will limit the number of predecessors.

Traversal of successors is proportional to the number of successors times the average number of predecessors that each predecessor has. It requires walking a linked list, which is O(n). But at each node, the list of predecessors is scanned to find the next pointer. Because of this, we multiply the average number of predecessors. Based on the design of the model, this number is assumed to be limited.

Insertion and traversal of the index is O(log(n)), where n is the number of facts in the store. Indexing becomes slower over time, but not rapidly so. Still, it is recommended that the store be turned over periodically, both to keep the index performant and to release resources on the device.

Turning over the store

Correspondence supports the idea of turning over a fact store. Two stores are kept at any time. One is read-only and the other is read-write. The read-write store collects new facts, and takes a copy of all relevant facts. The read-only store is frozen. It is consulted during queries, and its results merged with the results from the read-write store.

Over time, all of the relevant facts are copied to the read-write store, and the read-only store provides no additional value. After this point, the read-only store is archived and the read-write store becomes read-only. A new empty read-write store is created.

On a mobile device, the old store is destroyed rather than archived. All of its facts have been pushed to a server. It is assumed that any facts that the user still cares about were already copied to the next store.

I am currently working on push notifications to the phone. But when that is finished, I will implement historical trees in isolated storage. This data structure is specifically designed for Correspondence, so I expect it to perform extremely well. In fact, I’m considering porting it back to the desktop as an alternative to SSCE.

Historical tree example.png

Leave a Reply

You must be logged in to post a comment.