database design for revision history

So, I was working on a personal project yesterday, and I came across a bullet point in my notes that looked like this:
  • implement item revisions
And it stopped me cold in the middle of a pretty productive session.

I originally designed my database to represent the "current" state of the system, so my central table, Item, doesn't yet have a concept of "history" or "latest revision." I don't mind changing it; it's still a young system and very much a work in progress. But the question is: what's the right way to represent an item that's been through an arbitrary number of revisions?

The constraints are:
  • The solution should be clean (normalized or nearly normalized)
  • The solution should be simple to code against
  • The solution should be high performance and very scalable
  • The solution should be applicable to other tables if/as needed
So after reading around and thinking, four options popped up.

1. Rebuild it on the fly:

Never update anything, just add a new "RevisionItem" that points to the original item. When you fetch an item, also fetch its latest RevisionItem, and in most cases render that instead of the original. This will cost you performance, since every read now needs a second lookup to find the newest revision, but it does make the revision history absolutely clear, and it means not modifying the Item table. As a side note, this method fits well into the design pattern of this particular project, where items generally have other items (comments, tasks, sponsorships) associated with them.
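Here's a rough sketch of what option 1 might look like, using Python's built-in sqlite3 module. The title and body columns, the created_at column, the index, and the fetch_current helper are all made up for illustration; my real Item table looks different, and this is one possible reading of the idea rather than a finished design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,   -- hypothetical column
    body  TEXT NOT NULL    -- hypothetical column
);
-- Each revision carries the revised fields and points back at the
-- original Item row, which is never updated.
CREATE TABLE RevisionItem (
    id         INTEGER PRIMARY KEY,
    item_id    INTEGER NOT NULL REFERENCES Item(id),
    title      TEXT NOT NULL,
    body       TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_revisionitem_item ON RevisionItem(item_id, id);
""")

def fetch_current(conn, item_id):
    """Return (title, body) from the newest revision, or from the original Item if none exists."""
    row = conn.execute(
        "SELECT title, body FROM RevisionItem "
        "WHERE item_id = ? ORDER BY id DESC LIMIT 1",
        (item_id,),
    ).fetchone()
    if row is None:
        row = conn.execute(
            "SELECT title, body FROM Item WHERE id = ?", (item_id,)
        ).fetchone()
    return row

conn.execute("INSERT INTO Item (id, title, body) VALUES (1, 'Draft', 'first text')")
before = fetch_current(conn, 1)

# Revising means inserting, never updating.
conn.execute("INSERT INTO RevisionItem (item_id, title, body) VALUES (1, 'Draft', 'second text')")
conn.execute("INSERT INTO RevisionItem (item_id, title, body) VALUES (1, 'Final', 'third text')")
after = fetch_current(conn, 1)
```

The extra query in fetch_current is the performance cost mentioned above; the index on (item_id, id) is my guess at how to keep that lookup cheap.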


2. Log the old versions:

You can keep an "ItemRevisionLog" table, which stores old versions of the items, while the Item table always stores the most up-to-date version. The advantage is that you don't have to change the Item table at all, and you keep it from bloating. The disadvantage is that you have two tables with the same columns, which is not great design, and if you change one you have to change the other.
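A minimal sketch of option 2, again with Python's sqlite3. The title and body columns, the log_id and archived_at columns, and the update_item helper are assumptions for the sake of the example. The idea is just to copy the outgoing row into the log inside the same transaction as the update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,   -- hypothetical column
    body  TEXT NOT NULL    -- hypothetical column
);
-- Same data columns as Item, plus bookkeeping. This duplication is
-- exactly the design smell described above.
CREATE TABLE ItemRevisionLog (
    log_id      INTEGER PRIMARY KEY,
    item_id     INTEGER NOT NULL,
    title       TEXT NOT NULL,
    body        TEXT NOT NULL,
    archived_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")

def update_item(conn, item_id, title, body):
    """Archive the current row, then overwrite it, in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO ItemRevisionLog (item_id, title, body) "
            "SELECT id, title, body FROM Item WHERE id = ?",
            (item_id,),
        )
        conn.execute(
            "UPDATE Item SET title = ?, body = ? WHERE id = ?",
            (title, body, item_id),
        )

conn.execute("INSERT INTO Item (id, title, body) VALUES (1, 'Draft', 'first text')")
update_item(conn, 1, 'Draft', 'second text')
update_item(conn, 1, 'Final', 'third text')

current = conn.execute("SELECT title, body FROM Item WHERE id = 1").fetchone()
history = conn.execute(
    "SELECT title, body FROM ItemRevisionLog WHERE item_id = 1 ORDER BY log_id"
).fetchall()
```

Reads of the current item stay exactly as they were, which is the appeal. The same copy step could probably live in a database trigger instead of application code.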

3. Add an isLatest flag to your data:

You can just keep every version of the Item in the Item table, and use a flag to find the current version. This runs into problems because every Item query now has to filter on the flag and wade through a table that grows with every edit, and also you need to think very carefully about your primary key. (That is, when other tables link to an item, they are now linking to many item revisions, not a single row.) Still, it's not unworkable. Just a little ugly.
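To make the primary key problem concrete, here's a sketch of option 3 with sqlite3. I'm assuming a surrogate row_id as the primary key plus a separate item_id that all versions of one item share; those names, the title and body columns, the partial unique index, and the save_version helper are my own inventions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (
    row_id    INTEGER PRIMARY KEY,          -- identifies one version
    item_id   INTEGER NOT NULL,             -- shared by all versions of an item
    title     TEXT NOT NULL,                -- hypothetical column
    body      TEXT NOT NULL,                -- hypothetical column
    is_latest INTEGER NOT NULL DEFAULT 1
);
-- Partial unique index: at most one row per item may carry the flag.
CREATE UNIQUE INDEX one_latest_per_item ON Item(item_id) WHERE is_latest = 1;
""")

def save_version(conn, item_id, title, body):
    """Clear the flag on the old latest row, then insert the new one."""
    with conn:  # both statements succeed or neither does
        conn.execute(
            "UPDATE Item SET is_latest = 0 WHERE item_id = ? AND is_latest = 1",
            (item_id,),
        )
        conn.execute(
            "INSERT INTO Item (item_id, title, body) VALUES (?, ?, ?)",
            (item_id, title, body),
        )

save_version(conn, 1, 'Draft', 'first text')
save_version(conn, 1, 'Final', 'second text')
save_version(conn, 2, 'Other', 'unrelated')

# Every "normal" query has to remember the is_latest filter.
latest = conn.execute(
    "SELECT item_id, title FROM Item WHERE is_latest = 1 ORDER BY item_id"
).fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM Item").fetchone()[0]
```

Other tables would then have to choose between referencing row_id (one frozen version) and item_id (which is not a key on its own), which is the ugliness I mean.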

4. Treat revisions as first-class data, and normalize them:

Make a table called ItemRevision, put into it everything about an item that you want to be able to revise, and add columns for date, editor, and item_id. Then take the revisable columns out of Item, and instead give it an ItemRevision reference called "latestRevision." Each revision is now available when needed, and that data is only written once. Retrieval is also a simple join by id. The downside is you have to be very explicit about what is revisable, and you have to change your data design every time you change your mind about what can be revised (e.g. right now).
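Here's how option 4 might be sketched with sqlite3. The revisable columns (title, body), the created_at column, the helper function names, and the sample editors are all assumptions for illustration; the date, editor, item_id, and latest-revision pieces are the ones described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Item (
    id                 INTEGER PRIMARY KEY,
    created_at         TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    latest_revision_id INTEGER REFERENCES ItemRevision(id)  -- "latestRevision"
);
CREATE TABLE ItemRevision (
    id        INTEGER PRIMARY KEY,
    item_id   INTEGER NOT NULL REFERENCES Item(id),
    editor    TEXT NOT NULL,
    edited_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    title     TEXT NOT NULL,   -- hypothetical revisable column
    body      TEXT NOT NULL    -- hypothetical revisable column
);
CREATE INDEX idx_itemrevision_item ON ItemRevision(item_id);
""")

def _write_revision(conn, item_id, editor, title, body):
    """Insert a revision and repoint the item at it (caller owns the transaction)."""
    rev_id = conn.execute(
        "INSERT INTO ItemRevision (item_id, editor, title, body) VALUES (?, ?, ?, ?)",
        (item_id, editor, title, body),
    ).lastrowid
    conn.execute(
        "UPDATE Item SET latest_revision_id = ? WHERE id = ?", (rev_id, item_id)
    )
    return rev_id

def create_item(conn, editor, title, body):
    with conn:
        item_id = conn.execute("INSERT INTO Item DEFAULT VALUES").lastrowid
        _write_revision(conn, item_id, editor, title, body)
    return item_id

def revise_item(conn, item_id, editor, title, body):
    with conn:
        return _write_revision(conn, item_id, editor, title, body)

def fetch_current(conn, item_id):
    """The 'simple join by id': Item -> its latest ItemRevision."""
    return conn.execute(
        "SELECT r.title, r.body, r.editor FROM Item AS i "
        "JOIN ItemRevision AS r ON r.id = i.latest_revision_id "
        "WHERE i.id = ?",
        (item_id,),
    ).fetchone()

item_id = create_item(conn, 'alice', 'Draft', 'first text')
revise_item(conn, item_id, 'bob', 'Final', 'second text')
current = fetch_current(conn, item_id)
revision_count = conn.execute(
    "SELECT COUNT(*) FROM ItemRevision WHERE item_id = ?", (item_id,)
).fetchone()[0]
```

Note that the two tables reference each other, so a new item has to be inserted with an empty latest_revision_id first and pointed at its first revision afterwards. Doing both in one transaction keeps anyone from seeing the half-built state.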

***

So I'm going to cut to the chase. Number 4 really seems to be the right answer. Anyway, that's what Wikipedia does, and their capacity for revision management is [unquestioned? unassailable? practically idiomatic?]. Also, I found this great chart of the Wikipedia schema, which, if you're into that kind of thing, is pretty cool. It's pretty clean, for all the work it does. Color me impressed.

You can see how they handle page revisions right in the middle of the graph. The Page table points to a Latest Revision, while each row in the Revision table points back to its Page. A separate Text table stores the actual page text, for reasons that are not immediately clear to me, but that I will assume don't really apply to my much smaller and simpler system. In any case the chart, working in combination with a good night's sleep, has convinced me.

This has been a fun puzzle for me, and once I stop grumbling about rewriting my Item data access to go through a "revision" object, it will make me a better person.
