Wednesday, May 29, 2013

Annotating Git History

As is well known, git history is sacred. There are good reasons for it, but it does make it difficult sometimes to correct errors and integrate git into larger processes.

For example, it is very common to include references to issue tracking systems in git commit comments. It would be nice if one could mine this information from git in order to update the issues automatically, for example at build time or at deploy time.

Unfortunately, we're all human and make mistakes. Even if you have validation hooks and do the best possible effort, invalid references or forgotten references make it difficult to rely on the commit comments.

It would be nice if it was easier to fix old git commit comments, but history being immutable is one of the core concepts baked into git, and that's a good thing. So I went for a different solution.

How about attaching annotated git tags to commits? That would preserve the history while adding annotations with newer information. The only trouble with this is that the current git commands aren't very good at revealing this information to casual git users.

The convention I currently use is quite simple: attach a tag whose name begins with CANCEL to the commit to be updated.

Let's say we wish to correct this commit:
% git log -1 5a434a
commit 5a434ac808d9a5c10437b45b06693cf03227f6b3
Author: Christian Goetze <>
Date:   Wed May 29 12:57:57 2013 -0700

    TECHOPS-218 Render superceded messages.
Attach the tag, like this:
% git tag -m "new commit message" CANCEL_5a434a 5a434a
Now, the build and deploy scripts can mine this information. They start by listing all the tags attached to the commit:
% git show-ref --tags -d \
     | grep '^'5a434a \
     | sed -e 's,.* refs/tags/,,' -e 's/\^{}//'
Then, for every tag, we check whether it's an annotated tag:
% git describe --exact-match CANCEL_5a434a
Note that this command fails if it's a "lightweight" tag. Once you know it's an annotated tag, you can run:
% git cat-file tag CANCEL_5a434a
object 5a434ac808d9a5c10437b45b06693cf03227f6b3
type commit
tag CANCEL_5a434a
tagger Christian Goetze <> 1369865583

new commit message
This is easy to parse, and can be used as a replacement for the commit comment.

And what if there is another error in the annotation? Simply add another tag with the correction. The number in red above is the unix timestamp expressed as seconds since 1970, so you can sort the annotations by time and just take the latest.

By the way, if anyone knows a nicer way to show all tags attached to a specific commit, please let me know.

Friday, May 17, 2013

Reinventing a Wheel: Hierarchical Perishable Locks

I've written before about the perennial build or buy dilemma, and here it is again.

I was desperate for a solution to the following kinds of problems:
  • I have a bunch of heavyweight tests running in parallel, but using the same database. These tests unfortunately tend to create a mess, and every once in a while the database needs to be reset. During the reset, no access by automated tests should be permitted.
  • Integration and UI tests run against a deployed version of the software. Most of these tests can be run concurrently, but not all - but more importantly, none of those tests will deliver a reliable result if a new deploy yanks away the services while a test against them is running. (yes, I know, they should be "highly available", but they aren't, at least not on those puny test environments).
This cries out for some sort of locking mechanism.

Ideally, it should be a hierarchical locking mechanism to support controlled sharing for concurrent tasks, while still being able to lock the root resource.

If you use naive methods, you are likely to get hurt in several ways:
  • Builds are fickle. They fail, crash or get killed by impatient developers. Any mechanism that requires some sort of cleanup action as part of the build itself will not make you very happy.
  • Developers have a strong sense of fairness and will get impatient if some builds snag a lock right out from under their nose. So we need a first come first serve queuing mechanism. 
  • If you use touch files with timestamps or some similar device, you run up against the lack of an atomic "test and set" operation, and also run the risk of creating a deadlock.
I've googled and searched, and even though I found lots of interesting papers, I haven't found anything remotely close to what I wanted. Most systems I found solve problems way more complicated that what I want here.

I'm sure that the moment I hit publish, I'll locate the original wheel hiding in plain sight someplace. Maybe some kind reader will reveal it for me...

Meanwhile, I have something that works.

The basic idea is simple:
  • My services keeps a hash (or dict) of queues.
  • Every time someone requests a resource, the queue for that resource is either created from scratch, or, if it already exists, searched for an entry by the same requester. 
  • If the request is already in the queue, a timestamp in the request is updated, otherwise the the request is added at the end of the queue.
  • If the request is first in line, we return "ok", otherwise we return "wait".
  • In regular intervals, we go through all queues and check if the head item has timed out. If yes, we remove it, and check the next item, removing all the dead items until we find an unexpired one, or until the queue is empty.
  • If a request to release an resource comes in, we check the queue associated with the resource and set the timestamp to zero. This will cause the request to be automatically purged when it comes up.
This is about 100 lines of code in node. Even though node is quite controversial, it is ideally suited for the task at hand:
  • The event loop model ensures my code "owns" the data while processing a request. No shared memory or threading issues.
  • The requests and responses are very short, so no hidden traps triggered by folks submitting megabyte sized PUT requests .
  • The queues tend to be short, unless you have a big bottleneck, which really means you have other problems. This means that processing the requests is essentially constant time. A large number of resources is not really a problem, since it is easy to shard your resource set.
Adding hierarchical locks to this service is relatively simple. We only need to change two things:
  • We introduce shared vs exclusive locks. Shared locks have many owners, each with their own timestamp. When enqueuing a shared lock, we first check if the last element in the queue is also a shared lock, and if yes, we merge it instead of adding a new lock to the queue.
  • We introduce resource paths, and request an exclusive lock for the last element in the path, but shared locks for all the parent elements.
This appears to give us the right behavior. So, in our database example at the beginning, the tests would each say:
grab mydatabase/mytest
This will create a shared lock on mydatabase, and an exclusive lock on mydatabase/mytest.

If the lock holder reaches the front of both queues, then the grab request will succeed, and the task can proceed.

If another task comes along and requests:
grab mydatabase/yourtest
then the shared lock request for mydatabase is merged, and that task will also be at the head of the queues for all of its resources, and can proceed.

Now, if the big database cleanup task comes along, it will just say:
grab mydatabase
Since mydatabase here is at the end of the resource path, it will request an exclusive lock which will not be merged with the previous ones, but queued after them. The cleanup task will have to wait until all of the owners of the shared lock release their part, and only then can it proceed.

Should one of the test tasks decide to give it a try, they will again end up requesting a shared lock on mydatabase. Since a shared lock cannot be merged with an exclusive lock, the new shared lock will be queued after the exclusive lock, and the test task will have to wait until the cleanup task is done. Fairness is preserved, and it appears that we have all the right properties....

Code is here - please be gentle :)

Monday, May 6, 2013

Why is it so hard to collect release notes (part III)

Moving right along (see parts one and two), we were using a simple system with two entry points and one piece of shared code.

We represent the state using a build manifest, expressed as a  JSON file:

{ "Name": "January Release",
  "Includes: [{ "Name": "",
                "Rev":  "v1.0",
                "Includes": [{ "Name": "common",
                               "Rev":  "v1.0" }]}
              { "Name": "",
                "Rev":  "v1.0",
                "Includes": [{ "Name": "common",
                               "Rev":  "v1.0" }]}]}

After some work, both the Foo team and the Bar team made some independent modifications, each of which made their end happy.

Since we're always in a hurry to release, we defer merging the common code, and release what we have instead. Now we would like to know what changed since the initial state.

First, we construct our new build manifest:

{ "Name": "January Release",
  "Includes: [{ "Name": "",
                "Rev":  "v2.0",
                "Includes": [{ "Name": "common",
                               "Rev":  "v2.0" }]}
              { "Name": "",
                "Rev":  "v1.1",
                "Includes": [{ "Name": "common",
                               "Rev":  "v1.1" }]}]}
In the previous post of the series, we cheated by only modifying one endpoint at a time, and leaving the other one unchanged. Now we change both, and it turns out the answer depends on how you ask the question: git log ^v1.0 v1.1 # in
         git log ^v1.0 v1.1 # in common git log ^v1.0 v2.0 # in
         git log ^v1.0 v2.0 # in common
whole: git log ^v1.0 v1.1      # in
       git log ^v1.0 v2.0      # in
git log ^v1.0 v1.1 v2.0 # in common
Why bring this up? Well, in the previous post I made an argument that it is not only convenient, but necessary to combine the start points of the revision ranges used to get the changes. The same is not true for the end points, as we cannot by any means claim that v2.0 of common is actually in use by

So, in order to preserve the ability to answer the question "what changed only in", as opposed to the whole system, we need to keep the log output separate for each endpoint.

Now we are pretty close to an actual algorithm to compare two build manifests:
  1. Traverse the old manifest and register the start points for every repo
  2. Traverse the new manifest and register the end points for every repo, and which entry point uses the end points.
  3. Traverse all registered repos, and for every registered endpoint, run:
    git log ^startpoint1 ^startpoint2 ... endpoint
    and register the commits and the entry points to which they belong.