Saturday, January 21, 2012

Building an Artifact Registry Service (Part 1)

In previous posts, I made the point that builds are precious. Managing the artifacts produced by those builds is a problem that is very similar to source control. In fact, it's so similar that many shops will store build artifacts in the same version control system used for the source itself.

Storing large blobs of binary data in a revision control system has the advantage of simplicity, but also many drawbacks:
  • Revision control systems are optimized for text files and provide features to efficiently diff and merge them. All that machinery is useless for binary blobs.
  • Most revision control systems rely on changeset deltas to efficiently store multiple revisions. This doesn't work well for binary blobs, so the storage requirements for your repository go way up.
  • Checking out a local copy of the source tree retrieves every binary blob you might conceivably need, most of which you probably won't. Because downloading all those blobs takes time, people will be reluctant to create multiple checkouts as needed, which encourages the bad practice of developing multiple independent changes in the same workspace. It also encourages the use of incremental builds for continuous integration, which poses significant challenges for the build system.
Some of these challenges can be addressed by smarter build systems and better use of multiple repositories or branches, but in the end, you will quickly come to the conclusion that storing binary blobs separately from the source code in dedicated repositories is the right thing to do:
"Version the reference, not the blob."
Once you start doing that, another problem crops up: the references change often. If you store the reference in a source file, you will find yourself editing that file very often, or maybe even having the build system edit the file and check it in. Unfortunately, this practice is very common, as various build systems like Ant pride themselves on having built-in features to support it.

Going down the road of having a build system modify source code is a dangerous path. You will need to address the question of how to resolve the inevitable merge conflicts that arise when multiple branches and builds are happening at the same time.

A better way to proceed is to examine how packaging systems resolve these problems. Packaging systems will rarely reference a specific package by version. Usually, the reference is of the form: "I need package X" or "I need at least version 1.3 of package Y". These dependency references need to be edited only rarely, and usually in conjunction with a code change that affects those requirements - and those are perfectly good reasons to edit a source file and cause merge trouble for others.

Packaging systems will then attempt to resolve the dependencies using all the repositories they know about. This is very similar, btw, to the way most linkers resolve dependencies. You have two components:
  • What you need (think -l flag);
  • Where to find it (think -L flag).
This separates the concerns of defining what you require vs where you'll get it. In a development build, you might want to use your unstable or experimental artifacts, but in a production build you only wish to use blessed production quality artifacts. This separation allows you to avoid having to make source code changes to support the two use cases.
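
To make the what/where split concrete, here is a minimal sketch in Python. The names `Requirement` and `resolve`, the repository names, and the version lists are all made up for illustration; a real packaging system has much richer rules. The point is only that the requirement stays fixed while the search path changes between a development and a production build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    name: str                  # what you need (the "-l" part)
    min_version: tuple = (0,)  # e.g. (1, 3) for "at least version 1.3"


def resolve(req, search_path):
    """Return (repo_name, version) for the newest match in the first
    repository on the search path that can satisfy the requirement."""
    for repo_name, catalog in search_path:  # where to look (the "-L" part)
        candidates = [v for v in catalog.get(req.name, []) if v >= req.min_version]
        if candidates:
            return repo_name, max(candidates)
    raise LookupError(f"no repository satisfies {req}")


# Hypothetical repository contents: package name -> available versions.
production = ("production", {"Y": [(1, 2), (1, 3)]})
unstable = ("unstable", {"Y": [(1, 3), (1, 4)]})

req = Requirement("Y", (1, 3))                      # checked in, rarely edited
dev_choice = resolve(req, [unstable, production])   # development build
prod_choice = resolve(req, [production])            # production build
```

The same checked-in requirement picks up version 1.4 from the unstable repository in the development build and version 1.3 from the production repository in the production build, with no source change in between.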

So, to recap:
  • Checking in binary blobs in revision control systems is bad;
  • Checking in explicit references into source files is slightly better, but still bad if those references change often (as they will in a live software development project);
  • Checking in references as requirements and using a system to resolve those requirements using a source of knowledge of what is available is good.
But what is that source of knowledge?
  • A set of artifact repositories
  • A registry service or indexing service to help you locate the right artifact for your build and track exactly what the artifact consists of.
Before we even get there, we really need a good way to specify the version of an artifact, preferably automatically and resulting in a simple version number that behaves in ways humans can understand:
  • It must map to a specific feature set (i.e. source code state)
  • It must increase monotonically over time (i.e. a higher number means a newer version)
The almost universally accepted way to represent a version number works well:
major.minor.patch.buildnr
The trick is to let humans worry about the major.minor.patch bikeshed, but let the build number be generated automatically. As long as the build number by itself fulfills our two conditions above, we're good.
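
As a small illustration, assuming the build number has already been generated somewhere, the full version can be assembled and compared like this. The helper names `full_version` and `parse_version` are mine, not part of any existing tool. Note that versions have to be compared field by field as numbers; a plain string comparison gets it wrong as soon as a field reaches two digits.

```python
def full_version(major, minor, patch, build_nr):
    """Combine the human-chosen part with the generated build number."""
    return f"{major}.{minor}.{patch}.{build_nr}"


def parse_version(text):
    """Turn 'major.minor.patch.buildnr' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


older = full_version(1, 2, 0, 9)
newer = full_version(1, 2, 0, 10)

numeric_order_ok = parse_version(newer) > parse_version(older)  # compares 10 > 9
string_order_ok = newer > older                                 # compares "1" > "9"
```

Here `numeric_order_ok` is true and `string_order_ok` is false, which is why the build number should be treated as a number everywhere it is sorted.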

Building a service to do that part is quite simple, as all revision control systems have some method for identifying a source code state:
  • svn and perforce have changeset numbers that already display good monotonic behavior;
  • git and mercurial have changeset hashes, which would need to be converted to numbers by our registry service;
  • Any system will at minimum have a unique id for every file revision, and a changeset can be defined as the hash constructed from the sorted list of all the file revisions touched by that change. Pass that hash to the registry service and obtain the number.
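
For the last case, a changeset hash could be built along these lines. This is only a sketch: the choice of SHA-1, the `path@revision` encoding, and the function name `changeset_hash` are assumptions of mine, and any stable encoding would do as long as everyone uses the same one.

```python
import hashlib


def changeset_hash(file_revisions):
    """Hash the sorted list of (path, revision id) pairs touched by a change.

    Sorting first means the result does not depend on the order in which
    the revision control system happens to report the files.
    """
    digest = hashlib.sha1()
    for path, revision in sorted(file_revisions):
        digest.update(f"{path}@{revision}\n".encode("utf-8"))
    return digest.hexdigest()


# Hypothetical file revisions touched by one change.
change = [("src/main.c", "14"), ("include/main.h", "6")]
h = changeset_hash(change)
```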
So our artifact registry service would start out as a simple REST service with two calls:
http://<someplace.com>/ars/changeset/<hash>
returning a number, either the number already assigned to the given hash, or a new number if that hash is not known to the system, and:
http://<someplace.com>/ars/build/<nr>
delivering back the hash associated with the build.
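
The logic behind those two calls fits in a few lines. The sketch below keeps everything in memory and leaves out the actual HTTP server; the class name `BuildNumberRegistry`, the `handle` function, and the status codes are assumptions for illustration, and a real service would need persistent storage.

```python
import itertools
import threading


class BuildNumberRegistry:
    """In-memory mapping between changeset hashes and build numbers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = itertools.count(1)
        self._nr_by_hash = {}
        self._hash_by_nr = {}

    def number_for(self, changeset_hash):
        """Return the number already assigned to the hash, or assign a new one."""
        with self._lock:
            nr = self._nr_by_hash.get(changeset_hash)
            if nr is None:
                nr = next(self._next)
                self._nr_by_hash[changeset_hash] = nr
                self._hash_by_nr[nr] = changeset_hash
            return nr

    def hash_for(self, nr):
        """Return the hash associated with a build number, or None if unknown."""
        with self._lock:
            return self._hash_by_nr.get(nr)


def handle(registry, path):
    """Map a request path such as /ars/changeset/<hash> to (status, body)."""
    parts = path.strip("/").split("/")
    if len(parts) == 3 and parts[0] == "ars":
        if parts[1] == "changeset":
            return 200, str(registry.number_for(parts[2]))
        if parts[1] == "build" and parts[2].isdigit():
            found = registry.hash_for(int(parts[2]))
            return (200, found) if found else (404, "unknown build")
    return 404, "not found"
```

One thing this sketch makes visible: numbers are handed out in the order hashes are first seen, so the numbering follows the order of the source history only if changesets are registered in that order.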

With a simple service like this, you can already achieve a lot of things. Most importantly, that number can be used to index into an artifact repository, and since that number is monotonically increasing, it is simple to get the latest build of an artifact.
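
For example, if an artifact repository is indexed by build number, finding the latest build is just a matter of taking the highest key. The repository layout and the names below are hypothetical.

```python
# Hypothetical repository index: artifact name -> {build number: storage location}.
repository = {
    "libfoo": {
        101: "repo/libfoo/101/libfoo.tar.gz",
        117: "repo/libfoo/117/libfoo.tar.gz",
        109: "repo/libfoo/109/libfoo.tar.gz",
    },
}


def latest_build(index, artifact):
    """Return (build number, location) of the newest build of an artifact."""
    builds = index[artifact]
    nr = max(builds)  # works because build numbers increase monotonically
    return nr, builds[nr]


newest = latest_build(repository, "libfoo")
```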

But, as we will see, once we have this service, it becomes the natural storage location for much of the metadata associated with a build:
  • artifact name
  • artifact repository location(s)
  • platforms and variants built
  • changesets included in that build
  • dependencies included in that build
  • source code repository location(s) used in the build
  • ...
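
As a rough idea of what such a record might look like, here is a sketch using a Python dataclass. The field names simply mirror the list above and the sample values are invented; the actual shape of the metadata is not settled here.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class BuildRecord:
    build_nr: int
    changeset_hash: str
    artifact_name: str
    repository_locations: list = field(default_factory=list)
    platforms: list = field(default_factory=list)         # platforms and variants built
    changesets: list = field(default_factory=list)        # changesets included in the build
    dependencies: dict = field(default_factory=dict)      # dependency name -> build number
    source_locations: list = field(default_factory=list)  # source repositories used


# Invented sample values, for illustration only.
record = BuildRecord(
    build_nr=117,
    changeset_hash="abc",
    artifact_name="libfoo",
    repository_locations=["repo/libfoo/117/"],
    platforms=["linux-x86_64/debug", "linux-x86_64/release"],
    dependencies={"libbar": 98},
)
as_json_ready = asdict(record)  # plain dict, easy to return from a REST call
```
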
To be continued....



1 comment:

  1. Another reason you don't want to store derived objects in the source tree is that you then run a high risk of inconsistent states. Leaving aside all the issues around merging the results of parallel builds, the mere fact that people have to perform a build prior to checking in is a huge burden (and conversely, having to distinguish between "dirty checkins" and "good checkins" is also very hard - think multi-platform build scenarios).

    It's better to store your derived objects separately and track their lineage.
