Thursday, September 6, 2012

Build Manifest Beats Artifact Registry Service

There's been a long hiatus in posting - I've been busy at a new place, finding ways to apply many of the techniques described in this blog.

One of the items I emphasized on this blog, the Artifact Registry Service, turns out to be something of an overkill. In practice, its functionality overlaps with two well-entrenched tools:
  • Distributed VCSs, which have well-debugged and optimized methods for storing and traversing directed acyclic graphs, otherwise known as revision histories;
  • Artifactory or similar artifact repository systems, combined with Ivy or similar publish and retrieval tools.
Specifically, the Java ecosystem has evolved usage patterns that are not worth breaking, and retrofitting an artifact registry service into these patterns is likely wasteful, both in time and mind share. There is a better solution, which has the additional advantage of being a lot simpler to implement.

The basic purpose of the artifact registry service is to answer questions like these:
  • Is my fix in this build?
  • What's new in the build compared to what's in production?
  • What's the difference between two builds?
These questions can be answered using build manifests combined with access to the VCS.

A Build Manifest, in a minimal form, can be a JSON file containing three items:
{ "repo": "ssh://github/someone/wonder-app.git",
  "rev": "1c978e4645c6b117215ea5050dc1e39c90326780",
  "includes": [] }
This JSON file is generated at build time and records the repository and changeset id used for the build. The generated file can then be included in the artifact, which is then published or otherwise reused in subsequent builds and deploys.
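
As a rough illustration of the generation step, here is a minimal Python sketch. It assumes the VCS is git (the post itself is VCS-agnostic), and the function names and the output file name build-manifest.json are hypothetical choices made for this example only.

```python
import json
import subprocess


def current_revision(workdir="."):
    """Return the full commit id of the checked-out revision (assumes git)."""
    out = subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=workdir)
    return out.decode("ascii").strip()


def make_manifest(repo, rev, includes=None):
    """Build the three-item manifest: repository URL, revision, dependencies."""
    return {"repo": repo, "rev": rev, "includes": list(includes or [])}


def write_manifest(manifest, path="build-manifest.json"):
    """Write the manifest next to the build output (file name is an assumption)."""
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2)
```

A build script could then call write_manifest(make_manifest(repo_url, current_revision())) just before packaging, so the file ends up inside the artifact.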

If a build depends on other artifacts, then those artifacts can be scanned for build manifests. Suppose the wonder-app in the example above depends on "useful-library". If that library artifact is built using the same method, it would have a build manifest of its own:
{ "repo": "ssh://github/someone/useful-lib.git",
  "rev": "80a8d9b691febae54868835816a7bfea1968a415",
  "includes": [] }
Now, when building wonder-app, the build manifest can reflect this dependency by including the library build manifest in the newly generated build manifest:
{ "repo": "ssh://github/someone/wonder-app.git",
  "rev": "1c978e4645c6b117215ea5050dc1e39c90326780",
  "includes": [
    { "repo": "ssh://github/someone/useful-lib.git",
      "rev": "80a8d9b691febae54868835816a7bfea1968a415",
      "includes": [] }
  ] }
This is a very simple way of recording the complete dependency chain for a build, and it can be extended to represent complete systems running a variety of concurrent services.
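
The scanning of dependency artifacts could look roughly like the sketch below. It assumes the artifacts are jar or zip files and that each one carries its manifest under the entry name build-manifest.json; both the entry name and the function names are hypothetical.

```python
import json
import zipfile

# Hypothetical entry name under which each artifact stores its manifest.
MANIFEST_NAME = "build-manifest.json"


def read_manifest(artifact_path):
    """Return the manifest stored inside a jar/zip artifact, or None if absent."""
    with zipfile.ZipFile(artifact_path) as archive:
        if MANIFEST_NAME not in archive.namelist():
            return None
        return json.loads(archive.read(MANIFEST_NAME).decode("utf-8"))


def manifest_with_dependencies(repo, rev, dependency_artifacts):
    """Build a manifest whose "includes" nests each dependency's own manifest."""
    includes = []
    for path in dependency_artifacts:
        manifest = read_manifest(path)
        if manifest is not None:  # artifacts without a manifest are skipped
            includes.append(manifest)
    return {"repo": repo, "rev": rev, "includes": includes}
```

Because each dependency manifest already nests its own dependencies, the recursion comes for free: only the direct dependencies need to be scanned.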

Now, to answer the questions listed above, the basic tool is a comparison of two build manifests: recurse through the dependency hierarchy and match up the repo/rev pairs from each build at every level. Where the repository URLs match, the VCS can be queried for the list of commits included in one build but not in the other.
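
The comparison can be sketched in a few lines of Python. This is only one possible shape for it: the function name and the tuple format are hypothetical, and it assumes a repository appears at most once among the "includes" at any given level.

```python
def diff_manifests(old, new):
    """Compare two manifests for the same repo, level by level.

    Returns (repo, old_rev, new_rev) tuples for every repository whose
    revision differs. old_rev or new_rev is None when the repository
    appears in only one of the two builds.
    """
    if old["repo"] != new["repo"]:
        raise ValueError("manifests describe different repositories")

    changes = []
    if old["rev"] != new["rev"]:
        changes.append((old["repo"], old["rev"], new["rev"]))

    # Match dependencies by repository URL at this hierarchy level.
    old_deps = {dep["repo"]: dep for dep in old["includes"]}
    new_deps = {dep["repo"]: dep for dep in new["includes"]}
    for repo in sorted(set(old_deps) | set(new_deps)):
        if repo not in old_deps:
            changes.append((repo, None, new_deps[repo]["rev"]))
        elif repo not in new_deps:
            changes.append((repo, old_deps[repo]["rev"], None))
        else:
            changes.extend(diff_manifests(old_deps[repo], new_deps[repo]))
    return changes
```

Each tuple with two revisions is a candidate for a VCS query; tuples with a None mark dependencies that were added or dropped between the builds.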

A practical problem with this approach is obtaining access to the variety of VCS URLs that may occur in a build manifest. Even if access permissions are granted, one often has to clone the repositories to examine the revision history, and this can become expensive when the repositories are big.

Fortunately, most companies host the upstream repositories for the code they develop on a central host, so it becomes a matter of setting up a diff service of sorts on that same host. This could be a simple CGI script that accepts two build manifests in a PUT or POST, computes the diff between the two by accessing the VCS repositories on the local filesystem, and returns a list of commits included in one but not in the other.
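
The core of such a service, the lookup of commits between two revisions, might be sketched as follows. The sketch assumes git, assumes the repositories live under a root directory (/srv/git here is made up) laid out like the path part of the manifest URL, and leaves out the HTTP handling entirely. All function names are hypothetical, and a real service would need to validate the incoming URLs and revisions before using them.

```python
import posixpath
import subprocess
from urllib.parse import urlparse

# Hypothetical directory holding the hosted repositories on the central host.
REPO_ROOT = "/srv/git"


def local_repo_path(repo_url, root=REPO_ROOT):
    """Map a manifest URL to a repository directory on the local filesystem."""
    return posixpath.join(root, urlparse(repo_url).path.lstrip("/"))


def rev_list_command(git_dir, old_rev, new_rev):
    """Command listing commits reachable from new_rev but not from old_rev."""
    return ["git", "--git-dir=" + git_dir, "rev-list",
            "{}..{}".format(old_rev, new_rev)]


def commits_between(repo_url, old_rev, new_rev, root=REPO_ROOT):
    """Return the commit ids that are in the new build but not in the old one."""
    cmd = rev_list_command(local_repo_path(repo_url, root), old_rev, new_rev)
    return subprocess.check_output(cmd).decode("ascii").split()
```

Swapping old_rev and new_rev gives the commits that are in the old build only, which answers the question in the other direction.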

Now wouldn't it be wonderful if GitHub and Bitbucket had such a service?