Sunday, November 2, 2014

Use Git to Manage Build Artifacts (Part 1)

No, this does not mean "store them in git". Just don't do that, please.

Artifacts belong in an artifact repo, which can be anything from a shared file system to a more sophisticated artifact server such as Sonatype Nexus or Artifactory.

What Are We Solving?

A big challenge in getting fast build turnaround is avoiding the rebuilding of artifacts that have already been built.

Most modern build systems have a way to declare dependencies on prebuilt artifacts. Maven, for example, identifies artifacts via a triplet of group-id, artifact-id and version. Simply listing all these GAV triplets (sometimes expanded via classifiers and extensions) in your Maven POM (Project Object Model) file will cause the build to retrieve and use the desired prebuilt artifacts.
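As an illustration, such a dependency declaration in a POM file might look like this (the coordinates are made up for the example):

```xml
<!-- Hypothetical dependency on a prebuilt artifact, identified by
     its GAV triplet: group-id, artifact-id, version. -->
<dependency>
  <groupId>com.example</groupId>
  <artifactId>widget-core</artifactId>
  <version>2.3.0</version>
</dependency>
```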

The problem with this approach is that in a larger system, developers usually work on multiple pieces at the same time. They need to choose whether those pieces should all live together in the same source tree as "sub-projects", or whether to "publish" the artifacts from one build and then retrieve them in another.

Arguably, you want both. If you don't care about your dependencies, you just want to fly with the prebuilt ones and not worry. If you do care, you want to check them out and integrate your changes seamlessly into the build.

Maven's "solution" to this problem is the "SNAPSHOT" dependency. You publish every artifact using a version string ending in "SNAPSHOT". This marker causes all artifacts published by the same build to be stored under the same "version" in your artifact repository. This version is usually "opaque", that is, you don't know and shouldn't care which snapshot you are getting as long as it's a consistent set.

Sadly, the SNAPSHOT solution has several problems:
  • The definition of a "consistent set" is more happenstance than guarantee. If everything comes from the same build invocation, you get one - if your build requires multiple invocations (for example because you need to build on multiple architectures), it stops working right. In other words, there is no way to collect artifacts built from a single set of sources - you may get whatever happens to get uploaded last.
  • Since artifacts are cached, how the cache is refreshed matters. You cannot rely on using the latest artifacts to resolve your dependencies except if you clear the cache. And even if you get the latest, see above.
  • When you finalize your release, you need to go through all the version strings and remove the SNAPSHOT keyword, rebuild, publish the release, bump the version and then add the SNAPSHOT keywords back. The infamous Maven release plugin will help with that, but in practice this is a lot of churn, and the whole point of using prebuilt artifacts is lost.
  • Snapshots piggy-back on a semantic versioning scheme which implies an ordering (v2.3.1 is newer than v2.3.0), so you have no way to support code branches where you don't know or can't commit to any ordering. For example, if you use Gerrit as your review system, you have no way of knowing the order in which changes will get merged, but you still need to get efficient validation builds done.
Now, in practice, it often works out fine, but wouldn't it be nice if we had a way to really know what we're getting and why? And wouldn't it be nice if we could just stop fiddling with the version strings?

Why Bother with Prebuilt Artifacts?

Besides saving build time, it soothes the nerves of your QA folks. If they know an artifact hasn't changed (and they know this because it is bitwise identical to their previously tested one), and some problem occurred, then they can start by looking at the artifacts that changed instead of having to examine all of them. Obviously, the bug may have been latent in one of the old artifacts and just triggered by a new usage in the newer ones, but even then looking at the trigger makes troubleshooting a lot simpler.

Finally, rebuilding from the same source never guarantees the same outcome, especially if a lot of time has elapsed since the previous build.

So, What Can We Do?

Well, we can retrace some of the history of what made git so successful. We have to give up on "readable" version strings. Git replaced the revision number with a SHA1 hash. We can do the same.

A big part of the git magic is achieved by computing SHA1 sums of the file content and accumulating them over the various subdirectories in your source tree. You can easily retrieve those SHA1 sums using the following command:

    % git rev-parse "$ref:$path"

Here, $ref is any commit reference - either a commit hash, a branch name, a tag, or just HEAD. $path is simply the relative path from the top of the git repo to your file or directory. Note that the whole tree is represented by the empty string, not ".".
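A quick way to see this in action is a throwaway repository. Everything below (directory layout, file contents) is invented purely for illustration:

```shell
#!/bin/sh
# Create a throwaway repo and look at the hashes git has already
# computed for us. Paths and contents are illustrative.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email dev@example.com
git config user.name dev
mkdir src
echo 'int main(void) { return 0; }' > src/main.c
git add .
git commit -qm initial
git rev-parse "HEAD:"            # SHA1 of the whole tree (empty path, not ".")
git rev-parse "HEAD:src"         # SHA1 of the src/ subtree
git rev-parse "HEAD:src/main.c"  # SHA1 of the file's blob
```

Each command prints a 40-character hex hash; the whole-tree hash changes whenever anything in the repo changes, while the subtree hash only changes when something under src/ does.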

Since git computes and recomputes these hashes on every commit, rev-parse simply retrieves precomputed values and is therefore very fast.

The next step is to understand your build process. Unless you have a very small project, most build artifacts are generated from a subset of your git repository. The trick is to know which subset it is, and to record this information someplace.

In some build systems (most of the "sane" ones at least), this information is (or should be) encoded in the build script itself. You should simply be able to extract the file list from it.

Sadly, very few shops use sane build systems.

The next best thing is to at least figure out a rough partitioning of your source tree, mapping whole directories if needed.

Some folks go all the way and split their git repos into independent projects, thereby trading one complexity (figuring out the subtrees) for a different one (figuring out how to manage collections of git repositories consistently), but the principles described here still apply.

Regardless of how you partition your source tree, you can go and list the file and directory locations required to build your artifact. You then get the SHA1s of those locations from git using the command above, sort them and re-hash them into a new SHA1. That is going to be the version of your artifact.
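Putting this together, the version computation might look like the sketch below. The repository layout and the list of locations are invented for the example; in practice the list comes from your mapping of source locations to artifacts:

```shell
#!/bin/sh
# Sketch: derive an artifact version from the git hashes of the
# locations that feed it. Repo layout and locations are made up.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email dev@example.com
git config user.name dev
mkdir -p src/lib include/lib
echo 'lib code'   > src/lib/lib.c
echo 'lib header' > include/lib/lib.h
echo 'all:'       > Makefile
git add .
git commit -qm initial

# The locations this artifact is built from (from your mapping file).
locations="src/lib include/lib Makefile"

# version = SHA1 over the sorted tree/blob hashes of those locations
version=$(for path in $locations; do
            git rev-parse "HEAD:$path"
          done | sort | sha1sum | cut -d' ' -f1)
echo "artifact version: $version"
```

Because the input hashes come straight from git objects, the result is fully deterministic: the same commit and the same location list always yield the same version string.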

This method is fast, and has some significant advantages:
  • Assuming you keep the mapping of source tree locations to artifacts in a file under git, you can compute versions the artifacts would have, given any commit, tag, branch, ref, whatever.
  • The build system can check whether the artifact is already in your artifact repository. If it is, don't rebuild it. If not, build it and publish it to the repository.
  • A release becomes an atomic non-event. Simply apply a git tag, done. Any deploy script simply performs the computation of the version strings, given the tag, and retrieves the appropriate artifacts from the repository.
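The rebuild-avoidance check can be sketched as a guard around the build step. Here the artifact repository is just a local directory, and the version and build step are stubs standing in for the real thing:

```shell
#!/bin/sh
# Sketch: skip the build when the versioned artifact already exists.
# The repo directory, artifact name, version and build are stand-ins.
set -e
repo=$(mktemp -d)                        # stand-in artifact repository
name=mylib
version=cafe1234deadbeef                 # in practice: the computed SHA1
build() { echo 'fake payload' > "$1"; }  # stand-in for the real build

artifact="$repo/$name-$version.tar.gz"
if [ -e "$artifact" ]; then
  echo "reusing prebuilt $name-$version"
else
  build "$artifact"
  echo "built and published $name-$version"
fi

# A second run against the same repository now finds the artifact:
[ -e "$artifact" ] && echo "cache hit for $name-$version"
```

The first run builds and publishes; any subsequent run with the same version string hits the cache and skips the build entirely.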

What About Dependencies?

The short answer is to declare the dependencies in the file which maps source locations to artifacts, and when it comes time to compute the version strings, expand those dependencies to produce the transitive closure of the file locations used to build your artifact.
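A minimal sketch of that expansion, with a made-up two-artifact mapping (the mapping format, artifact names and paths are assumptions for illustration, not any real tool's format):

```shell
#!/bin/sh
# Sketch: expand an artifact's declared dependencies into the
# transitive closure of source locations that feed its version hash.
locations_of() {             # source locations owned by an artifact
  case $1 in
    app) echo "src/app" ;;
    lib) echo "src/lib include/lib" ;;
  esac
}
deps_of() {                  # declared artifact dependencies
  case $1 in
    app) echo "lib" ;;
    lib) echo "" ;;
  esac
}
closure() {                  # all locations needed to build $1
  locations_of "$1"
  for d in $(deps_of "$1"); do
    closure "$d"
  done
}
# The sorted, de-duplicated list that feeds the version hash:
closure app | tr ' ' '\n' | sort -u
# -> include/lib, src/app, src/lib
```

Note that this naive recursion assumes an acyclic dependency graph; a real implementation needs cycle detection.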

How this can be done will be covered in Part 2. See you then!