Storing large blobs of binary data in a revision control system has the advantage of simplicity, but also many drawbacks:
- Revision control systems are optimized for text files and provide features to efficiently diff and merge them. All that machinery is useless for binary blobs.
- Most revision control systems rely on changeset deltas to efficiently store multiple revisions. This doesn't work well for binary blobs, so the storage requirements for your repository go way up.
- Checking out a local copy of the source tree retrieves every binary blob you might conceivably need, most of which you probably won't. Because downloading all those blobs takes so long, people become reluctant to create multiple checkouts as needed, which encourages the bad practice of developing multiple independent changes in the same workspace. It also encourages incremental builds for continuous integration, which pose significant challenges for the build system.
"Version the reference, not the blob."Once you start doing that, another problem crops up: the references change often. If you store the reference in a source file, you will find yourself editing that file very often, or maybe even having the build system edit the file and check it in. Unfortunately, this practice is very common, as various build systems like ant pride themselves in having built-in features to support it.
Having the build system modify source code is a dangerous path. You will need to address how to resolve the inevitable merge conflicts that arise when multiple branches and builds are happening at the same time.
A better way to proceed is to examine how packaging systems solve these problems. A packaging system rarely references a specific package by version. Usually, the reference is of the form "I need package X" or "I need at least version 1.3 of package Y". These dependency references need to be edited only rarely, and usually in conjunction with a code change that affects those requirements - and those are perfectly good reasons to edit a source file and cause merge trouble for others.
The packaging system then attempts to resolve those dependencies using all the repositories it knows about. This is, incidentally, very similar to the way most linkers resolve dependencies. You have two components (a small sketch follows the list):
- What you need (think -l flag);
- Where to find it (think -L flag).
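To make that split concrete, here is a minimal Python sketch of resolving a "what you need" requirement against a list of "where to find it" repositories. The data shapes and names are invented purely for illustration:

```python
# Hypothetical resolver: "what you need" (-l) vs. "where to find it" (-L).
def resolve(requirement, repositories):
    """Return the newest available version satisfying a requirement.

    requirement:  (name, min_version), e.g. ("Y", (1, 3)) for
                  "I need at least version 1.3 of package Y".
    repositories: list of {name: [versions]} mappings, searched in order.
    """
    name, min_version = requirement
    candidates = [v
                  for repo in repositories
                  for v in repo.get(name, [])
                  if v >= min_version]
    if not candidates:
        raise LookupError(f"nothing satisfies {name} >= {min_version}")
    return name, max(candidates)

repos = [
    {"X": [(2, 0)], "Y": [(1, 2), (1, 4)]},   # first repository searched
    {"Y": [(1, 3)]},                          # second repository searched
]
print(resolve(("Y", (1, 3)), repos))          # -> ('Y', (1, 4))
```

Note that the requirement itself never names an exact version; picking the concrete artifact is entirely the resolver's job.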
So, to recap:
- Checking in binary blobs in revision control systems is bad;
- Checking in explicit references into source files is slightly better, but still bad if those references change often (as they will in a live software development project);
- Checking in references expressed as requirements, and using a system that resolves those requirements against a source of knowledge about what is available, is good.
To make that work, you need two things:
- A set of artifact repositories;
- A registry service or indexing service to help you locate the right artifact for your build and track exactly what the artifact consists of.
To identify builds, you need a version number that satisfies two conditions:
- It must map to a specific feature set (i.e. a source code state);
- It must increase monotonically over time (i.e. a higher number means a newer version).
The typical scheme looks like this:

major.minor.patch.buildnr

The trick is to let humans worry about the major.minor.patch bikeshed, but let the build number be generated automatically. As long as the build number by itself fulfills our two conditions above, we're good.
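As a sketch of that division of labor, here is what stitching the two halves together could look like. The registry URL and its plain-text response are assumptions modeled on the service described below, not a real API:

```python
# Sketch: humans maintain major.minor.patch; the build number is fetched
# from a (hypothetical) registry that maps changeset hashes to numbers.
import urllib.request

def full_version(human_version, changeset_hash,
                 registry="http://someplace.com/ars"):
    with urllib.request.urlopen(f"{registry}/changeset/{changeset_hash}") as r:
        build_nr = int(r.read().decode())  # assumed plain-text numeric reply
    return f"{human_version}.{build_nr}"

# e.g. full_version("1.4.2", "9fceb02...") might yield "1.4.2.1234"
```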
Building a service to do that part is quite simple, as all revision control systems have some method for identifying a source code state:
- svn and perforce have changeset numbers that already display good monotonic behavior;
- git and mercurial have changeset hashes, which would need to be converted to numbers by our registry service;
- Any system will at minimum have a unique id for every file revision, and a changeset can be defined as the hash constructed from the sorted list of all the file revisions touched by that change (a sketch follows the list). Pass that hash to the registry service and obtain the number.
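Here is a minimal sketch of that last fallback; the per-file revision-id format is made up for illustration:

```python
# Define a changeset id as a hash over the sorted file revisions it touched.
import hashlib

def changeset_hash(file_revisions):
    """file_revisions: unique per-file revision ids, e.g. "src/main.c@17"."""
    h = hashlib.sha1()
    for rev in sorted(file_revisions):  # sorted: input order must not matter
        h.update(rev.encode("utf-8"))
        h.update(b"\n")  # separator, so ["ab", "c"] != ["a", "bc"]
    return h.hexdigest()

print(changeset_hash(["src/util.c@9", "src/main.c@17"]))
```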
The service interface can be as simple as two URLs:

http://<someplace.com>/ars/changeset/<hash> returning a number, either the number already assigned to the given hash, or a new number if that hash is not known to the system, and:

http://<someplace.com>/ars/build/<nr> delivering back the hash associated with the build.
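A toy, in-memory version of those two calls fits in a page of Python; persistence, locking, and real error handling are deliberately left out:

```python
# Minimal artifact-registry sketch: maps changeset hashes to build numbers
# and back. In-memory only; a real service would persist these mappings.
from http.server import BaseHTTPRequestHandler, HTTPServer

hash_to_nr = {}  # changeset hash -> build number
nr_to_hash = {}  # build number   -> changeset hash

class Registry(BaseHTTPRequestHandler):
    def do_GET(self):
        parts = self.path.strip("/").split("/")  # e.g. ["ars", "changeset", h]
        if len(parts) == 3 and parts[:2] == ["ars", "changeset"]:
            h = parts[2]
            if h not in hash_to_nr:          # unknown hash: assign next number
                nr = len(hash_to_nr) + 1
                hash_to_nr[h], nr_to_hash[nr] = nr, h
            body = str(hash_to_nr[h])
        elif len(parts) == 3 and parts[:2] == ["ars", "build"] and parts[2].isdigit():
            body = nr_to_hash.get(int(parts[2]), "")
        else:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

HTTPServer(("", 8080), Registry).serve_forever()
```

Because numbers are handed out in arrival order, they automatically satisfy the monotonicity condition.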
With a simple service like this, you can already achieve a lot of things. Most importantly, that number can be used to index into an artifact repository, and since that number is monotonically increasing, it is simple to get the latest build of an artifact.
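For instance, if the artifact repository is keyed by build number (names and paths hypothetical), "latest" is just a max():

```python
# Builds available for an artifact, e.g. directory names in the repository.
builds_available = [1229, 1230, 1234]
latest = max(builds_available)  # monotonic numbers make "latest" trivial
url = f"http://artifacts.someplace.com/frobnicator/{latest}/frobnicator.tar.gz"
```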
But, as we will see, once we have this service, it becomes the natural storage location for much of the metadata associated with a build:
- artifact name
- artifact repository location(s)
- platforms and variants built
- changesets included in that build
- dependencies included in that build
- source code repository location(s) used in the build
- ...
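Concretely, the record stored per build number might look something like this; the field names mirror the list above and the values are invented:

```python
# Hypothetical per-build metadata record kept by the registry service.
build_record = {
    "build_nr": 1234,
    "changeset": "9fceb02d...",                  # hash the number maps to
    "artifact_name": "frobnicator",
    "artifact_repositories": ["http://artifacts.someplace.com/frobnicator"],
    "platforms_and_variants": ["linux-x86_64/release", "win64/debug"],
    "changesets_included": ["9fceb02d...", "1a410efb..."],
    "dependencies": [{"name": "Y", "build_nr": 987}],
    "source_repositories": ["http://svn.someplace.com/frobnicator/trunk"],
}
```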
Another reason you don't want to store derived objects in the source tree is that you then have a high risk of inconsistent states. Leaving aside all the issues around merging the results of parallel builds, simply the fact that people have to perform a build prior to checking in is a huge burden (conversely, having to distinguish between "dirty checkins" and "good checkins" is also very hard - think multi-platform build scenarios).
It's better to store your derived objects separately and track their lineage.