Saturday, June 1, 2013

Mission Creep

I just discussed a simple perishable lock service, and of course the usual thing happens: mission creep.

It turns out QA wants to enforce a "code freeze", i.e. no more deploys to certain test environments where QA is doing their thing.

At first glance, besides being a reasonable request, it is amazingly easy to add to the code. Instead of setting the expiration date to "now", we set it to some user-specified time, possibly way into the future.
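To make that concrete, here is a minimal sketch of what such a lock table might look like. It is not the actual service code: the class name, method names, and the idea of passing a time-to-live in seconds are all made up for illustration, and the state lives purely in memory, as assumed in the rest of this post.

```python
import time


class PerishableLocks:
    """Hypothetical in-memory lock table; nothing is persisted."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so the sketch can be tested
        self._locks = {}     # resource name -> (owner, expiry timestamp)

    def acquire(self, resource, owner, ttl_seconds):
        """Grant or renew the lock unless someone else holds an unexpired one."""
        now = self._clock()
        held = self._locks.get(resource)
        if held is not None and held[1] > now and held[0] != owner:
            return False
        # The "amazingly easy" change: the caller picks the expiration time.
        self._locks[resource] = (owner, now + ttl_seconds)
        return True

    def holder(self, resource):
        """Return the current owner, or None if the lock has perished."""
        held = self._locks.get(resource)
        if held is None or held[1] <= self._clock():
            return None
        return held[0]
```

A build would ask for a minute or so; a QA freeze would call the very same `acquire` with a TTL of two weeks. Nothing in the code changes, only the number.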

But we are planting the seeds of doom....

Remember the initial assumption behind "perishable" locks: they are, well, perishable.

In other words, we assume that the lock requester (usually a build process) is the weak link in the chain, and is more likely to die than the lock manager. And since ongoing builds essentially act as a watchdog for the lock manager (i.e. if builds fail because the lock manager crashed, folks will most certainly let me know about it), I can be relatively lax and skimp on things like persisting the queue state. If the lock manager crashes, so what: some builds will fail, somebody is going to complain, I fix and restart the service, done.

But now, suddenly, QA will start placing locks with an expiration date far into the future. If the lock manager crashes, it's not obvious anyone will notice immediately. Even if someone notices, it's not clear they will know what state it was in prior to the crash, so there is a real risk of invalidating weeks of QA work.

So, what am I to do?
  • Ignore the problem. Seriously: there is a risk balance that can, in theory, be computed, weighing the odds of a lock manager crash (which increase, btw, if you add complexity) against the cost of the QA work lost.
  • Implement persistence of the state (which suddenly adds complexity and increases the probability of failure - simplest example being: "out of disk space")
  • Pretend QA is just another build, and maintain a keep-alive process someplace.
The last option appears attractive, as it will not require modifying the lock service itself, but truth be told, it is simply moving the complexity of the problem elsewhere. Now you need to worry about keeping the keep-alive process alive... fun fun...
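For what it's worth, the keep-alive option might be sketched as below. This is only an illustration under my own assumptions: `renew` stands for whatever call re-requests the short-lived lock (hypothetical, it depends on how the lock service is exposed), and the clock and sleep functions are passed in just so the loop can be exercised without waiting.

```python
import time


def keep_alive(renew, freeze_until, interval_seconds,
               clock=time.time, sleep=time.sleep):
    """Re-request a short-lived lock until freeze_until (a timestamp).

    renew is a hypothetical callable returning True when the lock
    service granted the renewal. Returns the number of renewals made.
    """
    renewals = 0
    while clock() < freeze_until:
        if not renew():
            # Someone else grabbed the lock, or the service is gone.
            raise RuntimeError("lock renewal refused")
        renewals += 1
        sleep(interval_seconds)
    return renewals
```

Note that the loop itself is trivial; the hard part is everything around it. If this process dies, the lock quietly perishes after one interval, which is exactly the "who keeps the keep-alive alive" problem.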

So what would you do?