Monday, March 19, 2012

Good Architecture (Part II)

I found this article by Jay Kreps (blog added to the blog roll on the right) very interesting. It goes to the heart of whether "highly scalable, distributed, fault-tolerant systems" really are more reliable than classic monolithic systems.

Granted, if you are building a high-volume site, you might have no choice, but be honest and analyze whether you really are building that high-volume site. There is something to be said for software that has been in production for decades - it might not be cool, but it is certainly well tested and well known...

Oldie but Goodie

So true... see here.


Sunday, March 4, 2012

Slow Rate of Postings

A lot of changes are occurring in my work life, so I've been quite busy. Apologies for the long pauses, but unfortunately posting will remain erratic over the next few weeks.

Release Management Depends on Good Architecture

This is a trivial observation, one would think - but it is amazing how many systems to this day are built without any consideration of how they will be operated.

I heard rumors of a company where the developers had to carry pagers and respond to operational emergencies. To make matters more interesting, the head of operations had a habit of randomly turning off machines in the data center. This apparently did lead to developers writing more robust code so they could get some sleep.

This is a good moment to take a peek at an older USENIX paper on Crash-Only Software. Anyone designing software that runs as a service these days really needs to read and apply these principles. There isn't any excuse in this day and age for doing anything else:
  • DNS infrastructure now supports looking up hosts by service (SRV records), so no client or peer service should have to be configured with explicit host names.
  • Hardware load balancers will not only shape traffic, but also deal with outages.
  • State and persistence are concentrated in backend data services, where well-tested redundancy and replication mechanisms exist.
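
The first bullet can be illustrated with a small example. This is only a minimal sketch, assuming the SRV answers have already been fetched by some resolver. The record values, host names and the pick_target helper are hypothetical. It shows how a client might choose a target by priority and weight, roughly in the spirit of SRV records, instead of carrying a configured host name:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    """One SRV answer: lower priority wins, weight splits load within a priority."""
    priority: int
    weight: int
    port: int
    target: str


def pick_target(records, rng=random):
    """Pick one record from the lowest-numbered priority group, weighted by `weight`."""
    if not records:
        raise LookupError("no SRV records for this service")
    best = min(r.priority for r in records)
    group = [r for r in records if r.priority == best]
    weights = [r.weight for r in group]
    if sum(weights) == 0:
        # All weights are zero: fall back to a uniform choice.
        return rng.choice(group)
    return rng.choices(group, weights=weights, k=1)[0]


# Hypothetical answers for a lookup such as _api._tcp.example.internal
records = [
    SrvRecord(priority=10, weight=60, port=8080, target="app1.example.internal"),
    SrvRecord(priority=10, weight=40, port=8080, target="app2.example.internal"),
    SrvRecord(priority=20, weight=0, port=8080, target="standby.example.internal"),
]

chosen = pick_target(records)
print(f"connecting to {chosen.target}:{chosen.port}")
```

The standby host is only chosen once the priority-10 records disappear from the answer, so taking a machine out of service becomes a DNS change rather than a client configuration change.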
The primary implication for release management is:
An upgrade is like an outage.
Simply knock out a small portion of your services, upgrade them, restart them and let the load balancer do its job.
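
As a rough illustration, the loop below simulates such a rolling upgrade. It is a sketch under stated assumptions: the Node class, the in_rotation flag and the rolling_upgrade function are hypothetical stand-ins for whatever your load balancer and install tooling actually provide.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    in_rotation: bool = True


def rolling_upgrade(nodes, new_version, batch_size=1):
    """Upgrade nodes in small batches; return the fewest nodes left serving at any point."""
    min_serving = len(nodes)
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        # 1. Knock the batch out: to the load balancer this looks like an outage.
        for node in batch:
            node.in_rotation = False
        min_serving = min(min_serving, sum(n.in_rotation for n in nodes))
        # 2. Upgrade and restart (stand-in for the real install step).
        for node in batch:
            node.version = new_version
        # 3. Put the batch back; the load balancer resumes sending it traffic.
        for node in batch:
            node.in_rotation = True
    return min_serving


fleet = [Node(f"web{i}", "1.0") for i in range(1, 5)]
lowest = rolling_upgrade(fleet, "1.1", batch_size=1)
print(lowest, [n.version for n in fleet])
```

With four nodes and a batch size of one, at least three nodes are serving at every point during the upgrade. The batch size is the knob that trades upgrade speed against spare capacity.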

As a consequence, any new version must be fully compatible with the previous version, both in the requests it accepts and in the requests it emits. This means:
  • Modifying an API usually takes two releases: (1) add the new functionality and deploy it, and only after all services are upgraded (2) remove the obsolete functionality.
  • Modifying database schemas often takes two releases: (1) add new tables, columns and relations, and slowly start converting or moving the data - during this phase the code needs to be able to deal with data in both the new and the old location - and only then (2) drop obsolete columns or tables.
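
The schema case can be sketched with an in-memory SQLite database. This is a minimal example under assumptions: the users and user_emails tables and the get_email and migrate_batch helpers are hypothetical, chosen only to show code that reads from both the new and the old location while data is moved gradually.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Schema as deployed before the change (hypothetical).
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.executemany("INSERT INTO users (id, email) VALUES (?, ?)",
               [(1, "a@example.com"), (2, "b@example.com")])

# Release 1, step 1: add the new location; nothing is removed yet.
db.execute("CREATE TABLE user_emails (user_id INTEGER PRIMARY KEY, email TEXT)")


def get_email(user_id):
    """Release 1 reader: prefer the new location, fall back to the old one."""
    row = db.execute("SELECT email FROM user_emails WHERE user_id = ?",
                     (user_id,)).fetchone()
    if row is None:
        row = db.execute("SELECT email FROM users WHERE id = ?",
                         (user_id,)).fetchone()
    return row[0] if row else None


def migrate_batch(limit):
    """Release 1, step 2: copy a few unmigrated rows at a time; return rows copied."""
    cur = db.execute(
        "INSERT INTO user_emails (user_id, email) "
        "SELECT id, email FROM users "
        "WHERE id NOT IN (SELECT user_id FROM user_emails) "
        "ORDER BY id LIMIT ?", (limit,))
    db.commit()
    return cur.rowcount


before = get_email(2)          # still served from the old column
while migrate_batch(limit=1):  # convert slowly, one row per pass
    pass
after = get_email(2)           # now served from the new table
print(before == after)

# Release 2 (only once every service runs release 1 code and the copy is done)
# would drop users.email; it is deliberately not executed here.
```

Readers see the same answer before, during and after the copy, which is what lets the old and new versions run side by side during the upgrade.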
Our job as release managers is to track the feature set and the testing performed, to ensure that all API and schema changes follow this two-release pattern and won't endanger the currently deployed system. If that's done, deploys become complete non-events.