I heard rumors of a company where the developers had to carry pagers and respond to operational emergencies, a duty made more pressing by the head of operations' habit of randomly turning off machines in the data center. This apparently did lead the developers to write more robust code so they could get some sleep.
This is a good moment to take a peek at an older USENIX paper on Crash-Only Software. Anyone designing software that runs as a service these days really needs to read it and apply its principles. There isn't any excuse in this day and age for doing anything else:
- DNS infrastructure now supports looking up hosts by service, so no client or peer service should have to be configured with explicit host names.
- Hardware load balancers will not only shape traffic, but also deal with outages.
- State and persistence are concentrated in backend data services, where well-tested redundancy and replication mechanisms exist.
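To make the first point concrete: I am assuming "looking up hosts by service" means DNS SRV records, which carry a priority, weight, port and target for each host offering a service. The sketch below does no actual DNS query (that needs a resolver library); it only shows how a client might pick an endpoint from records it has already fetched. The record values and the `pick_endpoint` helper are made up for illustration, and the weighting is a simplification of what RFC 2782 describes.

```python
import random
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int  # lower value is preferred
    weight: int    # relative share of traffic within one priority
    port: int
    target: str


def pick_endpoint(records, rng=random):
    """Return (host, port) from SRV-style records.

    Only the lowest-priority group is considered; within that group the
    weight biases a random choice. Simplified compared to RFC 2782.
    """
    if not records:
        raise LookupError("no records found for service")
    best = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == best]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        chosen = rng.choice(candidates)
    else:
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return chosen.target, chosen.port


# Hypothetical answer for a name like _orders._tcp.example.internal
records = [
    SrvRecord(10, 60, 8080, "orders-a.example.internal"),
    SrvRecord(10, 40, 8080, "orders-b.example.internal"),
    SrvRecord(20, 100, 8080, "orders-backup.example.internal"),
]
host, port = pick_endpoint(records)
```

The client is configured with a service name only; which hosts sit behind it can change without touching the client.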
An upgrade is like an outage. Simply knock out a small portion of your services, upgrade them, restart them, and let the load balancer do its job.
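As a rough sketch of that procedure, here is a toy simulation of a rolling upgrade. Nothing in it talks to a real load balancer: the `Instance` class, the `in_rotation` flag and the `rolling_upgrade` function are invented for the example, and "upgrade and restart" is reduced to changing a version string.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    version: str
    in_rotation: bool = True  # does the load balancer send traffic here?


def rolling_upgrade(pool, new_version, batch_size=1):
    """Upgrade the pool one batch at a time.

    Returns, for each batch, how many instances were still serving
    while that batch was down.
    """
    serving_during_batch = []
    for start in range(0, len(pool), batch_size):
        batch = pool[start:start + batch_size]
        # Knock the batch out; to the load balancer this is an outage.
        for inst in batch:
            inst.in_rotation = False
        serving_during_batch.append(sum(1 for i in pool if i.in_rotation))
        # Upgrade and restart, after which traffic returns.
        for inst in batch:
            inst.version = new_version
            inst.in_rotation = True
    return serving_during_batch


pool = [Instance(f"web{n}", "1.0") for n in range(4)]
serving = rolling_upgrade(pool, "1.1", batch_size=1)
print(serving)  # [3, 3, 3, 3]
```

Note that while the loop is running, old and new versions serve traffic side by side.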
As a consequence, any new version must be fully compatible with the previous version, both in the requests it accepts and in the requests it emits. This means:
- Modifying an API usually takes two releases: (1) add the new functionality and deploy, and only after all services are upgraded (2) remove the obsolete functionality.
- Modifying database schemas often takes two releases: (1) add new tables and relations and slowly start converting or moving the data, during which the code must handle data in both the new and the old location, and only then (2) drop obsolete columns or tables.
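The schema case can be sketched with an in-memory SQLite database. Everything here is hypothetical (the `legacy_contacts` and `user_emails` tables, the function names); the point is only the shape: release 1 adds the new table and reads from both places, and release 2 drops the old table once nothing depends on it.

```python
import sqlite3


def setup_old_schema(conn):
    """The schema as it looks before the change."""
    conn.execute("CREATE TABLE legacy_contacts (user_id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO legacy_contacts VALUES (1, 'ann@example.com')")
    conn.execute("INSERT INTO legacy_contacts VALUES (2, 'bob@example.com')")


def migrate_release_1(conn):
    """Release 1: add the new table only; nothing is dropped yet."""
    conn.execute("CREATE TABLE user_emails (user_id INTEGER PRIMARY KEY, email TEXT NOT NULL)")


def get_email(conn, user_id):
    """Release 1 reader: prefer the new location, fall back to the old one."""
    row = conn.execute(
        "SELECT email FROM user_emails WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is not None:
        return row[0]
    row = conn.execute(
        "SELECT email FROM legacy_contacts WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row is not None else None


def convert_one(conn, user_id):
    """Copy one user's email to the new table (the slow, gradual part)."""
    conn.execute(
        "INSERT OR IGNORE INTO user_emails (user_id, email) "
        "SELECT user_id, email FROM legacy_contacts "
        "WHERE user_id = ? AND email IS NOT NULL",
        (user_id,),
    )


def migrate_release_2(conn):
    """Release 2: drop the obsolete table once all data has been converted."""
    conn.execute("DROP TABLE legacy_contacts")


conn = sqlite3.connect(":memory:")
setup_old_schema(conn)
migrate_release_1(conn)
convert_one(conn, 1)            # user 1 converted, user 2 not yet
emails_mid_migration = [get_email(conn, 1), get_email(conn, 2)]
```

The release 2 code would read from `user_emails` only; the fallback in `get_email` exists just for the period in which both locations hold data.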