How to Bake Software

Breaking production is a rite of passage for software developers - except the ones who work on critical systems.


Many consider breaking production to be a rite of passage for software engineers. I like to think this is true for teams developing a travel booking app or maybe a cooking recipe generator startup. But for the realm of software where failures result in more than inconvenience - this maxim hardly applies.

In this critical domain of software engineering, the leash is shorter because the impact of a mistake can be global. From a selfish interest standpoint - software problems can permanently damage the reputation of the vendor. From a societal good standpoint - malfunctioning critical systems can wreak havoc. No matter your motivations - availability is the top priority.

There is a problem though - the same humans, with the same intellects, and the same flaws, work across all of these realms. We hope that the high stakes will encourage extra effort in ensuring changes are safe - but this has its limits. It's impossible to be fully confident, and the empiricism of production tends to be much better at finding issues than our tired analysis.

As a result, many protections exist that make breaking production less likely, or at the very least, less impactful. One of the core principles is deployment strategy. Changes are deployed in waves, starting with a small, low-impact subset of the service and slowly increasing the scope of the deployment as confidence in the change grows.

Promotions from wave to wave are done after waiting a period of time - the bake period - to make sure the system is stable with the new change. It's easy to take this bake period for granted; after all, it's one of many protective mechanisms. However, it's worth inspecting this simple but effective technique to understand exactly what it's protecting us against.
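To make the mechanics concrete, here is a minimal sketch of wave-based deployment with bake periods. The wave definitions are illustrative, and `deploy`, `alarms_firing`, and `rollback` are hypothetical stand-ins for whatever your deployment system actually provides:

```python
import time

# Hypothetical waves: each scope is a slice of the fleet, each bake time
# is how long we watch the new change before promoting to the next wave.
WAVES = [
    {"name": "one-box",   "scope": 0.01, "bake_seconds": 12 * 3600},
    {"name": "small",     "scope": 0.10, "bake_seconds": 6 * 3600},
    {"name": "remainder", "scope": 1.00, "bake_seconds": 1 * 3600},
]

def deploy_in_waves(change, deploy, alarms_firing, rollback):
    """Deploy `change` wave by wave, baking between promotions."""
    for wave in WAVES:
        deploy(change, scope=wave["scope"])
        deadline = time.time() + wave["bake_seconds"]
        while time.time() < deadline:
            if alarms_firing(scope=wave["scope"]):
                # Something regressed during the bake: stop and roll back
                # before the change reaches the rest of the fleet.
                rollback(change)
                return False
            time.sleep(60)  # poll monitoring once a minute
    return True
```

The important part is the gate between waves: promotion only happens if monitoring stays quiet for the entire bake window.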


While working on a system with global impact, I once volunteered to fix a lingering issue in an auxiliary pipeline that performed some periodic clean-up tasks. I followed the code-build-test workflow, got the change reviewed, and shipped the fix. A day passed, some early feedback showed the fix to be effective, and I proudly announced it to my team.

Within a day or so, several stages across multiple pipeline waves were in alarm - ouch. The cause was of course the fix I had merged, and so like any courteous production mangler I helped the on-call roll back my code.

But how was this possible? The unit tests passed, the integration tests passed, the test runs triggered on each deployment passed, and we had one-hour bake times in each wave - so how could this bad change have been so widespread?

Well, as it turns out, the fix that I had rolled out contained a bug that manifested after a few hours of runtime. The specifics are not particularly important; what is, though, is that this type of bug is not a rarity. Whether the bug lives on a seldom-trodden code path, requires a periodic activity to trigger, or is simply a time bomb waiting for a certain epoch second to explode - there are endless ways to imagine code that appears functional at first, only to ring our pagers when it's least convenient. These latent bugs fall into a few distinct categories:

The Periodic

Imagine a service that nearly saturates at peak business hours every day. If you deploy a performance degradation at 3am, during off-peak hours, it will take hours before your service, now pushed past the brink, experiences impact.

To avoid this category, we need to identify the periodic aspects of our service. Usually this means load periodicity, and if the system is human-facing the cycle will typically be circadian - but this is not always the case. No matter what, the change needs to bake through the relevant peaks and troughs of activity before being promoted. An easy way to accomplish this is to set the bake time to a duration long enough to catch these peaks - a common pattern is deploying in the morning and baking for 12 hours.
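As a rough sketch of that "bake through the peak" rule, the helper below computes a bake duration that covers the next daily load peak. The 17:00 peak hour and two-hour margin are illustrative assumptions, not values from any real system:

```python
from datetime import datetime, timedelta

def bake_until_past_peak(deploy_time: datetime, peak_hour: int = 17,
                         margin_hours: int = 2) -> timedelta:
    """Return a bake duration long enough to observe the next daily peak.

    Assumes a single daily load peak at `peak_hour` local time; the margin
    keeps us watching a little past the peak itself.
    """
    peak = deploy_time.replace(hour=peak_hour, minute=0, second=0, microsecond=0)
    if peak <= deploy_time:
        peak += timedelta(days=1)  # today's peak already passed, wait for tomorrow's
    return (peak - deploy_time) + timedelta(hours=margin_hours)

# Deploying at 09:00 yields a 10-hour bake; deploying at 03:00 yields 16 hours,
# so a regression shipped off-peak still sees a full peak before promotion.
print(bake_until_past_peak(datetime(2024, 1, 8, 9, 0)))   # 10:00:00
print(bake_until_past_peak(datetime(2024, 1, 8, 3, 0)))   # 16:00:00
```

With a later assumed peak or a larger margin, a morning deployment lands on the 12-hour figure mentioned above.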

The Rare

Perhaps the change only impacts an obscure code path that's some legacy holdover from a few engineer-generations ago. It's likely you aren't even aware that this code path is impacted by your change. It could be days, weeks, or even months before the stars align and the bug is triggered.

A comprehensive canary test suite is the primary defense against this category.
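As an illustration of the idea (the client methods and metric names here are entirely hypothetical), a canary suite can be as simple as a loop that deliberately exercises the paths ordinary traffic rarely touches, so a regression there trips alarms during the bake rather than weeks later:

```python
import time

# Hypothetical scenarios covering rarely exercised paths.
CANARY_SCENARIOS = [
    ("expired_credentials_refresh", lambda client: client.refresh_expired_token()),
    ("legacy_v1_export",            lambda client: client.export(format="v1")),
    ("empty_batch_cleanup",         lambda client: client.cleanup(batch=[])),
]

def run_canaries(client, emit_metric, interval_seconds=300):
    """Continuously exercise rare code paths and report the results."""
    while True:
        for name, scenario in CANARY_SCENARIOS:
            try:
                scenario(client)
                emit_metric(f"canary.{name}.success", 1)
            except Exception:
                # A failing canary should trip the same alarms the bake watches.
                emit_metric(f"canary.{name}.failure", 1)
        time.sleep(interval_seconds)
```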

The Timed

It would be lovely to imagine every system as stateless. Stateless systems are time-invariant - if it works at t=0, it will work at t=infinity. Unfortunately, real systems seldom work this way, and it's possible your code has a side effect that takes some time to manifest as a bug.

You can hope that the bake time set to catch periodic bugs is enough to catch this case. Otherwise, sometimes we can analyze the code change to deduce whether it might fit this category - and from there an appropriate bake time is typically obvious. I once rolled out a change that modified how a timestamp was updated. A component used this timing field to perform some periodic cleanup if the timestamp was older than 10 days, and as a result, it technically took 10 days of bake time to be fully confident in the deployment.
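A toy reconstruction of that scenario (the field name and the 10-day threshold are illustrative) shows why this class of bug stays invisible until the clock runs out:

```python
from datetime import datetime, timedelta

CLEANUP_THRESHOLD = timedelta(days=10)

def should_clean_up(record: dict, now: datetime) -> bool:
    """Periodic cleanup deletes records whose timestamp is older than 10 days."""
    return now - record["last_touched"] > CLEANUP_THRESHOLD

# Suppose the deployed change stopped refreshing `last_touched` on a code path
# that used to refresh it. Nothing misbehaves on day 1, or even day 8:
record = {"id": 42, "last_touched": datetime(2024, 1, 1)}
print(should_clean_up(record, datetime(2024, 1, 9)))   # False: looks healthy
# Over ten days later, the cleanup job starts deleting live records, and the
# latent bug finally fires, hence the roughly 10 days of bake time needed.
print(should_clean_up(record, datetime(2024, 1, 12)))  # True: bug manifests
```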


Alright, so you want to minimize risk, you've come to terms with your non-omniscience, and you are ready to add bake times. The reductionist may set ludicrously long wait periods, but of course this undermines the very reason you have bake times in the first place - you need to keep making changes.

Making changes too slowly is costly - and in a lot of ways, it increases risk. I've seen slow pipelines stress deadlines, which in turn necessitates hasty rollout plans that may or may not respect the deployment bake times. At the very least, an expedited project drastically increases the risk that mistakes are made during implementation.

I've been in the unfortunate (yet inevitable) position of needing to make design changes late in the lifetime of a project. Because of the pipeline lead time - a function of bake times - there was an underlying tension: we knew there would be no opportunity to make further changes without delaying the project considerably. Inevitably, the requirements shifted and we needed a series of additional changes, and the pipeline bake times were largely bypassed - albeit with measured risk analysis - to meet the deadline.

As a result, it's important to be reasonable with bake times. Perhaps a long one in the first deployment phase, but after that - be very deliberate about what type of risk you are trying to mitigate. Much like security measures - if they are too inconvenient, they will end up being circumvented.


None of this matters if you don't have the proper monitoring infrastructure in place to detect the misadventure, and the ability to quickly roll back the problematic change. Bake times are one component in a multi-pronged strategy to ensure highly available systems stay that way. Future stories will talk about the other prongs of this "pitchfork".
