Adventures in Engineering

Releasing software during the holidays

And so we reach the end of another coding-year, and while many of us are plotting our holiday hacks and offline AR (actual reality) adventures, there are a few folks who still have to launch features over the break.

Please don’t.

Ok, so you’re charging ahead anyway. Without your whole team on board, this is a period of heightened risk, and it’s important that you approach production changes with an appropriate level of caution and care.

Check with suppliers

This is probably the most common cause of panic I’ve experienced both as an engineer and as one of those suppliers.

If you’re launching something while your team is on holiday, chances are your partners, suppliers and 3rd parties are in a similar situation. If you haven’t already, reach out to them and find out exactly what their support commitments and availability are. Know the escalation paths available as well, just in case.

Often suppliers (our team included) have alternative support arrangements during holiday periods; not knowing this will slow down your ability to get help should you need it.

Are other teams informed?

Unless you’re a solo team (chances are you’re not), do the other parts of the business know that changes may be happening? If you’re launching a new feature, have you let the social, support and press teams know? If it goes wrong, they’re bound to be the first responders.

Do you know how to escalate incidents to other teams? If you depend on an internal infrastructure team can you raise an incident? If you consume an API, do you know how to contact the team managing it? While your changes might not be to their code you’re still one part of a complex technical ecosystem, so play nice.

If, for example, your new feature calls an internal API twice as often, initial tests might suggest their service can handle the extra load you’re expecting of it. Internally, however, their queue might be backing up, logs might be filling up disks, and other unexpected capacity issues might come into play. Please don’t be the team that causes someone else’s pager to go off.

How reliable and reproducible is your build system?

This speaks to two key indicators of a good build pipeline – test coverage and having reproducible builds.

In the first instance, knowing your test processes (automated and exploratory) goes a long way to building confidence in holiday release cycles. It’s not just a raw metric for test coverage, but also the smoke testing, fuzzing etc. that contribute to having a reliable build that you trust.

A reproducible build, meanwhile, gives you confidence that you have a known, repeatable checkpoint in your code.
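One rough way to check reproducibility is to build the same commit twice and compare the artifacts byte for byte. Here is a minimal Python sketch of that comparison; the function names are my own invention for illustration, and it assumes your build produces a single artifact file per run.

```python
import hashlib
from pathlib import Path


def artifact_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a build artifact, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(first: Path, second: Path) -> bool:
    """True when two independently built artifacts are byte-for-byte identical."""
    return artifact_digest(first) == artifact_digest(second)
```

If two builds of the same commit don’t match, something (timestamps, unpinned dependencies, build-machine state) is leaking into the output, and that is worth knowing before the break rather than during it.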

In the end, your build system is your safety net for making reliable software regardless of the increased end-of-year risk. If your builds are flaky, my advice is not to risk it.

Are you making additive only changes to databases?

It might seem small but databases are complex, and anyone who tells you otherwise has clearly never had a data migration cause trouble in production… especially when the experts in your team are on a 14-hour flight over the Pacific.

Additive-only schema changes are a fantastic way to ensure that changes for your new feature cannot impact existing code dependent on a given schema.

By only ever rolling forward on your database you’ll be confident that your always-new updates to the schema are incredibly unlikely to cause issues with other features.
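To make that concrete, here is a small sketch using Python’s built-in sqlite3 module with an in-memory database. The orders table and gift_message column are made up for illustration; the point is only that a query written before the migration returns the same shape of result after it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL)"
)
conn.execute("INSERT INTO orders (total_cents) VALUES (1250)")


def order_totals(connection):
    """Existing feature: written before the migration, names its columns explicitly."""
    return connection.execute(
        "SELECT id, total_cents FROM orders ORDER BY id"
    ).fetchall()


before = order_totals(conn)

# Additive-only migration: a new nullable column. Nothing is renamed,
# retyped or dropped, so existing rows and existing queries are untouched.
conn.execute("ALTER TABLE orders ADD COLUMN gift_message TEXT")

# New feature: the only code that knows about the new column.
conn.execute(
    "INSERT INTO orders (total_cents, gift_message) VALUES (?, ?)",
    (900, "Happy holidays"),
)

after = order_totals(conn)
```

Note that the old query names its columns rather than using SELECT *, which is part of what keeps it insulated from the new column.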

Is rollback part of standard practice?

I don’t just mean backups and restoration. The maturity of your change management is a crucial factor in knowing the risk and impact of having to roll back.

Should you require a backup restoration, how old is that backup? Who can retrieve it and who knows how to respond? If your colleagues are also on leave, does your reduced-capacity team have the on-hand knowledge to confidently apply this change, and roll back without issue?

Ideally, rollback should be automated (even if it is manually triggered), but making it part of standard practice to know how to roll back and recover from a bad release is crucial for release confidence.
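As a toy illustration of “automated but manually triggered”, here is a Python sketch of the bookkeeping side: a record of what has been deployed, so whoever is on call never has to guess the rollback target. The ReleaseHistory class and the version strings are hypothetical, and a real pipeline would also need to redeploy the artifact itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReleaseHistory:
    """Tracks deployed versions so the rollback target is always known."""

    deployed: List[str] = field(default_factory=list)

    def deploy(self, version: str) -> str:
        """Record a new release as the current one."""
        self.deployed.append(version)
        return version

    def current(self) -> Optional[str]:
        """Return the live version, or None if nothing has been deployed."""
        return self.deployed[-1] if self.deployed else None

    def rollback(self) -> str:
        """Drop the current release and return the one that is live again."""
        if len(self.deployed) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self.deployed.pop()
        return self.deployed[-1]


history = ReleaseHistory()
history.deploy("1.4.0")
history.deploy("1.5.0")
restored = history.rollback()  # a person decides to trigger; the steps are scripted
```

The design choice worth copying is that rollback refuses to run when there is nothing to go back to, rather than leaving production in an unknown state.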

I know very few teams that can release like Homer: “build goes forward, build goes back, build goes forward, build goes back”.

Can you control customer experience?

Chances are you’ll have a success metric riding on this important holiday work so separating the client experience from your zero-downtime-deploy is only logical.

When you do finally deploy code to a production environment, is it immediately visible to customers or do you have a feature flagging or gating mechanism to control client experience? If you’re launching a new UI can customers opt-in or opt-out? If a feature is about to start behaving differently can clients opt back into the old behaviour if needed?

Having feature release independent of code changes is an excellent idea on many levels, but over the holiday break, it’s about separation of risk and change coordination.
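For a sense of how little machinery this needs, here is a minimal in-process feature flag sketched in Python. Everything in it (the FeatureFlag class, the flag name, the customer IDs) is hypothetical; in practice you would more likely use a flagging service or store this state somewhere you can change without a deploy.

```python
import hashlib


class FeatureFlag:
    """A gate that decides per customer whether deployed code is visible."""

    def __init__(self, name: str, enabled: bool = False, rollout_percent: int = 0):
        self.name = name
        self.enabled = enabled  # global kill switch
        self.rollout_percent = rollout_percent  # 0 to 100
        self.opt_in = set()  # customers who asked for the new behaviour
        self.opt_out = set()  # customers who asked for the old behaviour

    def is_enabled_for(self, customer_id: str) -> bool:
        if not self.enabled:
            return False  # the kill switch overrides everything else
        if customer_id in self.opt_out:
            return False
        if customer_id in self.opt_in:
            return True
        # Stable bucketing: the same customer always lands in the same bucket,
        # so their experience does not flip between requests.
        key = f"{self.name}:{customer_id}".encode("utf-8")
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
        return bucket < self.rollout_percent


flag = FeatureFlag("new-checkout-ui", enabled=True, rollout_percent=0)
flag.opt_in.add("customer-42")
```

With the rollout at 0 percent, the code is in production but only customer-42 sees it. Setting enabled to False switches it off for everyone, opted in or not, without another deploy.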

Ok, so this is just a starting point, and much of it is just standard best-practice engineering with a hint of caution. If you are charging ahead with a holiday release in the next two weeks then I wish you (and your colleagues, and your suppliers, and your users) good luck! Hopefully, the above points help you do some proactive planning, so you’re ready for whatever might happen.