Fail slow to recover fast

June 3, 2024

Back to work after a relaxing long weekend.

Today, intercontinental active/active deployments for API traffic, naturally on top of HTTP.

The keywords there already hint that you have multiple load balancers, at least a pair. What happens behind each load balancer, say, traffic entering the Americas and being served by a Europe back-end, is out of scope.

You want to control how quickly the API clients stop hitting the Americas load balancer if that Region goes down.

This is for internal traffic, so you can’t rely on a magical third-party CDN.

How quickly can we flip all traffic away from one of those environments?

Is it really the question?

Is the understanding of four nines that you MUST NOT be offline for more than 8.6 seconds a day?

Or is it that you should design so that, over the course of a year, you don’t end up exceeding 52 minutes of unscheduled downtime?
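The two readings above come from the same arithmetic, just over different windows. A minimal sketch, assuming a 30-day month and a 365-day year (the function name is mine, for illustration only):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_seconds(availability: float, period_days: float) -> float:
    """Allowed downtime, in seconds, for a given availability over a period."""
    return (1.0 - availability) * period_days * SECONDS_PER_DAY


FOUR_NINES = 0.9999

per_day_s = downtime_budget_seconds(FOUR_NINES, 1)
per_month_min = downtime_budget_seconds(FOUR_NINES, 30) / 60
per_year_min = downtime_budget_seconds(FOUR_NINES, 365) / 60

print(f"per day:   {per_day_s:.2f} seconds")
print(f"per month: {per_month_min:.2f} minutes")
print(f"per year:  {per_year_min:.2f} minutes")
```

Same SLA, very different design pressure: a handful of seconds a day leaves no room for a human or a cautious program to think, while the yearly window does.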

I’m still reasonably skeptical of these quick-failover methods. I did suggest two Lambdas, each in a different AWS Region, one running on odd minutes and the other on even minutes, both checking the systems. This way you get pretty good coverage of both views of the world, cross-Region and resilient to several failure cases, without paying the exorbitant $2.50 per hour for every Route 53 ARC cluster configured.

As controversial as the above sounds, Lambdas are a great way to cover gaps in the AWS offering.

Maybe one should just pay the $2000 pcm instead?
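For reference, this is how the hourly price turns into a monthly figure. A quick sketch using only the $2.50 per hour quoted above and an average-length month; it covers a single cluster and nothing else on the bill:

```python
ARC_CLUSTER_USD_PER_HOUR = 2.50   # price quoted in the text, per cluster
HOURS_PER_MONTH = 24 * 365 / 12   # average month, 730 hours

monthly_usd = ARC_CLUSTER_USD_PER_HOUR * HOURS_PER_MONTH
print(f"one cluster: ${monthly_usd:.0f} per month")
```

That lands a little under the round $2000 figure.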

Still, there’s an important lesson in GitHub’s October 2018 outage, where a network partition ended in a split-leader situation: both sides accepted writes to the database and then couldn’t synchronise anymore.

Writing this, I remember a gig almost twenty years ago now, when I got contracted to write a program to reconcile the billing information of a VoIP system that had incorrect information after a split-brain on a MySQL cluster.

Recovering from these situations is hard, and we don’t stumble upon the right decisions. They require careful planning.

As an addendum, my suggestion for discovering your tolerance for a four nines SLA is to write more than just diagrams. Write the story, explain to yourself and to your stakeholders what you are trying to achieve. For example:

From two independent locations, set up the monitoring and decision making program.

Every odd minute, the program runs from location A.

Every even minute, the program runs from location B.

Both programs have the same set of inputs and verify that the service is available on locations A and B: each monitoring program checks its own Region and its sibling.

If the monitoring program is unable to reach one of the locations, it updates a DNS TXT record to state that it intends to shift the traffic.

Its counterpart in the other Region can then validate this and, if it agrees, shift the traffic to just the Region that is still available.

If the two monitoring locations disagree, make no change to the traffic pattern and raise an alert to the operator, as there’s a potential split-brain.

This operation for traffic selection is entirely DNS-based, with a minimum 60s TTL on the A/CNAME record on the data path.
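The story so far can also be sketched as a small decision function. This is only an illustration under my own assumptions: the names (`Action`, `Decision`, `decide`, `runner_for_minute`) are hypothetical, `pending` stands for whatever the sibling wrote to the TXT record, and the health checks and DNS updates are left out entirely. The case where a runner can reach nothing at all is not in the story; here I assume it alerts rather than acts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Action(Enum):
    NO_CHANGE = "no_change"
    PROPOSE_SHIFT = "propose_shift"    # write the intent to the TXT record
    SHIFT_TRAFFIC = "shift_traffic"    # update the A/CNAME record
    ALERT_OPERATOR = "alert_operator"  # leave traffic alone, page a human


@dataclass(frozen=True)
class Decision:
    action: Action
    drain: Optional[str] = None  # location to take out of rotation, if any


def runner_for_minute(minute: int) -> str:
    """Odd minutes run from location A, even minutes from location B."""
    return "A" if minute % 2 == 1 else "B"


def decide(observed: Dict[str, bool], pending: Optional[str]) -> Decision:
    """Decide what this run should do.

    observed: location -> reachable, as seen by the current runner.
    pending:  location the sibling proposed to drain, or None.
    """
    down = sorted(loc for loc, ok in observed.items() if not ok)
    if len(down) > 1:
        # Assumption: seeing everything down is more likely our own problem.
        return Decision(Action.ALERT_OPERATOR)
    failed = down[0] if down else None

    if pending is None:
        if failed is None:
            return Decision(Action.NO_CHANGE)
        return Decision(Action.PROPOSE_SHIFT, drain=failed)

    if failed == pending:
        # Both views of the world agree, so it is safe to shift.
        return Decision(Action.SHIFT_TRAFFIC, drain=failed)

    # The views disagree: potential split-brain, do not touch traffic.
    return Decision(Action.ALERT_OPERATOR)
```

For example, `decide({"A": False, "B": True}, None)` proposes draining A, and the sibling's next run with `pending="A"` shifts traffic only if it also fails to reach A.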

With this design, if a Region goes down, it takes about 3 minutes to make the decision to shift traffic.

That’s most of a month’s downtime budget on four nines (roughly 4.3 minutes), but you have a huge degree of confidence that you are not shifting too quickly.

It must be noted that clients cache DNS for some time, so even after the DNS record is updated, some clients will keep trying the location that became unavailable for longer than those 3 minutes.
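To put numbers on that, here is a rough sketch that adds the DNS TTL on top of the decision time and compares the result with a 30-day four nines budget. It assumes every client honours the 60s TTL exactly, which is optimistic; clients that cache for longer will fare worse.

```python
DECISION_SECONDS = 3 * 60   # time to decide, from the story above
DNS_TTL_SECONDS = 60        # TTL on the A/CNAME record, from the story above
MONTHLY_BUDGET_SECONDS = (1 - 0.9999) * 30 * 24 * 60 * 60  # about 259.2


def client_outage_seconds(decision_seconds: int, ttl_seconds: int) -> int:
    """Outage seen by a client that cached the record just before the change."""
    return decision_seconds + ttl_seconds


outage = client_outage_seconds(DECISION_SECONDS, DNS_TTL_SECONDS)
share = outage / MONTHLY_BUDGET_SECONDS

print(f"client-visible outage: {outage} seconds")
print(f"share of the monthly budget: {share:.0%}")
```

One Region failure handled this way leaves very little of the month's budget for anything else.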

Writing a story like this helps highlight all assumptions you are making or missing, and allows you to consider whether other trade-offs, for instance running the monitoring program at shorter intervals, are worth the cost.