The human risk of failure
I recently missed a flight by 7 minutes because my bus to the airport accrued 1 hour of delays. After trying everything with the staff at the boarding gate, I walked back with my tail between my legs, through the duty-free shops, through security, to buy another plane ticket. I got to talking with the lady at the check-in desk, and learned that airlines overbook seats. If everyone on my upcoming flight were to show up, I'd be reimbursed £400. On top of that, the airline would book me on the next available flight and pay for my overnight hotel stay if needed. Pretty sweet, no? There's even a law which stipulates reimbursements up to £500! However, she said that they usually have around 10 to 12 no-shows, so I shouldn't get my hopes up.
But then I realised: wait, does that mean that airlines have put a percentage on things going wrong in day-to-day life (transport, Internet problems, bureaucracy, people delays) and that they bet money on it through overbooking?
Yes.
We can also use this percentage for other people problems, and maybe think about our software systems in a different light too.
But first, some fun math.
The flight was on an Airbus A321neo. According to Airbus, this plane can seat 180-220 people. I flew with Wizz Air. According to their About page and this news article, their seating configuration holds 239 passengers.
While I was striking up a conversation with the new boarding crew, I saw on their monitor that the flight had 12 no-shows, so about 5% of the 239 seats, and 3 overbookings, about 1%. I couldn't find a law that specifies an exact percentage allowed for overbooking, but that doesn't really matter. What matters is that on this particular flight, 5% of the passengers were no-shows. This number fits with what the lady at the check-in desk told me and with this article that says the no-show percentage is around 5%-15%.
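To make the arithmetic concrete, here is a minimal sketch of the numbers above. The seat count and the two counters come from the text; the `prob_someone_bumped` helper is my own illustration and assumes every passenger skips the flight independently with the same probability, which is a simplification and almost certainly not how an airline models it.

```python
from math import comb

SEATS = 239       # Wizz Air A321neo configuration mentioned above
NO_SHOWS = 12     # counter seen on the gate monitor
OVERBOOKED = 3    # counter seen on the gate monitor


def share(count: int, total: int) -> float:
    """Return count as a fraction of total."""
    return count / total


def prob_someone_bumped(seats: int, overbooked: int, p_no_show: float) -> float:
    """Probability that more passengers show up than there are seats.

    Assumption (hypothetical model): each of the seats + overbooked
    ticket holders independently fails to show with probability p_no_show.
    """
    booked = seats + overbooked
    p_show = 1.0 - p_no_show
    # Someone is bumped when the number of passengers showing up exceeds `seats`.
    return sum(
        comb(booked, k) * p_show**k * p_no_show ** (booked - k)
        for k in range(seats + 1, booked + 1)
    )


if __name__ == "__main__":
    print(f"no-show share:    {share(NO_SHOWS, SEATS):.1%}")
    print(f"overbooked share: {share(OVERBOOKED, SEATS):.1%}")
    print(f"P(someone bumped) at 5% no-shows: "
          f"{prob_someone_bumped(SEATS, OVERBOOKED, 0.05):.5f}")
```

Under that independence assumption, the airline's bet looks very safe at a 5% no-show rate, which matches the check-in lady's advice not to get my hopes up.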
So, humans and the systems they use, in at least part of their life, have an 85%-95% SLA.
Might be worth accounting for this:
- How about adding this buffer into our on-call rotation SLAs: there's a 5-15% chance the on-call person's internet will fail, or they're stuck on the highway and don't have a good enough 5G connection for their laptop, or can't get a PR approval because the CODEOWNERS is broken, etc.
- Add 5-15% more time to your project's time estimation. (That, I hope, you've already doubled.)
- Switching gears: there's a 5-15% chance that your sales person won't make the meeting in time.
- Maybe the groom didn't show up because of this SLA, not because they don't love you.
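The first two bullets can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the 10% default is the midpoint of the 5-15% range above, and `responders_needed` assumes each on-call person is unreachable independently of the others (which a shared outage would break).

```python
def padded_estimate(base_days: float, human_buffer: float = 0.10) -> float:
    """Add the 'human SLA' buffer on top of an estimate."""
    return base_days * (1 + human_buffer)


def responders_needed(p_unreachable: float, target: float = 0.99) -> int:
    """Smallest number of on-call people so that at least one is
    reachable with probability >= target.

    Assumption: people are unreachable independently of each other.
    """
    if not 0 < p_unreachable < 1:
        raise ValueError("p_unreachable must be strictly between 0 and 1")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    n = 1
    # p_unreachable ** n is the chance that all n people are unreachable.
    while p_unreachable**n > 1 - target:
        n += 1
    return n


if __name__ == "__main__":
    print(padded_estimate(20))            # a 20-day plan with the 10% buffer
    print(responders_needed(0.15, 0.99))  # pessimistic end of the range
    print(responders_needed(0.05, 0.99))  # optimistic end of the range
```

The point of the second function is that a single on-call person cannot beat their own 85-95% reachability; a secondary (or tertiary) is what buys the extra nines.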
You can also think about engineering and testing in a quirky new way: simulating conditions where certain services fail is akin to overbooking for potential no-show passengers; accommodating unknown state or incomplete actions in a system (by, for example, implementing a cache or retry logic) is the same as admitting an overbooked passenger onto the flight.
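As an illustration of the retry logic mentioned above, here is a minimal sketch of retrying with exponential backoff. The name `call_with_retry` and its parameters are hypothetical, not taken from any particular library, and the `sleep` argument exists only so the waiting can be swapped out in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on any exception with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts,
    and re-raises the last error once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: the passenger really is a no-show
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_service() -> str:
        # Hypothetical dependency that fails on its first two calls.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("service did not show up")
        return "boarded"

    print(call_with_retry(flaky_service, attempts=3, base_delay=0.1))
```

In a real system you would likely catch only the errors you know to be transient and add some jitter to the delay; both are left out here to keep the sketch short.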
If nothing else, take this from my essay: set up contingency plans for any process involving humans and add ~10% more time to the primary plan's duration. I feel like tech engineers and their managers sometimes forget that it's not just about how long it takes to solve a problem, it's about the person solving it too.