Prepare to Fail: Lessons from the AWS S3 Incident

As I type this, while obsessively checking the server logs and #awsoutage on Twitter, I am reminded of the Benjamin Franklin quote: "By failing to prepare, you are preparing to fail." Amazon Web Services (AWS) is currently suffering from an unspecified problem with S3, its Simple Storage Service. S3 is where companies store files, and it underpins a fair chunk of the internet. Lots of websites host their content on S3, and many services depend on it to some degree. AWS has a huge share of the cloud services market, so any outage there is a big deal.

The thing is, S3 has a 99.9% uptime guarantee. That is well into the realm of high-availability services, but it is not 100%. "Three nines" allows for up to about 43.8 minutes of downtime in a given month, or 8.76 hours per year. Each additional "nine" cuts the permitted downtime by a factor of ten. At "nine nines" you are down to 2.6297 ms per month, which is sort of terrifying if you think about how long a single API request might take to complete. At S3 levels of service you can expect the system to be mostly up, with occasional blips. Unfortunately, if you host half the internet, a blip becomes headline news.
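The arithmetic behind those figures is easy to reproduce. The sketch below is only an illustration: the function and constant names are made up for this post, and it assumes an average Gregorian year of 365.2425 days split into twelve equal months.

```python
# Assumption: average Gregorian year, divided into 12 equal "months".
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12


def allowed_downtime_seconds(nines, period_seconds):
    """Downtime budget for an availability with the given number of nines.

    Three nines means 99.9% availability, i.e. 10**-3 of the period may be down.
    """
    unavailability = 10.0 ** -nines
    return period_seconds * unavailability


def human(seconds):
    """Format a duration using a unit that keeps the number readable."""
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} h"
    if seconds >= 60:
        return f"{seconds / 60:.2f} min"
    if seconds >= 1:
        return f"{seconds:.2f} s"
    return f"{seconds * 1000:.4f} ms"


if __name__ == "__main__":
    for nines in range(1, 10):
        per_month = allowed_downtime_seconds(nines, SECONDS_PER_MONTH)
        per_year = allowed_downtime_seconds(nines, SECONDS_PER_YEAR)
        print(f"{nines} nines: {human(per_month)} per month, {human(per_year)} per year")
```

Under these assumptions three nines comes out at roughly 43.8 minutes per month and nine nines at roughly 2.63 ms per month. The yearly figure for three nines is about 8.77 hours here; the 8.76 hours quoted above corresponds to a plain 365-day year.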

Here at CriticalBlue our impact on the smooth running of the internet is much smaller, but it is still important, and any interruption to our service is a top priority for us. Our Approov customers depend on us to secure their mobile APIs, so if we go down, their service is directly affected. That is why we are always trying to improve our response to various degrees of service outage, in amongst all the other competing priorities for the product. We are far from perfect, but I wanted to highlight one aspect of our system that has so far (fingers crossed) allowed us to maintain our service through this S3 outage.

In order to authenticate mobile apps, we send data from our mobile SDK to our attestation servers. This data is checked against signatures registered with our service when the app is created, to verify it has not been modified. Once we have confirmed the authenticity of the app, we issue a token that it then uses to access the APIs we are protecting. It works in much the same way as a user authentication scheme, but instead of authenticating a user, we are authenticating an app.
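To make the shape of that flow concrete, here is a deliberately simplified sketch. It is not how Approov works internally: the secret, the token format, the five-minute lifetime and every function name are assumptions made up for illustration. It only shows the general pattern of checking a reported measurement against registered values and then issuing a short-lived signed token that an API can verify.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical secret shared between the attestation service and the API.
SECRET = b"demo-secret-for-illustration-only"


def issue_token(app_id, measurement, registered, now=None, ttl=300):
    """Return a signed token if the measurement matches a registered value."""
    matches = any(hmac.compare_digest(measurement, known) for known in registered)
    if not matches:
        return None  # the app does not match anything registered
    issued_at = time.time() if now is None else now
    payload = json.dumps({"app": app_id, "exp": int(issued_at) + ttl}).encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_token(token, now=None):
    """Check the signature and expiry, as the protected API would."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return False  # malformed token
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return False
    current = time.time() if now is None else now
    return json.loads(payload)["exp"] > current


if __name__ == "__main__":
    registered = {"aa11bb22"}  # hypothetical measurement registered for the app
    token = issue_token("app-1", "aa11bb22", registered)
    print("genuine app accepted:", token is not None and verify_token(token))
    print("modified app rejected:", issue_token("app-1", "ffff0000", registered) is None)
```

The registered values are the data the attestation step cannot work without, which is what makes their availability matter in an outage.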

We store part of the data we use for the app attestation step in S3. Quite apart from any thoughts of robustness, we want to check apps quickly, and to do that the data we rely on has to be available on the machines handling the requests. So we cache the data locally, periodically going out to S3 and our database to retrieve any updated information. Because each attestation machine holds its own copy of the data, we can continue to perform checks even if we lose access to S3 or the database. The worst that should happen is that we do not pick up the latest changes. But if the database or S3 is down, it is unlikely anyone will be able to make any changes anyway.
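The idea can be sketched in a few lines. This is an illustration of the pattern rather than our production code: the class name, the `fetch` callable and the refresh interval are all hypothetical, and the refresh happens lazily on read here, where a real service would more likely use a background task.

```python
import time


class StaleTolerantCache:
    """Local copy of remote data that keeps serving the last good snapshot."""

    def __init__(self, fetch, refresh_interval, clock=time.monotonic):
        self._fetch = fetch                  # callable returning a fresh dict
        self._refresh_interval = refresh_interval
        self._clock = clock                  # injectable so it can be tested
        self._data = {}
        self._last_attempt = None
        self.last_success = None             # time of the last good refresh

    def _maybe_refresh(self):
        now = self._clock()
        due = (
            self._last_attempt is None
            or now - self._last_attempt >= self._refresh_interval
        )
        if not due:
            return
        self._last_attempt = now
        try:
            fresh = self._fetch()
        except Exception:
            # Upstream (S3, the database) is unreachable: keep the old copy.
            return
        self._data = fresh
        self.last_success = now

    def get(self, key):
        self._maybe_refresh()
        return self._data.get(key)


if __name__ == "__main__":
    def fetch_from_storage():
        # Stand-in for a real download; always fails to simulate an outage.
        raise ConnectionError("storage unavailable")

    cache = StaleTolerantCache(fetch_from_storage, refresh_interval=60.0)
    print(cache.get("app-1"))  # None: nothing was ever cached in this demo
```

Tracking `last_success` separately is a small design choice that pays off during an outage: it lets you see, and alert on, how stale the local copy has become.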

The solution is not complete by any means, and we have various plans to further increase the robustness of our system, but it does mean we can still work even if S3 goes down. I say "if", when what I really mean is "when". A "three nines" uptime guarantee means the service is pretty much certain to go down at some point, so if your customers depend on your service, it makes sense to spend a little time thinking about how best to handle the disappearance of vital bits of infrastructure. We also depend on other AWS services with uptime guarantees below 100%, and the more external services you depend on, the harder it is to maintain your own service levels. That brings me back to the title of the post. Because we depend on infrastructure provided by others, we have to prepare to fail: in an increasingly cloud-based world, it is an absolute certainty that from time to time you will have to deal with major system outages outside your control.
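A rough way to see why every extra dependency hurts: if your service needs all of its dependencies to be up, and their failures are independent, the availabilities multiply. Both of those are simplifying assumptions (real outages are often correlated), and the 99.9% figures below are illustrative rather than the guarantees of any particular service.

```python
def combined_availability(availabilities):
    """Availability of a service that needs every dependency to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    for count in (1, 3, 5, 10):
        total = combined_availability([0.999] * count)
        print(f"{count:2d} dependencies at 99.9% -> {total * 100:.2f}% combined")
```

Three hard dependencies at 99.9% each already drag the combined figure down to roughly 99.7%. A local cache does not improve the availability of the service it fronts, but it removes that service from the list of things that must be up for every single request.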

I don't want to tempt fate, so I will leave you with this irony: though our service is still operating normally, I will be unable to tell you all about it until S3 recovers, because our website is hosted there and we can't upload any new content. More preparation still definitely required.

Jae Hossell

CTO of Approov
As an expert software architect and engineer, Jae brings a profound understanding of computer architecture, algorithms, data structures, and systems design. Over two decades of experience have allowed Jae to master a diverse range of technologies and skills including novel architectures, embedded and mobile operating systems, compilers, virtual machines, desktop applications, and comprehensive full-stack cloud-based services. Jae’s app-security expertise has evolved over the last 10 years, as he has immersed himself in the app-security space to continually advance and develop the Approov mobile security solution.