AWS outage teaches valuable lesson about redundancy


On August 31st, one of Amazon’s AWS US-EAST-1 data centres in Northern Virginia experienced an unexpected power outage. Although the multinational tech giant’s backup generators promptly came online, they proved unstable, and several high-traffic websites, such as Reddit, suffered hours of downtime.

According to its official statements, Amazon was able to recover at least 95% of the data centre. However, some of the hardware suffered irreversible damage, which led to permanent data loss for some companies hosted exclusively in the US-EAST-1 region. Although this was a rare occurrence, the failure of even Amazon’s cloud service reminds us that no solution is infallible.

Here are some of our best practices for keeping your cloud-hosted app highly available.

Maintain Standby Databases and Redundant App Servers In A Different Availability Zone 

Redundancy is key. Large corporations maintain it by keeping a duplicate stack running in a different region. In the case of the AWS outage, Reddit and other large companies might consider setting up a stack in a western region.

However, the more budget-friendly option is to set up standby databases and redundant app servers in a different availability zone from your live application. By taking these precautions, you considerably reduce the risk of downtime for your app.

During the most recent AWS outage, stack.io was able to help a client quickly bring their app back online by using their redundant infrastructure. The outage took our client’s app down because some of their databases were stored on the volumes affected by the irreversible hardware damage.

As part of our redundancy best practice, our team of public cloud infrastructure consultants had set up read replicas of the affected master databases. We were able to get the app back up and running easily by promoting the read replicas to temporary masters.
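To illustrate the promotion step, here is a minimal sketch of how one might pick which read replica to promote. It is not the procedure we ran for this client; the `Replica` type, the `choose_replica_to_promote` function, and the 30-second lag threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    """Hypothetical description of one read replica."""
    name: str
    availability_zone: str
    lag_seconds: float  # how far the replica trails the master


def choose_replica_to_promote(
    replicas: Sequence[Replica],
    failed_zone: str,
    max_lag_seconds: float = 30.0,  # assumed tolerance, tune per application
) -> Optional[Replica]:
    """Return the freshest replica outside the failed zone, or None."""
    candidates = [
        r for r in replicas
        if r.availability_zone != failed_zone
        and r.lag_seconds <= max_lag_seconds
    ]
    if not candidates:
        return None  # nothing safe to promote; restore from backup instead
    # The smallest lag means the least data lost on promotion.
    return min(candidates, key=lambda r: r.lag_seconds)
```

The point of the sketch is that a replica in the affected zone, or one that has fallen far behind, is a poor candidate for a temporary master.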

Implement Failover Where Appropriate

Failover can be manual or automatic, but the general consensus is that you’d want a human to decide whether or not to fail over. The main reason is that automatic, custom-scripted failover can mistake intermittent network performance issues for an outage and unnecessarily start the time- and resource-consuming failover process.
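One common way to reduce false alarms is to require several consecutive failed health checks before anyone is alerted, and to leave the failover decision itself to a person. The sketch below assumes that approach; the `OutageDetector` class and the threshold of three failures are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class OutageDetector:
    """Hypothetical helper that pages a human only after repeated failures."""
    failures_required: int = 3  # assumed threshold
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result.

        Returns True when enough checks have failed in a row that a human
        should be asked whether to fail over.
        """
        if healthy:
            # A single success resets the count, so brief network
            # blips never add up to an alert.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_required
```

With a threshold of three, a sequence of fail, pass, fail, fail, fail alerts only on the final check, because the early pass resets the count.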

However, some services come equipped with an efficient, low-risk failover integration. In the case of our affected client’s application, we are in the process of migrating their databases to Amazon Relational Database Service (RDS) with the Multi-AZ failover option. With Multi-AZ enabled, Amazon RDS automatically maintains a standby replica of the database in a different availability zone, so the next time an emergency like this occurs, their app should stay live.
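For an RDS instance that already exists, Multi-AZ can be switched on through the AWS SDK. The sketch below assumes boto3 is installed and AWS credentials are configured; the helper names and the instance identifier `example-db` are placeholders, not our client’s setup.

```python
def multi_az_request(db_instance_id: str, apply_immediately: bool = False) -> dict:
    """Build the parameters for an RDS ModifyDBInstance call.

    With apply_immediately=False the change waits for the instance's
    next maintenance window.
    """
    return {
        "DBInstanceIdentifier": db_instance_id,
        "MultiAZ": True,
        "ApplyImmediately": apply_immediately,
    }


def enable_multi_az(db_instance_id: str, region: str = "us-east-1") -> dict:
    """Send the request to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported here so the builder above works without it

    rds = boto3.client("rds", region_name=region)
    return rds.modify_db_instance(**multi_az_request(db_instance_id))


if __name__ == "__main__":
    # Placeholder identifier; print the request rather than sending it.
    print(multi_az_request("example-db"))
```

Keeping the request builder separate from the AWS call makes it possible to review or test the parameters without touching a live database.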

If you need help upgrading your infrastructure’s backups and failover strategy, send us a message and we’ll be in touch shortly.