How to plan for disaster recovery on the cloud


Unless they’re web developers or designers, most people don’t look at a well-functioning site and think, “Wow, I’m having a great time navigating this interface.” However, the moment a website goes down or stops working, people remember: the stress of missing a time-critical action on the site, the inconvenience of having to find something else to do in the meantime, and the annoyance of not being able to do what they wanted to do.

Uptime is critical for any business that hosts a web application, because it reflects the health of your website, which directly translates into customer satisfaction. A 2016 study from the Ponemon Institute estimated that the average data centre outage costs a business almost $9,000 per minute. To mitigate losses across all fronts, every business should have a tried-and-tested disaster recovery plan.

Today, we’re going to review disaster recovery on the cloud and how your business can prepare for outages.

What is disaster recovery on the cloud?

In general, disaster recovery is risk planning for negative events that stop mission-critical functions of a business. The idea behind a disaster recovery plan is to get those functions up and running again as fast as possible.

Specifically, disaster recovery on the cloud is more of a backup and restore strategy: your web application goes down and you need to recover the data, applications, and infrastructure, or you need failover to kick in. Whether it was a natural disaster, a cyber attack, or planned infrastructure work, production was shut down and you need it back up.

How disaster recovery on the cloud is different from traditional disaster recovery 

Disaster recovery on the cloud differs from traditional disaster recovery because the cloud is built on virtualization. As a result, cloud-based infrastructure is less dependent on physical hardware than traditional on-premises solutions.

With an on-premises solution, you would have to build a second data centre to store a backup, and if both data centres were compromised, you’d be in a disaster recovery crisis.

On the flip side, with a cloud-based environment, you can transfer the data, apps, and infrastructure over to a different data centre and spin them up on another virtual server within minutes. At stack.io, we use techniques such as Infrastructure as Code to store a blueprint of the cloud services and resources, as well as configuration management to configure the services and infrastructure. These techniques make provisioning quick, easy, and reproducible. For the actual data, we implement database backups that we can restore to the last saved state, if needed. Of course, the cloud is not infallible, as you may remember from the AWS outage of August 2019, but the chances of something like that happening again are pretty low.
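To make the blueprint idea more concrete, here is a minimal Python sketch. It is not stack.io's actual tooling: the `Resource` and `Blueprint` types, the `plan_failover` helper, and the region names are all hypothetical, invented for illustration. Real Infrastructure as Code tools do far more, but the principle is the same: because the environment is described as data, it can be recreated somewhere else instead of being rebuilt by hand.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Resource:
    """One cloud resource in the blueprint (hypothetical shape)."""
    name: str
    kind: str      # e.g. "vm", "database", "load_balancer"
    region: str


@dataclass(frozen=True)
class Blueprint:
    """A declarative description of an environment, kept in version control."""
    resources: Tuple[Resource, ...]


def plan_failover(blueprint: Blueprint, target_region: str) -> Blueprint:
    """Return a copy of the blueprint with every resource moved to target_region."""
    moved = tuple(replace(r, region=target_region) for r in blueprint.resources)
    return Blueprint(resources=moved)


# Region names below are placeholders, not real provider identifiers.
production = Blueprint(resources=(
    Resource("web", "vm", "region-a"),
    Resource("orders-db", "database", "region-a"),
))

standby = plan_failover(production, "region-b")
for resource in standby.resources:
    print(f"provision {resource.kind} '{resource.name}' in {resource.region}")
```

The original blueprint is left untouched, so the same definition can be pointed at any number of fallback regions.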

An effective plan on the cloud shifts the disaster recovery tradeoff curve to the left, resulting in lower overall costs for faster recovery.

 
[Figure: Disaster recovery graph]
 

The process 

Complete a risk assessment 

The first step to creating a disaster recovery plan is to do a complete risk assessment of your infrastructure. At a minimum, as you brainstorm the different threats that your infrastructure faces, you should cover these three main questions:

  1. Is the threat natural or man-made?

  2. How likely is the threat to occur? 

  3. How preventable is the threat?

Here are some examples of common risks that web application-based companies face.

Physical hardware failure

  1. Is the threat natural or man-made? Could be either

  2. How likely is the threat to occur? Less likely on the cloud where there’s less dependence on physical hardware

  3. How preventable is the threat? Not preventable because it’s out of the business’ control

Information security issue

  1. Is the threat natural or man-made? Man-made

  2. How likely is the threat to occur? Moderately likely, relative to other threats

  3. How preventable is the threat? Preventable to a degree, by implementing security measures

Unexpected user behaviour

  1. Is the threat natural or man-made? Man-made

  2. How likely is the threat to occur? More likely during higher risk periods for certain industries

  3. How preventable is the threat? Preventable to a degree, by doing user testing

Keep in mind that although this preliminary risk assessment will inform the first draft of your disaster recovery strategy, you should revisit the assessment regularly and keep redefining the risks. Not having an ongoing process for risk identification could itself be considered a risk!
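One way to keep the assessment repeatable is to record it as a small risk register. The Python sketch below is only an illustration: the `Risk` and `Level` types and the `prioritize` helper are hypothetical, and the levels assigned to each threat are rough readings of the examples above rather than measured values.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


@dataclass
class Risk:
    name: str
    origin: str            # "natural", "man-made", or "either"
    likelihood: Level      # how likely the threat is to occur
    preventability: Level  # how much the business can do to prevent it


def prioritize(risks):
    """Most likely first; among equally likely risks, least preventable first."""
    return sorted(risks, key=lambda r: (-r.likelihood, r.preventability))


# Levels are illustrative assumptions, not measurements.
register = [
    Risk("Physical hardware failure", "either", Level.LOW, Level.LOW),
    Risk("Information security issue", "man-made", Level.MODERATE, Level.MODERATE),
    Risk("Unexpected user behaviour", "man-made", Level.HIGH, Level.MODERATE),
]

for risk in prioritize(register):
    print(f"{risk.name}: likelihood={risk.likelihood.name}, "
          f"preventability={risk.preventability.name}")
```

Re-running the same ordering each time the register is updated makes it easier to see when a threat has moved up or down the list.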

Create a plan 

Once you understand the threats to your infrastructure, you can start developing a plan to address them. However, before you jump into specific resolution scenarios, define two key objectives that your organization will aim to achieve if your web application suffers an outage.

Recovery time. Identify the ideal range within which you can get your infrastructure back up and running with the lowest impact on your visitors. This target is often called the recovery time objective, or RTO. If, during a crisis, you fall outside this range, then your disaster recovery plan likely needs improvement.

Recovery point. If you need to roll back the state of your infrastructure, make sure that you’ve a) rolled back to a point in time before the issue occurred (especially if it was a man-made threat like a virus) and b) minimized the inconvenience for your visitors. This target is often called the recovery point objective, or RPO.
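The two objectives can be checked with simple date arithmetic. The following Python sketch assumes a list of backup timestamps and a one-hour recovery time target; the function names `pick_restore_point` and `within_recovery_time`, and every timestamp, are made up for illustration.

```python
from datetime import datetime, timedelta


def pick_restore_point(backups, incident_start):
    """Return the most recent backup taken strictly before the incident began,
    or None if no such backup exists."""
    candidates = [b for b in backups if b < incident_start]
    return max(candidates) if candidates else None


def within_recovery_time(outage_start, service_restored, target):
    """True if the outage lasted no longer than the recovery time target."""
    return service_restored - outage_start <= target


# All timestamps and targets are made-up values for illustration.
backups = [
    datetime(2024, 1, 10, 0, 0),
    datetime(2024, 1, 10, 6, 0),
    datetime(2024, 1, 10, 12, 0),
]
incident_start = datetime(2024, 1, 10, 9, 30)
service_restored = datetime(2024, 1, 10, 10, 15)

restore_point = pick_restore_point(backups, incident_start)
data_loss_window = incident_start - restore_point
print("restore to:", restore_point)
print("data loss window:", data_loss_window)
print("met recovery time target:",
      within_recovery_time(incident_start, service_restored, timedelta(hours=1)))
```

In this example, the 12:00 backup is skipped because it was taken after the incident began, so the restore point is 06:00 and three and a half hours of changes are lost. Backing up more often is the usual way to shrink that window.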

After determining your targets for recovery time and point, you can start to map out the plan for each of your threat scenarios. For each threat, there are four components to an end-to-end recovery plan. 

Prevention. This part of your plan covers the ongoing actions you can take to prevent a disaster. These are tasks like implementing a good backup and restore mechanism, enhancing the security of your environments, and implementing monitoring and alerting to detect issues earlier.

Preparation. The preparation steps spell out what actions to take when a crisis occurs. They should cover the tasks required, when they should be performed, and who is accountable for each task.

Response. The criteria in this section define what triggers the disaster recovery plan. You need to consider not only the incident itself, but also who makes the executive call and how you’re going to inform your team.

Recovery. This is what you’ll need to do post-incident. Actions here may include reviewing the incident and trying to implement more preventative measures, analyzing the disaster recovery process for areas of improvement, and more. 
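A plan with these four components can be kept honest with a simple completeness check. The Python sketch below is one possible shape, not a standard format: `ThreatPlan`, `Task`, and the `gaps` method are hypothetical names, and the tasks and owners are placeholders.

```python
from dataclasses import dataclass, field

COMPONENTS = ("prevention", "preparation", "response", "recovery")


@dataclass
class Task:
    description: str
    owner: str = ""   # who is accountable; empty means unassigned


@dataclass
class ThreatPlan:
    threat: str
    tasks: dict = field(default_factory=lambda: {c: [] for c in COMPONENTS})

    def gaps(self):
        """List human-readable problems: empty components and unowned tasks."""
        problems = []
        for component in COMPONENTS:
            component_tasks = self.tasks.get(component, [])
            if not component_tasks:
                problems.append(f"{component}: no tasks defined")
            for task in component_tasks:
                if not task.owner:
                    problems.append(f"{component}: '{task.description}' has no owner")
        return problems


# Task descriptions and owners are placeholders.
plan = ThreatPlan("Information security issue")
plan.tasks["prevention"].append(Task("Verify nightly database backups", "ops"))
plan.tasks["preparation"].append(Task("Write the restore runbook", "ops"))
plan.tasks["response"].append(Task("Decide whether to trigger the plan"))

for problem in plan.gaps():
    print(problem)
```

Here the check flags the response task that has no accountable owner and the recovery component that has no tasks yet.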

Educate your entire team

Although disaster recovery often only involves small sections of the larger team, it’s important that the entire team is aware of the plan. Even though operations may be primarily responsible for the plan, you wouldn’t want other team members unknowingly interfering with the process! 

Test and improve the process 

As with all processes, your disaster recovery plan should be continually tested and improved. Because it affects mission-critical functions and uptime, there’s added pressure to make sure that the plan runs smoothly.

At a minimum, you should be reviewing and testing the disaster recovery plan once a year, but there are companies that test as often as once per month!
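Part of testing the plan is proving that a backup can actually be restored. As a small, hedged illustration, the Python sketch below uses the standard library's `sqlite3` backup API on a throwaway in-memory database; the `restore_drill` function and the `orders` table are hypothetical stand-ins, and a real drill would restore your production database engine's backups into an isolated environment.

```python
import sqlite3


def restore_drill(source: sqlite3.Connection) -> bool:
    """Copy the database to a scratch connection and compare row counts."""
    scratch = sqlite3.connect(":memory:")
    try:
        source.backup(scratch)  # sqlite3's built-in online backup API
        original = source.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        restored = scratch.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        return original == restored
    finally:
        scratch.close()


# A throwaway in-memory database stands in for production data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.executemany("INSERT INTO orders (total) VALUES (?)",
               [(19.99,), (5.00,), (42.50,)])
db.commit()

print("restore drill passed:", restore_drill(db))
```

Comparing row counts is a deliberately shallow check. A fuller drill would also compare checksums or run application-level queries against the restored copy.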

Conclusion

The link between uptime and customer satisfaction makes uptime a priority for any company with a web application. As a result, it’s crucial that organizations develop a tried-and-tested disaster recovery plan to lessen the impact of downtime.

Do you need help creating a disaster recovery plan? Send us a message and we’ll reach out as soon as we can.