Tag Archives: Outages

The Reality of Disaster Recovery Planning and Testing

As recent events have shown, outages and disasters are a fact of life in this modern world. Given the number of different platforms that data sits on today, disasters can come in many shapes and sizes, lead to data loss and impact business continuity. Because major, wide-scale disasters occur far less often than smaller disasters within a datacenter, it's important to plan and test cloud disaster recovery models for the smaller disasters that can happen at different levels of the platform stack.

Because disasters can lead to revenue, productivity and reputation loss, it’s important to understand that having cloud based backup is just one piece of the data protection puzzle. Here at Veeam, we empower our cloud and service providers to offer services based on Veeam Cloud Connect Backup and Replication. However, the planning and testing of what happens once disaster strikes is ultimately up to either the organizations purchasing the services or the services company offering Disaster Recovery as a Service (DRaaS) that is wrapped around backup and replication offerings.

Why it’s Important to Plan:

In theory, planning for a disaster should be completed before selecting a product or solution. In reality, it's common for organizations to purchase cloud DR services without an understanding of what needs to be put in place before workloads are backed up or replicated to a cloud provider or platform. Concepts like recovery time and recovery point objectives (RTPO) need to be understood and planned for so that, if a disaster strikes and failover occurs, applications are not only recovered within SLAs, but the data on those recovered workloads is also useful in terms of its age.

Smaller RTPO values go hand-in-hand with increased complexity and administrative overhead. When planning ahead, it's important to size your cloud disaster recovery platform and build the right disaster recovery model, tailored to your needs. When designing your DR plan, target strategies that relate to your core line-of-business applications and data.
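To make the RTPO idea concrete, here is a minimal sketch (the function name, timeline and objective values are hypothetical, not from any Veeam product) of how a recovery event can be checked against its two objectives: RPO bounds how old the recovered data may be, RTO bounds how long the service may be down.

```python
from datetime import datetime, timedelta

def meets_objectives(last_backup, failure_time, recovery_done,
                     rpo: timedelta, rto: timedelta):
    """Check a recovery event against its objectives.

    RPO: maximum acceptable data age (failure time minus last backup).
    RTO: maximum acceptable downtime (recovery completion minus failure time).
    Returns a (rpo_met, rto_met) pair of booleans.
    """
    data_loss_window = failure_time - last_backup
    downtime = recovery_done - failure_time
    return data_loss_window <= rpo, downtime <= rto

# Hypothetical timeline: nightly backup at 02:00, failure at 09:00,
# service recovered by 11:00, against a 12-hour RPO and 4-hour RTO.
backup = datetime(2017, 5, 1, 2, 0)
failure = datetime(2017, 5, 1, 9, 0)
recovered = datetime(2017, 5, 1, 11, 0)

rpo_ok, rto_ok = meets_objectives(backup, failure, recovered,
                                  rpo=timedelta(hours=12),
                                  rto=timedelta(hours=4))
print(rpo_ok, rto_ok)  # both True: 7h of data loss and 2h of downtime
```

Tightening either objective (say, an RPO of one hour) would force more frequent backups or replication, which is exactly the complexity and overhead trade-off described above.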

A staged approach to recovery means that you recover tier-one applications first so the business can still function. A common tier-one example is the mail server. Another is the payroll system, without which an organization may be unable to pay its staff. Once your key applications and services are recovered, you can move on to recovering data. Keep in mind that archival data generally doesn't need to be recovered first. Again, it's important to categorize the systems your data sits on and then work those categories into your recovery plan.
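The staged approach above amounts to ordering your recovery run book by tier. A minimal sketch (the workload names and tier assignments are made up for illustration) of such an inventory and the order it produces:

```python
# Hypothetical workload inventory: lower tier number = recover first.
workloads = [
    {"name": "archive-file-server", "tier": 3},  # archival data: recover last
    {"name": "mail-server",         "tier": 1},  # tier-one: business-critical
    {"name": "payroll-db",          "tier": 1},
    {"name": "intranet-portal",     "tier": 2},
]

# Sort by tier; Python's sort is stable, so ties keep their listed order.
recovery_order = sorted(workloads, key=lambda w: w["tier"])
for w in recovery_order:
    print(f"recover tier {w['tier']}: {w['name']}")
```

Even a simple table like this, agreed on before a disaster, removes arguments about sequencing while the clock is running.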

Planning should also include specific tasks and controls that need to be followed up on and adhered to during a disaster. It's important to have specific run books executed by specific people for a smoother failover. Finally, it is critical to make sure that all IT staff know how to access applications and services after failover.

Why it’s Important to Test:

When talking about cloud based disaster recovery models, there are a number of factors to consider before a final sign-off and validation of the testing process. Once your plan is in place, test it regularly and make adjustments if issues arise from your tests. Partial failover testing should be treated with the same level of criticality as full failover testing.

Testing your DR plan ensures that business continuity can be achieved in a partial or full disaster. Beyond core backup and replication services testing, you should also test networking, server and application performances. Testing should even include situational testing with staff to be sure that they are able to efficiently access key business applications.

Cloud Disaster Recovery Models:

There are a number of different cloud disaster recovery models, which can be broken down into three main categories:

  • Private cloud
  • Hybrid cloud
  • Public cloud

Veeam Cloud Connect technology works for hybrid and public cloud models, while Veeam Backup & Replication works across all three models. The Veeam Cloud & Service Provider (VCSP) program offers Veeam Cloud Connect backup and replication, classified as hybrid clouds offering RaaS (recovery-as-a-service). Public clouds, such as AWS and Azure, can be used with Veeam Backup & Replication to restore VM workloads. Private clouds are generally internal to organizations and leverage Veeam Backup & Replication to replicate, back up, or create backup copies of VMs between datacenter locations.

The ultimate goal here is to choose the cloud recovery model that best suits your organization. Each of the models above offers technological diversity and a different price point, and each is planned, tested and, ultimately, executed differently when disaster strikes.

When a partial or full disaster strikes, a thoroughly planned and well-tested DR plan, backed by the right disaster recovery model, will help you avoid a negative impact on your organization’s bottom line. Veeam and its cloud partners, service-provider partners and public cloud partners can help you build a solution that’s right for you.

First published on veeam.com by me – modified and updated for republishing today

The Reality of Cloud – Outages are Like *holes…

It’s been a bad couple of weeks for cloud services both around the world and locally…Over the last three days we have seen AWS have issues which may have been indirectly related to the Leap Second on Tuesday night and this morning, Azure’s Sydney Zone had serious network connectivity issues which disrupted services for approximately three to four hours.

Closer to home, Zettagrid had a partial outage of our Sydney Zone last Wednesday morning which impacted a small subset of client VMs and services, and this was on the back of a major (unnamed) provider in Europe being down for a number of days, as pointed out in a blog post by Massimo Re Ferre' linked below.

http://it20.info/2015/06/iaas-cloud-outages-get-over-it/ 

Massimo's post struck a chord with me, and as its title suggests, it's time for consumers of public cloud services to get over outages and understand that when it comes to Cloud and IaaS…outages will happen.

When you hear someone saying “I moved to the cloud because I didn’t want to experience downtime” it is fairly clear to me that you either have been heavily misinformed or you misunderstood what the benefits of a (IaaS) public cloud are

Regardless of whether you are a juggernaut like Amazon, Microsoft or Google…or one of the smaller service providers…the reality of cloud services is that outages are a fact of life. Even SaaS-based applications are susceptible to outages, and it must be understood that there is no magic in the architecture of cloud platforms. While every effort goes into ensuring availability and resiliency, Massimo sums it up well below.

Put it in (yet) another way: a properly designed public cloud is not intrinsically more reliable than a properly designed Enterprise data center (assuming like for like IT budgets).

That is because sh*t happens…

The reality of what can be done to prevent service disruption is for consumers of cloud services to look beyond the infrastructure and think more about the application. This message isn't new, and the methods larger companies use when deploying business-critical services and applications are starting to change…however, not every company can be a Netflix or a Facebook, so in breaking it down to a level that's achievable for most…the question is:

How can everyday consumers of cloud services architect applications to work around the inevitable system outage?

  1. Think about a multi cloud or hybrid cloud strategy
  2. Look for Cloud Service Providers that have multiple Availability Zones
  3. Make sure that the Availability Zones are independent of one another
  4. Design and deploy business critical applications across multiple Zones
  5. Watch out for Single Points of Failures within Availability Zones
  6. Employ solid backup and recovery strategies

The key to the points above is to not put all your eggs into one basket and then cry foul when that basket breaks…do not set an expectation whereby you become complacent because all Cloud Service Providers guarantee a certain level of system uptime through SLAs, and then act surprised when an outage occurs. Most providers worth their salt do offer separate availability zones…but it's very much up to the people designing and building services upon Service Provider clouds to ensure that they are built to take advantage of this fact…you can't come in stamping your feet and crying foul when the resources placed at your disposal to ensure application and service continuity haven't been taken advantage of.

Do not plan for 100% uptime…it does not exist! Anyone who tries to tell you otherwise is lying! You only have to search online to see that Outages are indeed like Assholes…everyone has them!

References:

http://au.pcmag.com/internet-products/35269/news/aws-outage-takes-down-netflix-pinterest

http://it20.info/2015/06/iaas-cloud-outages-get-over-it/

https://downdetector.com/status/aws-amazon-web-services/news/71670-problems-at-amazon-web-services