It’s been a bad couple of weeks for cloud services, both around the world and locally. Over the last three days AWS has had issues which may have been indirectly related to the leap second on Tuesday night, and this morning Azure’s Sydney Zone had serious network connectivity issues which disrupted services for approximately three to four hours.

Closer to home, Zettagrid had a partial outage of our Sydney Zone last Wednesday morning which impacted a small subset of client VMs and services. This came on the back of a major (unnamed) provider in Europe being down for a number of days, as pointed out in a blog post by Massimo Re Ferre’, linked below.

http://it20.info/2015/06/iaas-cloud-outages-get-over-it/ 

Massimo struck a chord with me. As the title of his blog post suggests, it’s time for consumers of public cloud services to get over outages and understand that when it comes to Cloud and IaaS…outages will happen.

When you hear someone saying “I moved to the cloud because I didn’t want to experience downtime” it is fairly clear to me that you either have been heavily misinformed or you misunderstood what the benefits of a (IaaS) public cloud are

Whether you are a juggernaut like Amazon, Microsoft or Google, or one of the smaller Service Providers, the reality of cloud services is that outages are a fact of life. Even SaaS-based applications are susceptible to outages. It must be understood that there is no magic in the architecture of cloud platforms, and while every effort goes into ensuring availability and resiliency, Massimo sums it up well below.

Put it in (yet) another way: a properly designed public cloud is not intrinsically more reliable than a properly designed Enterprise data center (assuming like for like IT budgets).

That is because sh*t happens…

The realistic way to prevent service disruption is for consumers of cloud services to look beyond the infrastructure and think more about the application. This message isn’t new, and the methods used by larger companies when deploying business critical services and applications are starting to change. However, not every company can be a Netflix or a Facebook, so breaking it down to a level that’s achievable for most, the question is:

How can everyday consumers of cloud services architect applications to work around the inevitable system outage?

  1. Think about a multi cloud or hybrid cloud strategy
  2. Look for Cloud Service Providers that have multiple Availability Zones
  3. Make sure that the Availability Zones are independent of one another
  4. Design and deploy business critical applications across multiple Zones
  5. Watch out for Single Points of Failure within Availability Zones
  6. Employ solid backup and recovery strategies

The key to the points above is to not put all your eggs into one basket and then cry foul when that basket breaks. Do not become complacent just because Cloud Service Providers guarantee a certain level of system uptime through SLAs, and then act surprised when an outage occurs. Most providers who are worth their salt do offer separate availability zones…but it’s very much up to the people designing and building services on Service Provider clouds to ensure that they are built to take advantage of this fact. You can’t come in stamping your feet when the resources placed at your disposal to ensure application and service continuity are not taken advantage of.
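To put rough numbers on the multi-zone argument, here is a minimal back-of-the-envelope sketch. It assumes each zone fails independently of the others (point 3 in the list) and uses a purely illustrative 99.9% per-zone availability, which is not taken from any provider's SLA. The function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` zones is up.

    Assumes zone failures are fully independent, which real zones
    only approximate (shared networks, shared control planes, etc.).
    """
    return 1 - (1 - zone_availability) ** zones


def expected_downtime_hours(availability: float) -> float:
    """Average hours per year of downtime implied by an availability figure."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 0.999 is an illustrative assumption, not a quoted SLA.
    for zones in (1, 2, 3):
        a = combined_availability(0.999, zones)
        hours = expected_downtime_hours(a)
        print(f"{zones} zone(s): availability {a:.9f}, ~{hours:.6f} hours down per year")
```

Adding zones shrinks the expected downtime quickly, but the result never reaches 1.0, and any shared single point of failure breaks the independence assumption the calculation relies on.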

Do not plan for 100% uptime…it does not exist! Anyone who tries to tell you otherwise is lying! You only have to search online to see that Outages are indeed like Assholes…everyone has them!

References:

http://au.pcmag.com/internet-products/35269/news/aws-outage-takes-down-netflix-pinterest

http://it20.info/2015/06/iaas-cloud-outages-get-over-it/

https://downdetector.com/status/aws-amazon-web-services/news/71670-problems-at-amazon-web-services