The Reality of Cloud – Outages are Like *holes…

It’s been a bad couple of weeks for cloud services both around the world and locally…Over the last three days we have seen AWS have issues, which may have been indirectly related to the Leap Second on Tuesday night, and this morning Azure’s Sydney Zone had serious network connectivity issues which disrupted services for approximately three to four hours.

Closer to home, Zettagrid had a partial outage of our Sydney Zone last Wednesday morning which impacted a small subset of client VMs and services. This came on the back of a major (unnamed) provider in Europe being down for a number of days, as pointed out in a blog post by Massimo Re Ferre’ linked below.

http://it20.info/2015/06/iaas-cloud-outages-get-over-it/ 

Massimo struck a chord with me and, as the title of his blog post suggests, it’s time for consumers of public cloud services to get over outages and understand that when it comes to Cloud and IaaS…Outages will happen.

When you hear someone saying “I moved to the cloud because I didn’t want to experience downtime” it is fairly clear to me that you either have been heavily misinformed or you misunderstood what the benefits of a (IaaS) public cloud are

Regardless of whether you are a juggernaut like Amazon, Microsoft or Google…or one of the smaller Service Providers…the reality of cloud services is that outages are a fact of life. Even SaaS based applications are susceptible to outages. It must be understood that there is no magic that goes into the architecture of cloud platforms, and while every effort goes into ensuring availability and resiliency, Massimo sums it up well below.

Put it in (yet) another way: a properly designed public cloud is not intrinsically more reliable than a properly designed Enterprise data center (assuming like for like IT budgets).

That is because sh*t happens…

The reality of what can be done to prevent service disruption is for consumers of cloud services to look beyond the infrastructure and think more about the application. This message isn’t new, and the methods used by larger companies when deploying business critical services and applications are starting to change…however not every company can be a Netflix or a Facebook, so in breaking it down to a level that’s achievable for most…the question is:

How can everyday consumers of cloud services architect applications to work around the inevitable system outage?

  1. Think about a multi cloud or hybrid cloud strategy
  2. Look for Cloud Service Providers that have multiple Availability Zones
  3. Make sure that the Availability Zones are independent of one another
  4. Design and deploy business critical applications across multiple Zones
  5. Watch out for Single Points of Failure within Availability Zones
  6. Employ solid backup and recovery strategies
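To make points 2 to 4 a little more concrete, here is a minimal Python sketch of an application that tries one zone and falls back to the next when the first is unavailable. Everything in it is hypothetical and for illustration only: the zone names `zone-a` and `zone-b`, the `call_with_failover` helper and the simulated `fake_request` backend are not any provider’s API, and a real deployment would more likely lean on load balancers, health checks or DNS failover.

```python
from typing import Callable, Dict, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllZonesFailed(Exception):
    """Raised when no zone could serve the request."""


def call_with_failover(
    zones: Sequence[str], request: Callable[[str], T]
) -> Tuple[str, T]:
    """Try each zone in order; return (zone, result) from the first success."""
    errors: Dict[str, Exception] = {}
    for zone in zones:
        try:
            return zone, request(zone)
        except Exception as exc:  # a real client would catch narrower errors
            errors[zone] = exc
    # Every zone failed: surface all the collected errors to the caller.
    raise AllZonesFailed(errors)


# Simulated backend (hypothetical): "zone-a" is having an outage,
# while "zone-b" is healthy.
def fake_request(zone: str) -> str:
    if zone == "zone-a":
        raise ConnectionError("zone-a unreachable")
    return f"served from {zone}"


served_by, response = call_with_failover(["zone-a", "zone-b"], fake_request)
print(served_by, response)
```

The sketch only helps if the zones really are independent (point 3) and the application is actually deployed in more than one of them (point 4)…if both zones share a single point of failure, the loop simply fails twice.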

The key to the points above is to not put all your eggs into one basket and then cry foul when that basket breaks…do not become complacent just because Cloud Service Providers guarantee a certain level of system uptime through SLAs, and then act surprised when an outage occurs. Most providers who are worth their salt do offer separate availability zones…but it’s very much up to the people designing and building services upon Service Provider Clouds to ensure that they are built to take advantage of this fact…you can’t come in stamping your feet and crying foul when the resources placed at your disposal to ensure application and service continuity are not taken advantage of.

Do not plan for 100% uptime…it does not exist! Anyone who tries to tell you otherwise is lying! You only have to search online to see that Outages are indeed like Assholes…everyone has them!
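A bit of back-of-the-envelope arithmetic shows why 100% is never on the table. The Python sketch below converts an availability figure into expected downtime per year, and estimates the combined availability of an application that stays up as long as at least one zone is up. The 99.9% figure is an assumed example and not any particular provider’s SLA, and the combined number only holds if zone failures are truly independent, which is rarely perfectly true in practice.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(per_zone: float, zones: int) -> float:
    """Availability if the app is up whenever at least one zone is up.

    Assumes zone failures are independent of one another.
    """
    return 1.0 - (1.0 - per_zone) ** zones


single = 0.999  # assumed example figure, not a real provider's SLA
two_zones = combined_availability(single, 2)

print(f"one zone:  {downtime_minutes_per_year(single):.1f} min/year")
print(f"two zones: {downtime_minutes_per_year(two_zones):.2f} min/year")
```

Under these assumptions a single 99.9% zone still allows roughly 8.8 hours of downtime a year. Spreading across two independent zones shrinks that to well under a minute…but the result never reaches 100%, no matter how many zones you add.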

References:

http://au.pcmag.com/internet-products/35269/news/aws-outage-takes-down-netflix-pinterest

http://it20.info/2015/06/iaas-cloud-outages-get-over-it/

https://downdetector.com/status/aws-amazon-web-services/news/71670-problems-at-amazon-web-services

3 comments

  • James Young

    Do we think then that Cloud providers should pay a financial penalty when the service isn’t available? I think that’s a fair solution.

    I think you are kidding yourself if you think that a separate availability zone is going to save your bacon either. If you make a service easily available then it increases the attack surface. Occasionally, being a cloud, someone or lots of someones are going to get rained on.

  • http://www.linkedin.com/grp/post/36781-6023640881364488193

    Roland Wenzke
    Pretty much stating the obvious!
    Like with all Business and IT-Services, no matter whether they are internal, external or hybrid, a simple risk analysis of every service, especially of business critical services, should be done anyway. Ask what it will cost if service x is not available for a minute/a day/a week, and plan accordingly for any outage. Especially with “cloud” type services there is no guarantee that a service is available forever; they could even disappear from the market a lot quicker than anybody would expect.

  • http://www.linkedin.com/grp/post/3094564-6023638341138796548

    Amar Prusty
    Cloud does not guarantee high availability with 0 downtime. Expert Architects have to design the solution with no single points of failure.

    Don Wood
    Plan for the worst and hope for the best. If you can build your applications to be multi-site HA then do so; otherwise, have a way to fail them over quickly and without human intervention. Most importantly, don’t lock yourself into one provider! It’s always best to have options and that is the new role of IT… creating viable options out of a new sea of choices and dead ends.

    Scott McManus
    I agree with Amar. Design matters. There are so many ways that one can have their Cloud Infrastructure designed.
