With the advent of Hosting and Cloud technologies, the pain once reserved for the enterprise or small business is now amplified on a much larger scale (dare I say webscale). Those of us working in the Service Provider industry now have much more responsibility to ensure continuity of an end-to-end, sometimes converged platform, on which outages and service disruption mean multiple businesses are affected at once.
Those who know me and follow my posts know that I am a fan of the saying…
Outages are like Assholes…Everybody has them!
I’ve worked in the Service Provider industry all my working life, so I have certainly had my fair share of outages. In my experience a hosting platform suffers a significant outage once every 18-24 months, with the most common causes of extended outages and service degradation being network or storage.
Certainly the most pain is felt at the storage layer. In my experience, Storage Area Networks and Backup Applications have been the cause of most of the outages and pain in my career, and if you search through Google for Service Provider outages over the last 10 or so years, the majority are traced back to storage issues.
For years we were at the mercy of the big Storage Vendors, dealing with the issues of monolithic SANs: a bunch of spinning disks with some cache drives, all sitting behind a couple of heads. More recently we have seen the emergence of new storage vendors who all seem to promise a world where IOPS are in the millions, latency is non-existent and (non)disruptive seems to be the general message. Platform Architects around the world are, to a certain extent, putting their trust in technology that…for the most part, has always had issues and seems to work “most” of the time.
The problem with putting our collective faith in what the vendors promise is that it only takes one bad outage to really screw things up. Generally when storage has issues…it really has issues. My experience is that, for all the vendor guarantees, you only get to truly validate one’s storage decision during the first bad outage. Typically things work well initially but struggle at scale, and this is the single biggest issue with storage today…reliability of performance from the first consumed block of data through to the last.
One thing that frustrates me is that we effectively fork out the cost of a small house (depending on where you live) for storage, yet the industry seems to accept storage issues as a way of life…there is no doubt we are at the mercy of the vendors. They promise the world, take the cash and wait for the inevitable support call. This is where the IT Professionals who are responsible for the management and operations of these platforms are on a hiding to nothing. No matter which way it’s spun, an outage falls on the Service Provider. Even when it’s 100% an issue within the storage vendor’s system, the end users couldn’t care less…and unless there is public shaming, storage vendors will move on to the next hundred 200K sales.
To be fair, we are currently going through an exciting time in the storage industry, where there is more choice than ever and more players are coming into the market trying to solve the issues that have plagued platforms for years. A lot of my industry peers and mates work for the new breed of Scale Out, AFA, hyperconverged and Flash Cache vendors…but the fact still remains that there is risk in every purchase decision, and ultimately the proof is in the pudding.
Even though at some level we love the pain…surely there is a future where storage-related outages are less commonplace.