Monthly Archives: March 2013

Quick Fix: ESX 4.1 Host Stops Responding When iSCSI LUN is “pulled”

REMOVING DEAD PATHS IN ESX 4.1 (version 5 guidance here)

Very quick post in relation to a slightly sticky situation I found myself in this afternoon. I was decommissioning a service linked to a VM that had a number of VMDKs, one of which was located on a dedicated VMFS datastore…the guest OS also had a directly connected iSCSI LUN.

I chose to delete the LUNs first and then move up the stack, removing the VMFS and eventually the VM. To that end I simply went to the SAN and deleted the disk and disk group resource straight up! (hence the “pulled” reference in the title) Little did I know that ESX would have a small fit when I attempted to do any sort of reconfiguration or management on the VM. The first sign of trouble was when I attempted to restart the VM and noticed that the task in vCenter wasn’t progressing. At that point my Nagios/OpsView service checks against the ESX host began to time out and I lost connectivity to the host in the vCenter console.

Restarting the ESX management agents wasn’t helping, and as this was very much a production host with production VMs on it, my first (and older way of thinking) thought of rebooting it wasn’t acceptable during core business/SLA hours. As knowledge and confidence builds with experience in and around ESX I’ve come to use the ESX(i) shell access more and more…so I jumped into SSH and had a look at what the vmkernel logs were saying.
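(For anyone playing along at home, on a classic ESX 4.x host with a service console the vmkernel log is just a file under /var/log…the snippet below is indicative only, as ESXi keeps its logs in slightly different places, and the host name is a placeholder.)

```
# SSH to the affected host, then follow the vmkernel log
# (path as on classic ESX 4.x with a service console; ESXi logs live elsewhere)
ssh root@esx-host01        # host name is a placeholder
tail -f /var/log/vmkernel
```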

So from the logs it was obvious the system was having major issues (re)connecting to the device I had just pulled out from under it. On the other hosts in the cluster the datastore was greyed out and I was unable to delete it from the storage config. A re-scan of the HBAs removed the dead datastore from the storage list, so if I still had vCenter access to this host a simple re-scan should have sorted things out. Moving to the command line of the host in question I ran the esxcfg-rescan command:
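For reference, the command itself is about as simple as it gets…the adapter name below is purely illustrative (the software iSCSI adapter usually shows up as a high-numbered vmhba):

```
# Rescan a single adapter so the host drops devices that no longer exist
# (adapter name is illustrative -- esxcfg-scsidevs -a will list yours)
esxcfg-rescan vmhba32
```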

And at the same time, while tailing the vmkernel logs, I saw the following entries:

From tailing through those logs, the rescan basically detected that the path in question was in use (bound to a datastore where a VMDK was attached to a VM), reporting the “Device is in use by Worlds” error. The errors also highlight dead paths caused by my removing the LUN while it was in use.
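If you’d rather sanity-check the paths directly than trawl the logs, the multipath and device listings are the sort of thing to look at (I’m going from memory on the exact flags for 4.1, so double-check against the man pages):

```
# List every path and its state -- dead paths show up here once the LUN is pulled
esxcfg-mpath -l

# Map the devices the host still sees to their VMFS volumes
esxcfg-scsidevs -m
```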

The point at which the host went into a spin (as seen by the “Could not select Path for device” entries in the vmkernel log) was when I attempted to power on the VM and the host, still thinking it had access to the VMDK, tried to access all of its disks.

So lesson learnt. When decommissioning VMFS datastores, don’t pull the LUN from under ESX…remove it gracefully from vSphere first, and then you are free to delete it on the SAN.
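In practical terms, the order I should have followed looks something like this (the adapter name is a placeholder, and the first two steps are done from the vSphere client):

```
# 1. Remove/detach the VMDK from the VM and make sure nothing else lives on the datastore
# 2. Delete (or unmount) the datastore in the vSphere client while the LUN is still presented
# 3. Rescan the HBAs on every host in the cluster so the device is dropped cleanly
esxcfg-rescan vmhba32        # adapter name is a placeholder
esxcfg-scsidevs -m           # confirm the VMFS volume is gone from the host's view
# 4. Only now delete the LUN / disk group on the SAN
```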


DDoS Annihilation – What Can Service Providers Do?

Recently we experienced a series of DDoS attacks against client-hosted sites that resulted in varying levels of service outages to hosted services across a section of our hosting platform. In my 10+ years of working in the hosting industry this series of attacks was by far the most intense I’ve experienced, and certainly the most successful in terms of achieving the core goal of a DDoS.

On the one hand, as a collective you might think “…we had been lucky to avoid an attack up to this point”, while on the other hand you are dealing with the misguided expectations of clients that you are protected against such attacks. When you explain the realities of a DDoS to a customer who is expecting 100% up-time, the responses generally encountered are along the lines of “…I thought you said your service will never go down?” or “…I thought you have full redundancy?”

The absolute reality (that I have no problem in explaining to clients) is that most, if not all, service providers are pretty helpless against a DDoS, depending on the size and scale of the attack. In our case we were able to mitigate the service disruption by re-routing all traffic to the affected IP to a NULL route at our carrier edge, which relieved the load the firewall had been placed under…that load had caused its CPU to spike, making the DDoS successful in its end game.
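For what it’s worth, the null route itself is trivial to apply…the snippet below is a generic IOS-style example using a documentation address, not our actual config:

```
! Blackhole all traffic destined for the attacked address at the border/carrier edge
! (address is from the documentation range -- purely illustrative)
ip route 203.0.113.10 255.255.255.255 Null0
```

The obvious downside is that the null route finishes the attacker’s job for that particular IP, which is exactly why it’s a mitigation of last resort.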

So what can be done to mitigate the risk a DDoS presents? Service providers can look at spending money on extremely expensive IDS systems and/or larger-capacity routing and firewall devices that might only shield against an attack a little more effectively than less expensive options. As an example, if a firewall device is capable of 10,000 connections per second and 100,000 total connections, a DDoS will look to saturate its capability to the point where its memory and/or CPU resources are consumed trying to process the attack traffic…upgrading to a device capable of 20,000 connections per second and 200,000 total connections will only serve to buffer the resources that little bit longer, which might give you more time to mitigate the attack…but the point that’s made here is that…

…service provider resources will always come off second best if an attack is large enough.

And this is the really scary thing for service providers…if someone (an individual or organisation) wants to maliciously target your network and/or a client service hosted on your network, and they want to inflict maximum service disruption…the best that can be done is to mitigate where possible and ride it out.

There are a number of sites that track and list current and trending DDoS attack frequency and origin…one of the better ones I’ve come across is Prolexic’s real-time Attack Tracker, linked below.

Companies such as Prolexic generally provide services and physical devices that are linked to global networks which act to shield client networks from attacks, similar to the way SenderBase.org shields email users from obvious SPAM. In discussions with Steven Crockett (Anittel CTO) he described a service which effectively re-routes traffic at the upstream provider’s end through overseas carrier networks whose connectivity throughput allows otherwise crippling DDoS traffic to be filtered and cleaned before being sent on to its destination. This service isn’t site or service specific but involves routing entire subnets…so at this level it’s much more expensive and holistic than reverse proxy content delivery networks.

Working with a CDN will add protection in the form of a value-add service to current service offerings.

So what alternative measures can service providers take to add some level of protection to their key client/internal services? Unless the SP is loaded with more cash than it knows what to do with (at which point there is a case to scale out/upgrade the hosting platform itself), the only option is to utilize the services of bigger companies that run dedicated Content Delivery Networks.

CDN companies are popping up all over the internet, and while a company like Akamai has dominated the website caching market for many years, CDNs are becoming more the norm, whereby caching of static site content is making way for reverse proxy DNS redirection. In the wake of the DDoS attacks experienced recently I’ve been testing a couple of the better-known CDN providers. One of those is CloudFlare. The way that CloudFlare, or Amazon Web Services CloudFront, works is by taking over a website’s DNS records and using geo-routing to distribute visitors through their CDN network, which also filters out potential DDoS or other malicious traffic that would otherwise hit the origin web server.
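As a rough illustration, fronting a site this way mostly comes down to DNS…the zone snippet below is made up (hostnames and the CDN endpoint are placeholders), with a CloudFront-style CNAME out front and the origin record kept to one side:

```
; Illustrative zone snippet -- hostnames and CDN endpoint are placeholders
www.example.com.     300  IN  CNAME  d111111abcdef8.cloudfront.net.   ; visitors resolve to the CDN edge
origin.example.com.  300  IN  A      203.0.113.20                     ; origin web server, kept out of the public site records
```

Keeping the TTLs low is what makes it feasible to point the records straight back at the origin if the CDN itself has a bad day, which is relevant to the outage mentioned below.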

CDN services are generally charged on a usage basis, which commoditizes the service; however CloudFlare charge per site, with their business plans sitting around the $200 per month mark. For a service provider’s customer after added insurance against a DDoS, or even just looking to increase site responsiveness and performance, I believe it’s a no-brainer in the age of increasingly brutal DDoS attacks to offer these services as a value-add. At the end of the day, the more sites a service provider fronts with CDNs, the better their own hosting network will be able to deal with the inevitability of a DDoS.

One final point to make on going down the CDN path is to ensure that customers understand that their sites are still subject to downtime…this is best illustrated by CloudFlare’s recent outage on the 3rd of March 2013, due to a router bug propagated into their network during a routine DDoS prevention exercise. To their credit, they were very open and transparent about the root cause. While sites were offline for a period of time, there were options available to re-route the site DNS records back to the origin…such is the flexibility of offering a service like this to service provider clients.

A Hypothetical…

So what’s the title all about? DDoS Annihilation? In my opinion we are getting closer to DDoS events on such large scales that they will have the potential to take down the majority of service provider and carrier networks, which, in turn, will have huge social and economic impacts around the globe. We don’t have to wait for a Coronal Mass Ejection to black out the planet…a massive DDoS has the ability to inflict severe damage.

Near on 1 billion internet hosts used against us in a global DDoS?? No network has the ability to handle that!