Quick Fix: ESX 4.1 Host Stops Responding When iSCSI LUN is “pulled”

REMOVING DEAD PATHS IN ESX4.1 (version 5 guidance here)

A very quick post about a slightly sticky situation I found myself in this afternoon. I was decommissioning a service tied to a VM that had a number of VMDKs, one of which was located on a dedicated VMFS Datastore…the guest OS also had a directly connected iSCSI LUN.

I chose to delete the LUNs first and then move up the stack, removing the VMFS and eventually the VM. To do this I simply went to the SAN and deleted the disk and disk group resource straight up! (hence the “pulled” reference in the title) Little did I know that ESX would have a small fit when I attempted any sort of reconfiguration or management on the VM. The first sign of trouble was when I attempted to restart the VM and noticed that the task in vCenter wasn’t progressing. At that point my Nagios/OpsView Service Checks against the ESX host began to time out and I lost connectivity to the host in the vCenter Console.

Restarting the ESX management agents wasn’t helping, and as this was very much a production host with production VMs on it, my first thought (an older way of thinking) of rebooting it wasn’t acceptable during core business/SLA hours. As knowledge and confidence build with experience in and around ESX, I’ve come to use the ESX(i) shell access more and more…so I jumped into SSH and had a look at what the vmkernel logs were saying.

From the logs it was obvious the system was having major issues (re)connecting to the device I had just pulled out from under it. On the other hosts in the Cluster the datastore was greyed out and I was unable to delete it from the Storage Config. A re-scan of the HBAs on those hosts removed the dead datastore from the storage list, so if I had still had vCenter access to the affected host a simple re-scan should have sorted things out. Moving to the command line of the host in question I ran the esxcfg-rescan command:
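The exact invocation isn't reproduced here, so treat the following as a minimal sketch rather than what was typed on the day. It assumes a hypothetical adapter name, vmhba33 (esxcfg-scsidevs -a lists the real ones on a host), and it only attempts the rescan where the ESX tooling is actually present:

```shell
# Hypothetical adapter name; list the real ones with: esxcfg-scsidevs -a
ADAPTER="${ADAPTER:-vmhba33}"

rescan_adapter() {
    # Only attempt the rescan where the ESX tooling is actually installed.
    if command -v esxcfg-rescan >/dev/null 2>&1; then
        esxcfg-rescan "$1"
    else
        echo "esxcfg-rescan not found; run this on the ESX service console"
        return 1
    fi
}

# Do not abort the surrounding shell if the rescan is unavailable or fails.
rescan_adapter "$ADAPTER" || true
```

The adapter name can be overridden without editing the snippet by setting ADAPTER in the environment first.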

At the same time, while tailing the vmkernel logs, I saw the following entries:

From tailing through those logs, the rescan basically detected that the path in question was in use (bound to a datastore where a VMDK was attached to a VM), reporting the “Device is in use by Worlds” error. The errors also highlight dead paths caused by me removing the LUN while it was in use.
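If you want to pick just these messages out of a busy log, a small filter helps. This is a sketch under assumptions: the log path /var/log/vmkernel is assumed for classic ESX 4.1, find_dead_path_errors is a made-up helper name, and the match is case-insensitive because the capitalisation in the real log may differ from how the messages are quoted in this post.

```shell
# Path assumed for classic ESX 4.1; adjust if your host logs elsewhere.
VMKERNEL_LOG="${VMKERNEL_LOG:-/var/log/vmkernel}"

# Print only the lines carrying the two messages quoted in this post.
# -i: the capitalisation in the real log may differ from the quotes.
find_dead_path_errors() {
    grep -i -E 'Device is in use by Worlds|Could not select path for device' "$1"
}

# Show the most recent matches, if the log is readable on this machine.
if [ -r "$VMKERNEL_LOG" ]; then
    find_dead_path_errors "$VMKERNEL_LOG" | tail -n 20 || true
fi
```

Swapping the last command for tail -f "$VMKERNEL_LOG" piped into the same grep expression gives a live view while a rescan runs.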

The point at which the host went into a spin (visible as “Could not select Path for device” in the vmkernel log) was when I attempted to switch on the VM and the host, still thinking it had access to the VMDK, tried to access all of its disks.

So lesson learnt. When decommissioning VMFS datastores, don’t pull the LUN from under ESX…remove it gracefully first from vSphere and then you are free to delete on the SAN.