Quick Fix: ESX 4.1 Host Stops Responding When iSCSI LUN is "pulled"

REMOVING DEAD PATHS IN ESX 4.1 (version 5 guidance: http://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vcli.examples.doc_50%2Fcli_manage_files.5.6.html)

Very quick post about a slightly sticky situation I found myself in this afternoon. I was decommissioning a service linked to a VM that had a number of VMDKs, one of which was located on a dedicated VMFS datastore. The guest OS also had a directly connected iSCSI LUN. I chose to delete the LUNs first and then move up the stack, removing the VMFS datastore and eventually the VM. To do this I simply went to the SAN and deleted the disk and disk group resource straight up (hence the “pulled” reference in the title).

Little did I know that ESX would have a small fit when I attempted any sort of reconfiguration or management on the VM. The first sign of trouble was when I attempted to restart the VM and noticed that the task in vCenter wasn’t progressing. At that point my Nagios/OpsView service checks against the ESX host began to time out and I lost connectivity to the host in the vCenter console. Restarting the ESX management agents didn’t help, and as this was very much a production host with production VMs on it, my first thought (an older way of thinking) of rebooting it wasn’t acceptable during core business/SLA hours.

As knowledge and confidence build with experience in and around ESX, I’ve come to use ESX(i) shell access more and more, so I jumped into SSH and had a look at what the vmkernel logs were saying.

Mar 11 17:55:55 esx03 vmkernel: 393:13:48:38.873 cpu8:4222)NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:55:55 esx03 vmkernel: 393:13:48:38.874 cpu12:4265)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:55:56 esx03 vmkernel: 393:13:48:39.873 cpu11:4223)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".
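
When a log is full of entries like these, it can help to count which devices the path selection plugin is failing on. The following is a minimal Python sketch, not part of the original troubleshooting; the function name `devices_without_path` is hypothetical, and it assumes the log lines look like the excerpt above.

```python
import re
from collections import Counter

# Captures the NAA identifier that follows the word "device" in a log line.
NAA_RE = re.compile(r'device "(naa\.[0-9a-f]+)')


def devices_without_path(log_lines):
    """Count 'Could not select path' warnings per NAA device ID."""
    counts = Counter()
    for line in log_lines:
        if "Could not select path" not in line:
            continue  # ignore lines that are not path selection failures
        match = NAA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The three lines from the excerpt above.
sample = [
    'Mar 11 17:55:55 esx03 vmkernel: 393:13:48:38.873 cpu8:4222)NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.6782bcb00014ebe60000035e4de4314c".',
    'Mar 11 17:55:55 esx03 vmkernel: 393:13:48:38.874 cpu12:4265)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".',
    'Mar 11 17:55:56 esx03 vmkernel: 393:13:48:39.873 cpu11:4223)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".',
]

if __name__ == "__main__":
    for device, hits in devices_without_path(sample).items():
        print(device, hits)
```

A single device ID showing up again and again, as here, points at one missing LUN rather than a general storage outage.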

So from the logs it was obvious the system was having major issues (re)connecting to the device I had just pulled out from under it. On the other hosts in the cluster the datastore was greyed out and I was unable to delete it from the storage configuration. A rescan of the HBAs removed the dead datastore from the storage list on those hosts, so if I had still had vCenter access to this host, a simple rescan should have sorted things out. Moving to the command line of the host in question, I ran the esxcfg-rescan command:

# esxcfg-rescan vmhba39
Dead path vmhba39:C1:T0:L3 for device naa.6782bcb00014ebe60000035e4de4314c not removed.
Device is in use by worlds:
 World # of Handles Name
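
If you are collecting this output from several hosts, the dead path line can be split into its adapter, channel, target and LUN parts. This is a minimal Python sketch that assumes the message format shown above; `parse_dead_paths` is a hypothetical helper name, not an ESX tool.

```python
import re

# Matches lines such as:
#   Dead path vmhba39:C1:T0:L3 for device naa.... not removed.
DEAD_PATH_RE = re.compile(
    r"Dead path (?P<adapter>vmhba\d+):C(?P<channel>\d+):T(?P<target>\d+):L(?P<lun>\d+)"
    r" for device (?P<device>naa\.[0-9a-f]+) not removed"
)


def parse_dead_paths(output):
    """Return one dict per dead path reported in the rescan output."""
    results = []
    for line in output.splitlines():
        match = DEAD_PATH_RE.search(line)
        if not match:
            continue
        info = match.groupdict()
        # Channel, target and LUN are numeric; the rest stay as strings.
        for key in ("channel", "target", "lun"):
            info[key] = int(info[key])
        results.append(info)
    return results


rescan_output = """\
Dead path vmhba39:C1:T0:L3 for device naa.6782bcb00014ebe60000035e4de4314c not removed.
Device is in use by worlds:
 World # of Handles Name
"""

if __name__ == "__main__":
    for path in parse_dead_paths(rescan_output):
        print(path)
```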

And at the same time, while tailing the vmkernel logs, I saw the following entries:

==> vmkernel <==
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.768 cpu13:4118)Vol3: 644: Could not open device 'naa.6782bcb00014ebe60000035e4de4314c:1' for volume open: I/O error
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.768 cpu13:4118)FSS: 735: Failed to get object f530 28 1 4de4a1f8 3002130c 21000ff6 5abda09b 0 0 0 0 0 0 0 :I/O error
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.768 cpu13:4118)WARNING: Fil3: 1987: Failed to reserve volume f530 28 1 4de4a1f8 3002130c 21000ff6 5abda09b 0 0 0 0 0 0 0
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.768 cpu13:4118)FSS: 735: Failed to get object f530 28 2 4de4a1f8 3002130c 21000ff6 5abda09b 4 1 0 0 0 0 0 :I/O error
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.769 cpu0:4096)VMNIX: VMKFS: 2561: status = -5
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.873 cpu9:45315)NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.874 cpu15:4265)WARNING: NMP: nmpDeviceAttemptFailover: Retry world restore device "naa.6782bcb00014ebe60000035e4de4314c" - no more commands to retry
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)WARNING: ScsiCore: 1399: Invalid sense buffer: error=0x0, valid=0x0, segment=0x0, key=0x2
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.6782bcb00014ebe60000035e4de4314c" due to Not found
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)ScsiDeviceIO: 1672: Command 0x1a to device "naa.6782bcb00014ebe60000035e4de4314c" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)WARNING: ScsiDeviceIO: 5172: READ CAPACITY on device "naa.6782bcb00014ebe60000035e4de4314c" from Plugin "NMP" failed. I/O error
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)Vol3: 644: Could not open device 'naa.6782bcb00014ebe60000035e4de4314c:1' for volume open: I/O error
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)FSS: 3924: No FS driver claimed device 'naa.6782bcb00014ebe60000035e4de4314c:1': Not supported
Mar 11 17:57:18 esx03 vmkernel: 393:13:50:02.431 cpu15:40621)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:57:18 esx03 vmkernel: 393:13:50:02.431 cpu15:40621)NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.6782bcb00014ebe60000035e4de4314c".
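
The ScsiDeviceIO entry in this block carries status codes that can be read without guessing. Below is a minimal Python sketch that decodes them, with `decode_scsi_failure` as a hypothetical helper name. The lookup tables are deliberately partial: they are my own reading of the standard SCSI opcode and sense values and of VMware's published host status codes, not something taken from the post. ESX itself labels the sense bytes as only "possible" sense data, so treat the result as a hint.

```python
import re

# Partial lookup tables (assumption: standard SCSI and VMware host status values).
OPCODES = {0x1A: "MODE SENSE(6)", 0x25: "READ CAPACITY(10)"}
HOST_STATUS = {0x0: "OK", 0x1: "NO_CONNECT"}
SENSE_KEYS = {0x2: "NOT READY", 0x5: "ILLEGAL REQUEST", 0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x3A, 0x00): "MEDIUM NOT PRESENT"}

STATUS_RE = re.compile(
    r'Command (0x[0-9a-f]+) to device "(naa\.[0-9a-f]+)" failed '
    r"H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+) "
    r"Possible sense data: (0x[0-9a-f]+) (0x[0-9a-f]+) (0x[0-9a-f]+)"
)


def decode_scsi_failure(line):
    """Decode one 'Command ... failed H:.. D:.. P:..' line, or return None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    opcode = int(match.group(1), 16)
    device = match.group(2)
    host, dev, plugin, key, asc, ascq = (int(v, 16) for v in match.groups()[2:])
    return {
        "device": device,
        "command": OPCODES.get(opcode, hex(opcode)),
        "host_status": HOST_STATUS.get(host, hex(host)),
        "device_status": dev,
        "plugin_status": plugin,
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:#x}/{ascq:#x}"),
    }


line = (
    'ScsiDeviceIO: 1672: Command 0x1a to device "naa.6782bcb00014ebe60000035e4de4314c" '
    "failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0."
)

if __name__ == "__main__":
    print(decode_scsi_failure(line))
```

Read this way, the entry says the host could not connect to the device at all, which fits a LUN that has been deleted on the SAN side.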

From tailing those logs, the rescan basically detected that the path in question was in use (bound to a datastore where a VMDK was attached to a VM) and reported the “Device is in use by worlds” error. The errors also highlight dead paths caused by me removing the LUN while it was in use. The point at which the host went into a spin (seen as the “Could not select path for device” entries in the vmkernel log) was when I attempted to switch on the VM and the host, still thinking it had access to the VMDK, tried to access all disks.

So, lesson learnt: when decommissioning VMFS datastores, don’t pull the LUN from under ESX. Remove it gracefully from vSphere first, and then you are free to delete it on the SAN.
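
The ordering in that lesson can be written down as a simple pre-flight check. This is a minimal Python sketch over a hand-filled data model: `DatastoreState`, `blockers_before_lun_delete` and the datastore and VM names are all hypothetical, and nothing here talks to vSphere or the SAN.

```python
from dataclasses import dataclass, field


@dataclass
class DatastoreState:
    """Hypothetical snapshot of one datastore, filled in by hand or by tooling."""
    name: str
    attached_vms: list = field(default_factory=list)      # VMs with a VMDK on it
    mounted_on_hosts: list = field(default_factory=list)  # hosts that still see it


def blockers_before_lun_delete(state):
    """List everything that must be cleaned up before the LUN is deleted on the SAN."""
    blockers = []
    for vm in state.attached_vms:
        blockers.append(f"VM {vm} still has a disk on {state.name}")
    for host in state.mounted_on_hosts:
        blockers.append(f"host {host} still has {state.name} mounted")
    return blockers


if __name__ == "__main__":
    # Roughly the situation in this post: a VM disk and a host mount both remain.
    state = DatastoreState("decom-datastore", ["decom-vm"], ["esx03"])
    for problem in blockers_before_lun_delete(state):
        print(problem)
```

An empty list is the signal that the LUN can go; anything else means work remains on the vSphere side first.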

3 Comments (archived)

Quick Post: Removing Datastore Tags and Mounts with PowerCLI - VIRTUALIZATION IS LIFE! (7 July 2015)
[…] through my post archive I came across this entry from 2013 that (while relating to ESXi 4.1) shows you that there can be bad consequences if you pull a LUN […]
Jacques E (9 April 2016)
Just want to find out how you actually got the datastore removed. I’m sitting with ESX 4.1 and inactive datastores because their SAN was removed before the datastores were removed, and I can’t get them to go away. And because it’s 4.1, some API commands are not available.
fjdfkd (4 November 2016)
Lol, is this article a joke? What did you do to fix it? Errors are there after reboot for me.