REMOVING DEAD PATHS IN ESX 4.1 AFTER A PULLED LUN (version 5 guidance here)
Very quick post in relation to a slightly sticky situation I found myself in this afternoon. I was decommissioning a service linked to a VM with a number of VMDKs, one of which was located on a dedicated VMFS datastore…the guest OS also had a directly connected iSCSI LUN.
I chose to delete the LUNs first and then move up the stack, removing the VMFS datastore and eventually the VM. To that end I simply went to the SAN and deleted the disk and disk group resource straight up! (hence the pulled reference in the title) Little did I know that ESX would have a small fit when I attempted any sort of reconfiguration or management of the VM. The first sign of trouble was when I attempted to restart the VM and noticed that the task in vCenter wasn't progressing. At that point my Nagios/OpsView service checks against the ESX host began to time out and I lost connectivity to the host in the vCenter console.
Restarting the ESX management agents wasn't helping, and as this was very much a production host with production VMs on it, my first (and older way of thinking) thought of rebooting it wasn't acceptable during core business/SLA hours. As knowledge and confidence build with experience in and around ESX I've come to use ESX(i) shell access more and more…so I jumped into SSH and had a look at what the vmkernel logs were saying.
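For reference, this is roughly what those two steps look like from the service console of a classic ESX 4.x host (the service names below apply to ESX with a service console, not ESXi):

# Restart the ESX management agents (hostd and the vCenter agent)
service mgmt-vmware restart
service vmware-vpxa restart

# Follow the vmkernel log live
tail -f /var/log/vmkernel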
Mar 11 17:55:55 esx03 vmkernel: 393:13:48:38.873 cpu8:4222)NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:55:55 esx03 vmkernel: 393:13:48:38.874 cpu12:4265)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:55:56 esx03 vmkernel: 393:13:48:39.873 cpu11:4223)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".
So from the logs it was obvious the system was having major issues (re)connecting to the device I had just pulled out from under it. On the other hosts in the cluster the datastore was greyed out and I was unable to delete it from the storage configuration, but a rescan of their HBAs removed the dead datastore from the storage list…so if I had still had vCenter access to this host, a simple rescan should have sorted things out. Moving to the command line of the host in question, I ran the esxcfg-rescan command:
[root@esx03 log]# esxcfg-rescan vmhba39
Dead path vmhba39:C1:T0:L3 for device naa.6782bcb00014ebe60000035e4de4314c not removed.
Device is in use by worlds:
World    # of Handles    Name
And at the same time, while tailing the vmkernel log, I saw the following entries:
==> vmkernel <==
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.768 cpu13:4118)Vol3: 644: Could not open device 'naa.6782bcb00014ebe60000035e4de4314c:1' for volume open: I/O error
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.768 cpu13:4118)FSS: 735: Failed to get object f530 28 1 4de4a1f8 3002130c 21000ff6 5abda09b 0 0 0 0 0 0 0 :I/O error
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.768 cpu13:4118)WARNING: Fil3: 1987: Failed to reserve volume f530 28 1 4de4a1f8 3002130c 21000ff6 5abda09b 0 0 0 0 0 0 0
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.768 cpu13:4118)FSS: 735: Failed to get object f530 28 2 4de4a1f8 3002130c 21000ff6 5abda09b 4 1 0 0 0 0 0 :I/O error
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.769 cpu0:4096)VMNIX: VMKFS: 2561: status = -5
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.873 cpu9:45315)NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:56:16 esx03 vmkernel: 393:13:48:59.874 cpu15:4265)WARNING: NMP: nmpDeviceAttemptFailover: Retry world restore device "naa.6782bcb00014ebe60000035e4de4314c" - no more commands to retry
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)WARNING: ScsiCore: 1399: Invalid sense buffer: error=0x0, valid=0x0, segment=0x0, key=0x2
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.6782bcb00014ebe60000035e4de4314c" due to Not found
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)ScsiDeviceIO: 1672: Command 0x1a to device "naa.6782bcb00014ebe60000035e4de4314c" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)WARNING: ScsiDeviceIO: 5172: READ CAPACITY on device "naa.6782bcb00014ebe60000035e4de4314c" from Plugin "NMP" failed. I/O error
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)Vol3: 644: Could not open device 'naa.6782bcb00014ebe60000035e4de4314c:1' for volume open: I/O error
Mar 11 17:56:16 esx03 vmkernel: 393:13:49:00.232 cpu15:4120)FSS: 3924: No FS driver claimed device 'naa.6782bcb00014ebe60000035e4de4314c:1': Not supported
Mar 11 17:57:18 esx03 vmkernel: 393:13:50:02.431 cpu15:40621)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.6782bcb00014ebe60000035e4de4314c".
Mar 11 17:57:18 esx03 vmkernel: 393:13:50:02.431 cpu15:40621)NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.6782bcb00014ebe60000035e4de4314c".
Tailing through those logs, it was clear the rescan had detected that the path in question was still in use (bound to a datastore with a VMDK attached to a VM), hence the "Device is in use by worlds" error. The errors also highlight the dead paths resulting from me removing the LUN while it was in use.
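If you need to work out exactly what is still holding a device, a couple of stock commands help; a quick sketch using the device ID from my logs:

# Map NAA device IDs to the VMFS datastores living on them
esxcfg-scsidevs -m

# List the world IDs of running VMs, to match against the worlds the rescan reported
vm-support -x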
The point at which the host went into a spin (seen as the "Could not select path for device" entries in the vmkernel log) was when I attempted to power the VM back on and the host, still thinking it had access to the VMDK, tried to open all of the VM's disks.
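Since the logs show the device claimed by the round-robin PSP (vmw_psp_rr), you can also ask NMP what it currently thinks of the device; an illustrative example with the device ID from this post:

# Show the NMP configuration, PSP and working paths for the dead device
esxcli nmp device list -d naa.6782bcb00014ebe60000035e4de4314c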
So, lesson learnt: when decommissioning VMFS datastores, don't pull the LUN out from under ESX…remove it gracefully from vSphere first, and then you are free to delete it on the SAN.
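As a rough sketch of that graceful order of operations on 4.1 (this assumes the VMDK has already been removed from the VM and the datastore deleted via the vSphere Client; the HBA and device names are the ones from this post):

# Rescan each HBA so the host drops its paths to the old LUN
esxcfg-rescan vmhba39

# Verify no paths to the device remain before touching the SAN
esxcfg-mpath -l -d naa.6782bcb00014ebe60000035e4de4314c

# Only once the device is gone, delete the LUN and disk group on the SAN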