PSOD Warning: IBM HS23 Blades, Emulex 10GbE Network Adapters and ESXi 5.1

This is a quick informational post to warn anyone running an Emulex-based 10GbE converged Ethernet adapter in an IBM BladeCenter with HS23 series blade servers on ESXi 5.x. The fix is below.

You are at risk of simultaneous host failures via the VMware PSOD (Purple Screen of Death), under conditions that are still unspecified, unless you update to the latest combination of firmware (FW) from IBM (Emulex) and be2net driver from VMware.

The trigger is still unclear, although there is some suggestion that it involves guest-based operations in combination with DVS environments. Searching online for PSOD + ESX + Emulex returns many results, and the issue seems to have been around since release. It is not limited to IBM blades, as there are reports of this happening on HP/Emulex platforms, but this post is specific to the IBM HS23 blades.

The KB links below suggest issues in FC or FCoE SAN management stacks, but in our situation we were running iSCSI software initiators with the following revision of Emulex FW and be2net driver. It should be noted that this combination had been stable for 7 months.

This Emulex KB suggests that the condition is triggered when frames sourced from the SAN management application are aborted. Once the abort occurs, the conditions for a PSOD exist. One example of an aborted frame is when the target does not respond to a request: the management application sends a read to a controller LUN, the controller LUN does not respond, and the driver then sends out an abort for that read command.

VMware KB2031192: As of the 6th of May, VMware suggests updating the Emulex FW to at least version 4.6.146.62. We struggled with IBM support to get access to an updated FW, 4.6.166.9 (direct download to FW Deployment ISO here), and VMware support also suggested upgrading the be2net driver to at least 4.6.142.10, which was released on the 6th of June (direct download here). We later confirmed with VMware that running a driver version slightly behind the FW version is OK.
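
If you have a number of hosts to audit, it can help to compare the installed versions against those minimums numerically instead of by eye. What follows is only a rough sketch in Python: the two minimum versions are the ones quoted above, while the function names and the way you collect each host's FW and driver version are my own assumptions, not anything from VMware, IBM or Emulex.

```python
# Minimum versions quoted in this post; everything else here is illustrative.
MIN_FIRMWARE = "4.6.146.62"
MIN_DRIVER = "4.6.142.10"


def parse_version(text):
    """Turn a dotted version such as '4.6.166.9' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at or above the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_host(firmware, driver):
    """Report whether a host's FW and be2net driver meet the minimums.

    Hypothetical helper: the caller supplies the version strings,
    however they were gathered from the host.
    """
    return {
        "firmware_ok": meets_minimum(firmware, MIN_FIRMWARE),
        "driver_ok": meets_minimum(driver, MIN_DRIVER),
    }


if __name__ == "__main__":
    # The combination we ended up running after the update.
    print(check_host("4.6.166.9", "4.6.142.10"))
```

The versions are compared as tuples of integers because a plain string comparison gets this wrong: as text, "4.6.99.1" sorts after "4.6.146.62", even though 99 is lower than 146.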

At this point our systems have been stable since applying the updates, but this is the second occurrence of the PSOD since the IBM Blade System went into production (within the last 7-8 months), and confidence in the platform has taken a hit. We, along with other users of this Emulex-based platform, will be hoping this is the last round of issues with this card. I struggle to understand how such a serious issue can be allowed to manifest, but this combination seems problematic!

More Links:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039912

http://emulex.force.com/knowledgebase/articles/Knowledgebase_Article/PSOD-Discovery-in-LightPulse-and-OneConnect-Adapters-on-ESX-ESXi-Using-SAN-or-Adapter-Management-Applications/

http://www.redbooks.ibm.com/abstracts/tips0828.html

https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014903697

http://communities.vmware.com/thread/422381?start=0&tstart=0

http://vstorage.wordpress.com/2010/04/25/ibm-bladecenter-virtual-fabric-solution-and-vsphere/