For the last couple of weeks we have had some intermittent issues whereby ESXi network adapters have gone into a disconnected state, requiring a host reboot to bring the link back online. Generally it was only one NIC at a time, but in some circumstances both NICs went offline, resulting in host failure and VM HA events being triggered. From the console ESXi appeared to be up, but each NIC was listed as disconnected, and when we checked the switch ports there was no indication of a loss of link.
In the vmkernel logs, the following entries were observed:
2016-01-10T23:07:43.382Z cpu29:386390)netsched: NetSchedMClkPortQuiesce:4047: vmnic0: received a force quiesce for port 0x2000020
2016-01-10T23:07:43.382Z cpu29:386390)netsched: NetSchedMClkPortQuiesce:4047: vmnic1: received a force quiesce for port 0x2000020
2016-01-10T23:13:50.134Z cpu1:33502)<3>bnx2x: [bnx2x_attn_int_deasserted3:4816(vmnic0)]MC assert!
2016-01-10T23:13:50.134Z cpu1:33502)<3>bnx2x: [bnx2x_mc_assert:937(vmnic0)]XSTORM_ASSERT_LIST_INDEX 0x2
2016-01-10T23:13:50.134Z cpu1:33502)<3>bnx2x: [bnx2x_mc_assert:951(vmnic0)]XSTORM_ASSERT_INDEX 0x0 = 0x00020000 0x00010015 0x04aa05b4 0x00010053
2016-01-10T23:13:50.134Z cpu1:33502)<3>bnx2x: [bnx2x_mc_assert:965(vmnic0)]Chip Revision: everest3, FW Version: 7_10_51
2016-01-10T23:13:50.134Z cpu1:33502)<3>bnx2x: [bnx2x_attn_int_deasserted3:4822(vmnic0)]driver assert
2016-01-10T23:13:50.134Z cpu1:33502)<3>bnx2x: [bnx2x_panic_dump:1140(vmnic0)]begin crash dump -----------------
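If you want to check whether your own hosts are hitting the same condition, grepping the vmkernel log for the bnx2x asserts should surface it (assuming the default log location):
# grep bnx2x /var/log/vmkernel.log | grep -i assert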
After some time working with VMware Support, our Ops Engineer @santinidaniel came across this VMware KB which described the situation we were seeing. Interestingly enough, we only saw this happening after recent host updates to ESXi 5.5 Update 3 builds, but as the issue is listed as being present in ESXi 5, 5.5 and 6.0, that might just be a side note.
The cause as listed in the KB is:
This issue occurs when the guest virtual machine sends invalid metadata for TSO packets. The packet length is less than Maximum Segment Size (MSS), but the TSO bit is set. This causes the adapter and driver to go into a non-operational state.
Note: This issue occurs only with VXLAN configured and when there is heavy VXLAN traffic.
It just so happened that we did indeed have a large customer with heavily used Citrix Terminal Servers on our NSX Advanced Networking…and they were sitting on a VXLAN VirtualWire. The symptoms worsened today, coinciding with the first official day of work for the new year.
There is a simple workaround:
# esxcfg-module -s "enable_vxlan_ofld=0" bnx2x
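As a quick sanity check that the option has taken, you can read the module's option string back with esxcfg-module; the setting only takes effect after a host reboot. The output should look something along these lines:
# esxcfg-module -g bnx2x
bnx2x enabled = 1 options = 'enable_vxlan_ofld=0'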
That command has been described in blog posts relating to the Broadcom drivers (which now present as QLogic drivers), and where previously there was no resolution, there is now a fix in place by upgrading to the latest drivers here. Without upgrading to the latest certified drivers, the quickest way to avoid the issue is to apply the workaround and reboot the host.
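To see which bnx2x driver version a host is currently running before deciding whether to upgrade (standard esxcli commands, with vmnic0 as an example NIC), either of these will do:
# esxcli network nic get -n vmnic0
# esxcli software vib list | grep bnx2x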
There has been recent outcry bemoaning the lack of QA in some of VMware's latest releases, but the reality is that the more bits you add, the more likelihood there is for issues to pop up…This is becoming more the case with ESXi as the base virtualization platform continues to add to its feature set, which now includes VSAN baked in. Host extensions further add to the chance of things going wrong due to situations that are hard to test for as part of the QA process.
Deal, fix…and move on!