NSX Bytes: Important Bug in 6.2.4 to be Aware of

[UPDATE] In light of this post being quoted on The Register I wanted to clarify a couple of things. First off, as mentioned there is a fix for this issue (the KB should be rewritten to clearly state that) and secondly, if you read below, you will see that I did not state that just about anyone running NSX-v 6.2.4 will be impacted. Greenfield deployments are not impacted.

Here we go again…I thought maybe we were over these, but it looks like NSX-v 6.2.4 contains a fairly serious bug impacting VMs after vMotion operations. I had intended to write about this earlier in the week when I first became aware of the issue, however the last couple of days have gotten away from me. That said, please be aware of this issue as it will impact those who have upgraded NSX-v from 6.1.x to 6.2.4.

As the KB states, the issue appears if you have the Distributed Firewall enabled (it’s enabled and inline by default) and you have upgraded NSX-v from 6.1.x to 6.2.3 or above, though for most this will apply to 6.2.4 upgrades given all the issues in 6.2.3. If VMs are migrated between upgraded hosts they will lose network connectivity and require a reboot to restore it.

If you check the vmkernel.log file you will see entries similar to those below.

2016-07-01T07:02:37.357Z cpu7:223405)WARNING: NetDVS: 547: portAlias is NULL
2016-07-01T07:02:37.357Z cpu7:223405)Net: 2312: connected VM eth0 to VM Network, portID 0x200000c
2016-07-01T07:02:37.362Z cpu7:223405)PFImportState: unsupported version: 0
2016-07-01T07:02:37.363Z cpu7:223405)vsip VSIPDVFRestoreState:1912: Failed to restore PF state : Limit exceeded
2016-07-01T07:02:37.363Z cpu7:223405)WARNING: NetPort: 1431: failed to enable port 0x200000c: Failure
2016-07-01T07:02:37.363Z cpu7:223405)NetPort: 1632: disabled port 0x200000c
2016-07-01T07:02:37.363Z cpu7:223405)WARNING: Net: vm 223391: 5353: cannot enable port 0x200000c: Failure
2016-07-01T07:02:37.383Z cpu7:223405)Net: 3354: disconnected client from port 0x200000c
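
If you want to check a host's log for this signature without eyeballing it, something along these lines should do. This is only a rough sketch: the match strings are taken from the entries above, while the function name and the sample data are my own for illustration, and the exact message wording could differ between builds.

```python
import re

# Message fragments copied from the log excerpt above (assumed stable across builds).
SIGNATURES = (
    "PFImportState: unsupported version",
    "Failed to restore PF state",
)
# Captures the port ID from the "failed to enable port" warning.
PORT_FAILURE = re.compile(r"failed to enable port (0x[0-9a-fA-F]+)")


def scan_vmkernel_log(lines):
    """Return (signature_hits, failed_ports) found in an iterable of log lines."""
    signature_hits = []
    failed_ports = []
    for line in lines:
        if any(sig in line for sig in SIGNATURES):
            signature_hits.append(line.strip())
        match = PORT_FAILURE.search(line)
        if match and match.group(1) not in failed_ports:
            failed_ports.append(match.group(1))
    return signature_hits, failed_ports


# Three of the lines from the excerpt above, used as sample input.
SAMPLE = """\
2016-07-01T07:02:37.362Z cpu7:223405)PFImportState: unsupported version: 0
2016-07-01T07:02:37.363Z cpu7:223405)vsip VSIPDVFRestoreState:1912: Failed to restore PF state : Limit exceeded
2016-07-01T07:02:37.363Z cpu7:223405)WARNING: NetPort: 1431: failed to enable port 0x200000c: Failure
""".splitlines()

if __name__ == "__main__":
    hits, ports = scan_vmkernel_log(SAMPLE)
    for hit in hits:
        print(hit)
    print("Ports that failed to enable:", ports)
```

To run it against a real host, pass in the lines of a copy of the log instead of SAMPLE (on ESXi it normally lives at /var/log/vmkernel.log). Any hits that line up with a recent vMotion are a good reason to go and read the KB.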

Cause

This issue occurs when the VSIP module at the kernel level does not handle the export_version deployed in NSX for vSphere 6.1.x correctly during the upgrade process.

There is no current resolution to the issue apart from the VM reboot, but there is a workaround in the form of a script that can be obtained via GSS if you reference KB2146171. Hopefully there will be a proper fix in future NSX releases.

<RANT>

I can’t believe something as serious as this was missed by QA for what is VMware’s flagship product. It’s beyond me that this sort of error wasn’t picked up in testing before it was released. It’s simply not good enough that a major release goes out with this sort of bug and I don’t know how it keeps on happening. This one specifically impacted customers, and for service providers or enterprises that upgraded in good faith it puts egg on the faces of those who approve, update and execute the upgrades, and results in unhappy customers or internal users.

Most organisations can’t fully replicate production situations when testing upgrades due to a lack of resources or a lack of real-world scenario testing…VMware could and should have the resources to stop these bugs leaking into release builds. For now, if possible, I would suggest that people add more stringent vMotion tests as part of NSX-v lab testing before promoting into production moving forward.

VMware customers shouldn’t have to be the ones discovering these bugs!

</RANT>

[UPDATE] While I am obviously not happy about this issue coming in the wake of previous issues, I still believe in NSX and would recommend that all shops looking to automate networking still have faith in what the platform offers. Bugs will happen…I get that, but I know in the long run there is huge benefit in running NSX.

References:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2146171

VIRTUALIZATION IS LIFE!