NSX Bytes: Important Bug in 6.2.4 to be Aware of

[UPDATE] In light of this post being quoted on The Register I wanted to clarify a couple of things. First off, as mentioned there is a fix for this issue (the KB should be rewritten to clearly state that) and secondly, if you read below, you will see that I did not state that just about anyone running NSX-v 6.2.4 will be impacted. Greenfield deployments are not impacted.

Here we go again… I thought maybe we were over these, but it looks like NSX-v 6.2.4 contains a fairly serious bug impacting VMs after vMotion operations. I had intended to write about this earlier in the week when I first became aware of the issue, however the last couple of days have gotten away from me. That said, please be aware of this issue as it will impact those who have upgraded NSX-v from 6.1.x to 6.2.4.

As the KB states, the issue appears if you have the Distributed Firewall enabled (it’s enabled and inline by default) and you have upgraded NSX-v from 6.1.x to 6.2.3 or above, though for most this will apply to 6.2.4 upgrades given all the issues in 6.2.3. If VMs are migrated between upgraded hosts they will lose network connectivity and require a reboot to bring back connectivity.

If you check the vmkernel.log file you will see entries similar to those below.

Cause

This issue occurs when the VSIP module at the kernel level does not handle the export_version deployed in NSX for vSphere 6.1.x correctly during the upgrade process.

There is no current resolution to the issue apart from the VM reboot, but there is a workaround in the form of a script that can be obtained from GSS if you reference KB2146171. Hopefully there will be a proper fix in future NSX releases.
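If you want a quick way to check a host for the symptom, here is a minimal Python sketch that scans vmkernel.log text for the "PFImportState: unsupported version" message quoted in the comments below. It is not the GSS script, and the exact wording and layout of the real log lines is an assumption, as are the function names, so treat it as a starting point only.

```python
from typing import Iterable, List

# Marker text taken from the comment thread. The exact wording and layout of
# real vmkernel.log entries is an assumption; adjust it to match what you see.
MARKER = "PFImportState: unsupported version"


def find_import_errors(lines: Iterable[str], marker: str = MARKER) -> List[str]:
    """Return the log lines that contain the marker (case-insensitive)."""
    needle = marker.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def scan_log_file(path: str) -> List[str]:
    """Scan a copy of a host's vmkernel.log; the caller supplies the path."""
    with open(path, "r", errors="replace") as handle:
        return find_import_errors(handle)
```

Running `scan_log_file` against a copy of the log after a test vMotion in the lab returns any matching lines; an empty list only means the marker text was not found, not that the host is definitely unaffected.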

<RANT>

I can’t believe something as serious as this was missed by QA for what is VMware’s flagship product. It’s beyond me that this sort of error wasn’t picked up in testing before it was released. It’s simply not good enough that a major release goes out with this sort of bug and I don’t know how it keeps on happening. This one specifically impacted customers, and for service providers or enterprises that upgraded in good faith, it puts egg on the faces of those who approve, update and execute the upgrades, and results in unhappy customers or internal users.

Most organisations can’t fully replicate production situations when testing upgrades due to lack of resources or lack of real world situation testing… VMware could and should have the resources to stop these bugs leaking into release builds. For now, if possible, I would suggest that people add more stringent vMotion tests as part of NSX-v lab testing before promoting into production.

VMware customers shouldn’t have to be the ones discovering these bugs!

</RANT>

[UPDATE] While I am obviously not happy about this issue coming in the wake of previous issues, I still believe in NSX and would recommend all shops looking to automate networking still have faith in what the platform offers. Bugs will happen… I get that, but I know in the long run there is huge benefit in running NSX.

References:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2146171

7 comments

  • Joe Phillips

    Hear hear! I like the product line and the capabilities it brings to a virtualized environment, but QA has been a major bugaboo for NSX for quite some time now. Shouldn’t keep happening, but it seems to over and over again.

  • Thanks for showing this important bug in NSX-v 6.2.4. I was not aware of it. I faced the same issue when we upgraded NSX-v from 6.1.x to 6.2.4, but at the time I did not know about this. It’s simply not good enough that a major release goes out with this sort of bug. Is there any permanent solution for this issue? If you find something please let me know. I think I also need to raise a ticket with the VMware technical support team. What do you suggest?

  • @Dale: Thanks for your response. I will check.

  • Nelson Reyes

    An alternative that worked for me was to disable and re-enable DFW on all hosts. But it is better if you open a case with GSS.

  • Glynn Matthews

    We got stung with this (but going from version 6.2.2 to 6.2.4). The issue we had was that some hosts didn’t update their VSIP version correctly (even though the updater showed NSX as being all on the latest version). What happened then was that when a VM moved from an updated host to one of the affected hosts, the version was incompatible (same PFImportState: unsupported version: – but the number was different) and caused the NIC to drop (lose the tick). The initial workaround was to scan with a PowerShell script to find VMs with a dropped NIC and then reconnect it as quickly as possible. The proper fix was to manually move all VMs off the affected host and reboot it, which brought it to the correct version (the update was in the bootbank – it just didn’t get applied after the first reboot). I suspect something went awry with the update process via vSphere Web (the client timed out late in the process – I suspect this may have contributed to something not completing correctly).

    Very frustrating, VMware – some proper acknowledgement of the matter would be appreciated (some 2 months after the release date of 6.2.4). Why are there no real public KB articles reporting the issue? The VMware Engineer suggested they had internal KB references about the matter. Even a mail-out/broadcast to all NSX customers, advising them to hold off and to review the known scenarios that are affected by this serious flaw, would have helped.

    I had over 100 VMs affected by this (and some very annoyed clients). It undermines the confidence in the product and also my ability to perform this role.

    I’ll be far more cautious in the future about updating NSX.

  • Glynn Matthews

    In addition to this, the workaround we were told to do (to update each of the VM components to the latest version) was to add every VM to an exclusion list – then remove them (forcing the version to be re-written with the correct/latest structure).
