NSX vCloud Retrofit: Upgrade Issue – Edge Gateway Unmanageable in vCloud Director or Deployment Fails

We have been working with VMware GSS on an issue for a number of weeks whereby we were seeing some vShield Edge devices go into an unmanageable state from within the vCloud Director Portal. In a nutshell some VSEs (version 5.5.3) where stuck in a Configuring Loop upon the committal of a Service Config change. Subsequent reboots of the NSX Manager or vCloud Director Cells did not result in the VSE coming out of this state. While the VSE was not able to be managed from vCD the Edge Services where still functional…ie traffic was passing through and all existing rules and features where working as expected.

Looking at the vCD Logs the following entry was seen:

We also saw issues deploying some VSEs from vCloud Director where Deployment of edge gateways failed.

If the failed attempt was retried via a redeployment action the following was seen in the vCD logs with the vCD GUI stuck showing Reploying Edge in Progress

Heading over to the the NSX Manager logs we came across the following error log entry being constantly written to the system manager logs…in fact we were seeing this message pop up approximately 25,000 times a day across three NSX instances.

The VIX API:

The NSX Manager…and vShield Manager before it uses the VIX API to query vCenter and the ESXi Host running the Edge VMs via VMTools to query the status of the Edges. Tom Fojta has written a great article on the legacy VIX method and how its changed in NSX via a new Message Bus technique.

Searching for the VIX_E_FILE_NOT_FOUND error online It would seem that the NSX Manager was having issues talking to a subset of VSE 5.5.3 edges. It was noted by GSS that this was not happening for all VSEs and there were no instances of this happening on the NSX Edge Gateway’s (ESG 6.1.x). Storage was first suspected as being the cause of the issue, so we spent a good deal of time working through ESXi logs and Storage vMotioning the VSEs and NSX Managers to rule out storage. Once that was done, GSS took the case to the NSX Engineering team for further analysis. Engineering took an Export of one of my NSX Edges (uploading 10GB with of OVA is a challenge) to try and work out what was happening and why.

The Cause:

The VSE’s VM UUID as seen from the NSX Manager database somehow becomes different to that listed in the vCenter Inventory…causing the error messages.

The Fix:

There are a couple of options available to resolve the UUID Mismatch.

The self service workaround:
Attempt a redeployment of all VSEs that report the issue. You can get a list by grabbing logs from the NSX Manager and list down the vm-xxxxxx identifier as shown above. From there…head to vCD (Not the Networking & Security Edge section – this will redeploy NSX 6.1.2 Edges) and Click on Redeploy from the Edge Gateway Menu. The only risk with this is that the VSE might get stuck in a Redeploying state resulting in a time-out. Another thing to note with this option is the client services will be effected during the redeployment of the VSE while the new Edge is deployed and the config transferred across.

VMware GSS Database Fix:
If you are seeing these errors in your NSX Manager logs, raise an SR with VMware and they will execute a simple one line SQL Query to alter the UUID of the VMs that don’t match vCenter and update them. Once that’s done the errors go away and the potential for VSEs to go into this state is removed.

Further Info and RCA:

VMware GSS together with NSX Engineering are still investigating the cause of the issue but this seems to be a symptom (though not confirmed) of an in place vCNS to NSX Upgrade and there are no specific factors that seem to trigger this behaviour…the assumption is that this is a bug that comes into play after an upgrade from vCNS with existing VSE 5.5.3 Instances. It’s also interesting that the worst symptom of the issue (apart from the silly amount of logs generated) the VSE going into an unmanageable state or the deployment issue happens intermittently. There is no scientific reason why…but the trigger seems to be any action in vCD on a VSE (new or existing) that executes a config change…if this is done during a health check by the NSX Manager it could leave the VSE in the undesired state.

For those interested the version numbers where the issue was picked are are listed below.

Platform Versions:

  • vCenter 5.5 Update 2 Build 2001466
  • ESXi 5.5 Update 2 Build 2456374
  • vCloud Director 5.5.2 Builds 2000523 and 2233543
  • NSX-v 6.1.2 Build 2318232
  • VSE 5.5.3 Build 2175697

Leave a Reply