NSX vCloud Retrofit: Controller Deployment Internal Server Error Has Occurred

During my initial work with NSX-v I was running various 6.0.x Builds together with vCD 5.5.2 and vCenter/ESXi 5.5 without issue. When NSX 6.1 was released I decided to clone off my base Production Environment to test a fresh deployment of 6.1.2 into a mature vCloud Director 5.5.2 instance that had vCNS running a VSM version of 5.5.3. When it came time to deploy the first Controller I received the following error from the Networking & Security section of the Web Client:

Looking at the vCenter Tasks and Events the NSX Manager did not even try to kick off the OFV Deployment of the Controller VM…the error it’s self wasn’t giving to much away and suspecting a GUI bug I attempted to deploy the Controller directly via the RestAPIs…this failed as well however the error returned was a little more detailed.

Looking through the NSX Manager Logs the corresponding System Log Message looked like this:

The Issue:

While the error it’s self is fairly straight forward to understand in that the a value being entered into the database table was not of the right type…the reasons behind why it had shown up in the 6.1.1 and 6.1.2 releases after having no such issue working with the 6.0.x releases stumped everyone involved in the ensuing Support Case. In fact it seemed like this was the only/first instance (at the time) of this error in all global NSX-v installs.

The Fix:

NOTE: This can only be performed by VMware Support via an SR.

The fix is a simple SQL Query to alter the KEY_VALUE_STORE referenced in the error…however this can only done done by VMware Support as it requires special access to the NSX Manager Operating System to commit the changes. A word of warning…If you happen to have the secret password to get into the back door of the NSX Manager and apply these changes…your support for NSX could become null and void!

Once that’s been committed and the NSX Manager Service restarted Controllers can be successfully deployed…again, the fix needs to be applied by VMware Support.

The RCA:

In regards to the RCA of this issue,  the customers having a long upgrade history (5.0 onwards) will hit the issue, since there was a db migration happening from 5.0 to 5.1.x upgrade the alter table script for KEY_VALUE_STORE was missing. As per VMware engineering a new upgrade on NSX Manager is not going to override the DB schema change, since there is no such script to alter table on the upgrade path.

There was no indication of this being fixed in subsequent NSX Releases and no real explanation as to why it didn’t happen in 6.0.x but that aside the fix works and can be actioned in 5 minutes.

This was a fairly unique situation that contributed to this bug being exposed…my environment was a a vCNS 5.1 -> 5.5 -> 5.5.2 -> 5.5.3.1 -> NSX 6.1 -> 6.1.1 -> 6.1.2 replica of one of our Production vCloud Zone that sits in our lab. Previously I’d been able to fully deploy NSX end to end using the same base systems which sat side by side in working order in a separate #NestedESXi Lab…but that was vCNS 5.5.2 upgraded to NSX 6.0.5 which was upgraded to 6.1.2.

So due to not too many deployments of NSX-v the issue only manifested in mature vCNS environments that where upgraded to 6.1.1 or 6.1.2. Something to look out for if you are looking at doing an NSX vCloud Retrofit. If you have a green fields site you will not come across the same issue.

Further Reading:

http://anthonyspiteri.net/nsx-bites-nsx-controller-deployment-issues/

This blog series extends my NSX Bytes Blog Posts to include a more detailed look at how to deploy NSX 6.1.x into an existing vCloud Director Environment. Initially we will be working with vCD 5.5.x which is the non SP Fork of vCD, but as soon as an upgrade path for 5.5.2 -> 5.6.x is released I’ll be including the NSX related improvements in that release.