NSX Bytes: Controller Deployment Gone Bad?

With NSX becoming more and more widely available there are more NSX home labs being stood up and with that the chances of the NSX Controllers failing due to “Home Lab” nested issues become more prevalent. The NSX Controllers are Ubuntu Linux VMs and like any Linux VM are fairly sensitive to storage latency and other issues that appear in #NestedESXi or lab environments.

In one of my labs I came across an issue where I needed to redeploy all the NSX Controllers due to the VMs effectively breaking due to the storage being ripped out from under them…however when I went to redeploy the latency of the underlying nested storage was still not that great and the deployment got stuck in a loop as shown below.

No matter what I tried…vCenter restart, NSX Manager Reboot or Host Reboot the end result was the status remaining in the spinning state. If I tried to deploy another controller I would get the following error.

Controller IP address allocation failed for reason : cluster already contains controller of IP x.x.x.x

In my case the VM existed with the IP address configured against the VM however I could not access the cli to check NSX Cluster Status due to the fact the VM was in a pretty bad way.

Taking a look at the IP Pool allocations…even though the error said that the IP was in use, it wasn’t listed as such…meaning it was trying to use the first IP in the pool regardless.

Before going into the fix, it should be noted that if this scenario was to happen, and you where down to your last controller in production you would be best served to call up VMware Support and work through the restore options as without any controllers your VXLAN Unicast traffic isn’t going to be updated via the VTEPS and things will eventually grind to a halt. It’s also worth reading the VMware Docs on what to do if even one Controller is lost in a cluster. If this is in a lab scenario…we can be a little harsher!

While the Controller status is spinning in a Deploying state you can’t interact with it via the Web Client. You need to turn to the API to delete the NSX Controller and start again or deploy a new cluster set. First you will need the CONTROLLER-ID which can be easily seen via the Web Client. To remove the controller you need to call the API below using the Delete method. If the stuck controller is the last one in the cluster you need to add the ?forceRemoval=True option at the end of the call.

Once complete you should get a 200 status and a job data ID. If you check back in at the Web Client you should see the Controller VM being deleted and it being removed from the list under Controller Nodes. We are now free of the Deploying Loop and can rebuild or extend the NSX Controller cluster as is appropriate.

References:

https://pubs.vmware.com/NSX-62/index.jsp?topic=%2Fcom.vmware.nsx.admin.doc%2FGUID-3A84E9D1-CAC0-41B1-B45C-E032B230DB49.html