NSX Bytes – Controller Deployment Gotchyas

There are a lot of great posts already out there covering the install and configuration of NSX. Rather than reinvent the wheel, I’ve decided to do a series of NSX Bytes relating to a couple of gotchyas I’ve come across during the config stage. This post focuses on the NSX Controller deployment, which provides the control plane that distributes network information to the hosts in your VXLAN Transport Zone.

To get to this part you will have already deployed the NSX Manager VM and prepared the management network that you specify in the IP Pool setup of the deployment. It’s suggested that you deploy three NSX Controllers for HA and resiliency. If successful, the Management tab in the Networking & Security section of the vCenter Web Client should look like this:

On my first attempt I managed to deploy all three controllers without issue, however in my second lab I ran into a couple of issues that initially had me scratching my head. It must be noted that there isn’t much, if any, error feedback provided via the vCenter Web Client. To get more detail I enabled SSH on the NSX Manager (via its appliance management GUI) and tailed the manager log by running the command below.
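On the NSX Manager CLI that looks something along these lines (check the CLI reference for your NSX version, as the exact syntax may differ):

    show manager log follow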

The output is verbose but useful, and I’d encourage you to get familiar with it. There is also a Syslog setting that can send the logs out to an external monitoring system if you wish.
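As a rough sketch, the Syslog setting can also be driven through the NSX Manager appliance management API; the endpoint and XML body below are based on the NSX-V API guide, so treat them as an example to verify against your version, and note the manager hostname and syslog server address are placeholders:

    # Placeholder values: nsx-manager and 192.168.1.50 are examples only
    curl -k -u admin -X PUT \
      -H "Content-Type: application/xml" \
      -d '<syslogserver><syslogServer>192.168.1.50</syslogServer><port>514</port><protocol>UDP</protocol></syslogserver>' \
      https://nsx-manager/api/1.0/appliance-management/system/syslogserver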

ISSUE 1: No Host is Compatible with the Virtual Machine

After kicking off a new Controller build you see vCenter deploy the OVF template, reconfigure the VM, and then power off and delete the VM.

This was due to the fact that the resource requirements of the Controller template could not be met…specifically the vCPU count. The Controller appliance requires 4 vCPUs, but the ESXi hosts in my lab were only capable of running VMs with 2 vCPUs (1 socket, 2 cores), and because of that the deployment failed.

The key here is to ensure that the spec shown above can be met by your hosts. This issue should be restricted to lab hosts, but nonetheless it’s one to look out for just in case.
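If you want to confirm what a host can actually offer before kicking off the deployment, a quick look at its core and thread count from an SSH session is enough (a minimal check; the output fields vary slightly between ESXi versions):

    # Run on the ESXi host: reports the CPU packages, cores and threads available to VMs
    esxcli hardware cpu global get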

ISSUE 2: Controller VM Appears to Deploy Successfully, Then Gets Deleted

After kicking off a new Controller build you see the OVF template deploy and the VM power on. After about 10 minutes, if you launch the VM console you will see that the Controller is configured with the right IP Pool settings and reboots into a ready state. Then, without warning, the VM is shut down and deleted.

Checking through the Manager logs, you see this entry relating to the destruction of the Controller VM:

This one is actually pretty easy, and was user error on my part. The logs clearly state that there was a timeout waiting for the controller to be ready. This was due to the wrong Connected To network being selected during the Add Controller phase. This network must be able to reach the NSX Manager and the vSphere components…again, an obvious error once you view the logs, but initially it just appeared as though vCenter, via the NSX service account, was deleting the VM for the hell of it.
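Once a controller lands on the correct network you can also sanity-check it against the NSX Manager API instead of waiting on the Web Client; the call below is a sketch using the NSX-V controller endpoint, with the admin credentials and manager address as placeholders:

    # Lists the deployed controllers and their current status
    curl -k -u admin https://nsx-manager/api/2.0/vdn/controller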


Note: For more detailed install guides, check out Chris Wahl’s post here and Anthony Burke’s post here to get up to speed on initial NSX Manager, Controller and Transport Zone prep.

9 comments

  • I’ve run into a similar situation, but the root cause was related to the CPU reservation value that the deployment attempted to set on the controller VM. It exceeded what my lab had to offer. I ended up using the API to set the value to “small” and edited the reservations manually to turn them down a few notches. Probably not a big deal in a beefy production environment.

    • Yea, I first assumed it was an Admission Control type of issue, but had enough resources in the lab that I looked elsewhere. I don’t expect too many to run into that one…kept me on the go for a bit.

      That API post you did was awesome by the way…as you mentioned above it gives a lot more granular control.

  • Yeah – Issue one is important. We have set the mandatory minimums at this level to ensure a guaranteed minimum level of service. Plus we also want to ensure the solution can scale.

    With that said, you can get away with installing one in a lab if you’re not testing controller-based failure scenarios. Just DO NOT do that in production 🙂

    And remember Ant, correct port group ;)!

  • Hmmm, I’m facing the 10 minute delete thingy: the VM is brought up, but it is unable to get configured (an LDR vShield app). I first thought it was an MTU issue (workstation-based lab setup, it takes some work to get 1600 running) but that was not it. Still looking for what the issue is.

  • Hi Carlos, how did you get MTU set up to 1600 in VMware Workstation?

    • Hey Carlos… All this was done in a nested ESXi lab.

    • Workstation is fine with 1600 MTU on internal switches, if (and only if) you change the interface type to vmxnet3. I’ve written elsewhere that external (would-be uplink) traffic is a no-go for large MTU in one direction (incoming, if memory serves), so you would be fine if everything is hosted on one machine. In any case, I ended up moving my test lab to a nested (ESXi as a host) setup.

  • Cristian Salgueiro

    My controllers are automatically deleted after creation due to connection problems. But there are no connection problems: the controller can ping vCenter, can ping NSX Manager, DNS is correctly configured, everything is in the same subnet, and still they are deleted after creation with the error about connectivity.

  • Thanks for the tip about the 4 vCPUs.
