In doing some testing around NSX Edge deployment scenarios I came across a small quirk in the High Availability config for the NSX Edge Gateway, whereby after configuring HA from either the Web Client or through the REST API you will see the High Availability Status as Down in the Web Client, even though it's Enabled and you have two Edge Appliances deployed.
If you go to the CLI of either of the deployed Edge Appliances and run the show service highavailability command you will get the response shown below:
I did a search for Highavailability Healthcheck server is stopped and didn't get any hits, hence this post to specifically tackle that message. Looking back through my earlier post on Edge Gateway HA (http://anthonyspiteri.net/nsx-edge-vs-vshield-edge-part-2-high-availability/) I did make note of the fact that you need at least one vNIC configured.
So, while not so much a quirk as a case of by design, the Edge High Availability service will only kick in once the first internal vNIC has been added and configured. If you enabled HA after doing the initial interface configuration you won't have this issue, as during the HA setup you are asked which vNIC to use. If you enable HA without a vNIC configured, the service won't kick in until that vNIC is in play. Once this has been done the HA service kicks in and configures both Edges. If you run the show service highavailability command again you should now see the Highavailability Status as Running, along with details of the HA configuration of the NSX Edge pair.
Looking back at the Web Client you will now see the High Availability Service as Up.
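For those scripting this rather than clicking through the Web Client, below is a minimal sketch of building the HA config payload for the NSX-v REST API (PUT /api/4.0/edges/{edgeId}/highavailability/config). The element names follow the NSX-v API guide, but treat the exact schema as an assumption and verify it against your NSX version.

```python
# Sketch: assemble the highAvailability payload for
# PUT /api/4.0/edges/{edgeId}/highavailability/config.
# Schema assumed from the NSX-v API guide; verify per version.
import xml.etree.ElementTree as ET

def build_ha_config(vnic_index: int, declare_dead_time: int = 15) -> bytes:
    ha = ET.Element("highAvailability")
    # HA only starts once an internal vNIC exists, so the vnic index matters
    ET.SubElement(ha, "vnic").text = str(vnic_index)
    ET.SubElement(ha, "declareDeadTime").text = str(declare_dead_time)
    ET.SubElement(ha, "enabled").text = "true"
    return ET.tostring(ha)

payload = build_ha_config(vnic_index=1)
print(payload.decode())
```

The point of building the payload explicitly is that it makes the vNIC dependency visible: if the index you pass doesn't correspond to a configured internal interface, you end up in exactly the Down state described above.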
For more info on understanding HA and some more troubleshooting steps @gabe_rosas has a great post here:
Had a situation pop up today where an NSX Edge needed to be moved from its current location to another due to an initial VM placement issue. The VM was being moved within the bounds of the NSX Transport Zone, so there were no issues with it being out of scope. At first the VM was vMotioned from one cluster to another via the Web Client; however, when the VM was brought back online the NSX Edge status in the Web Client showed the error below:
While the VIX_E_TIMEOUT group of error messages is common for vShield and NSX Manager, the VM was up and actually passing traffic OK, though no config could be applied in this state.
Looking through the specific Edge System Events under the Monitor Tab you see:
First thing I tried was a Force Sync, which only served to reboot the Edge. Looking under the Manage Tab under NSX Edges and the NSX Edge Appliances, I saw that the Edge was listed as Deployed and reporting the correct Host and Datastore it had been vMotioned to. From here I attempted a Re-Deploy operation, but that only moved it back to its original location, albeit fixing the error and making it manageable again.
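For reference, both operations tried here map to NSX-v REST API actions (POST /api/4.0/edges/{edgeId}?action=forcesync and ?action=redeploy). The sketch below just assembles the request URLs; the manager hostname and edge ID are hypothetical placeholders.

```python
# Sketch: Force Sync and Re-Deploy as exposed by the NSX-v REST API.
# Hostname and edge ID are hypothetical placeholders.
def edge_action_url(manager: str, edge_id: str, action: str) -> str:
    if action not in ("forcesync", "redeploy"):
        raise ValueError("unsupported edge action")
    return f"https://{manager}/api/4.0/edges/{edge_id}?action={action}"

print(edge_action_url("nsx-manager.lab.local", "edge-1", "forcesync"))
```

Either URL is then POSTed to NSX Manager with your usual API credentials.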
I went back to the Manage Tab under NSX Edge Appliances and this time changed the Cluster and Datastore location of the Edge directly from this menu.
This triggered another Edge redeployment, but this time it was deployed to the desired location and the Edge was manageable again from the NSX Manager's perspective. Lesson learnt: don't vMotion NSX Edges, but take advantage of the fact that Edges are effectively stateless VMs that are highly transportable and have their configuration stored centrally for easy redeployment and recovery.
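The same Cluster/Datastore change can be driven via the NSX-v REST API (PUT /api/4.0/edges/{edgeId}/appliances), which triggers the redeploy to the new location. Below is a minimal sketch of the placement payload; the MoRef IDs are hypothetical placeholders and the element names are assumed from the NSX-v API guide, so verify against your version.

```python
# Sketch: appliance placement payload for
# PUT /api/4.0/edges/{edgeId}/appliances.
# "domain-c42" and "datastore-101" are hypothetical MoRef placeholders.
import xml.etree.ElementTree as ET

def build_appliance_placement(resource_pool_id: str, datastore_id: str) -> bytes:
    appliances = ET.Element("appliances")
    ET.SubElement(appliances, "applianceSize").text = "compact"
    appliance = ET.SubElement(appliances, "appliance")
    # Target cluster/resource pool and datastore MoRefs for the redeploy
    ET.SubElement(appliance, "resourcePoolId").text = resource_pool_id
    ET.SubElement(appliance, "datastoreId").text = datastore_id
    return ET.tostring(appliances)

payload = build_appliance_placement("domain-c42", "datastore-101")
print(payload.decode())
```

Driving the move this way keeps NSX Manager as the source of truth for where the Edge lives, which is exactly why it stays manageable afterwards, unlike a raw vMotion.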