Tag Archives: vShield

NSX Bytes: Updated – NSX Edge Feature and Performance Matrix

A question came up today around throughput numbers for an NSX Edge Services Gateway, and that jogged my memory back to a previous blog post where I compared features and performance metrics between vShield Edges and NSX Edges. In the original post I had left out some key metrics, specifically around firewall and load balancer throughput, so I thought it was time for an update. Thanks to a couple of people in the vExpert NSX Slack Channel I was able to fill some gaps and update the tables below.

A reminder that VMware has announced the End of Availability (“EOA”) of VMware vCloud Networking and Security 5.5.x, which kicked in on September 19, 2016, and that vCloud Director 8.10 does not support vShield Edges anymore…hence why I have removed the VSE from the tables.

As a refresher…what is an Edge device?

The Edge Services Gateway (NSX-v) connects isolated, stub networks to shared (uplink) networks by providing common gateway services such as DHCP, VPN, NAT, dynamic routing, and Load Balancing. Common deployments of Edges include in the DMZ, VPN Extranets, and multi-tenant Cloud environments where the Edge creates virtual boundaries for each tenant.

Below is a list of services provided by the NSX Edge.

Service | Description
Firewall | Supported rules include IP 5-tuple configuration with IP and port ranges for stateful inspection for all protocols
NAT | Separate controls for Source and Destination IP addresses, as well as port translation
DHCP | Configuration of IP pools, gateways, DNS servers, and search domains
Site to Site VPN | Uses standardized IPsec protocol settings to interoperate with all major VPN vendors
SSL VPN | SSL VPN-Plus enables remote users to connect securely to private networks behind a NSX Edge gateway
Load Balancing | Simple and dynamically configurable virtual IP addresses and server groups
High Availability | High availability ensures an active NSX Edge on the network in case the primary NSX Edge virtual machine is unavailable
Syslog | Syslog export for all services to remote servers
L2 VPN | Provides the ability to stretch your L2 network.
Dynamic Routing | Provides the necessary forwarding information between layer 2 broadcast domains, thereby allowing you to decrease layer 2 broadcast domains and improve network efficiency and scale. Provides North-South connectivity, thereby enabling tenants to access public networks.

Below is a table that shows the different sizes of each edge appliance and what (if any) impact that has to the performance of each service. As a disclaimer the below numbers have been cherry picked from different sources and are subject to change…I’ll keep them as up to date as possible.

Metric | NSX Edge (Compact) | NSX Edge (Large) | NSX Edge (Quad-Large) | NSX Edge (X-Large)
vCPU | 1 | 2 | 4 | 6
Memory | 512MB | 1GB | 1GB | 8GB
Disk | 512MB | 512MB | 512MB | 4.5GB
Interfaces | 10 | 10 | 10 | 10
Sub Interfaces (Trunk) | 200 | 200 | 200 | 200
NAT Rules | 2000 | 2000 | 2000 | 2000
FW Rules | 2000 | 2000 | 2000 | 2000
FW Performance | 3Gbps | 9.7Gbps | 9.7Gbps | 9.7Gbps
DHCP Pools | 25 | 25 | 25 | 25
Static Routes | 2048 | 2048 | 2048 | 2048
LB Pools | 64 | 64 | 64 | 64
LB Virtual Servers | 64 | 64 | 64 | 64
LB Server / Pool | 32 | 32 | 32 | 32
IPSec Tunnels | 512 | 1600 | 4096 | 6000
SSLVPN Tunnels | 50 | 100 | 100 | 1000
Concurrent Sessions | 64,000 | 1,000,000 | 1,000,000 | 1,000,000
Sessions/Second | 8,000 | 50,000 | 50,000 | 50,000
LB Throughput (L7 Proxy) | – | 2.2Gbps | 2.2Gbps | 3Gbps
LB Throughput (L4 Mode) | – | 6Gbps | 6Gbps | 6Gbps
LB Connections/s (L7 Proxy) | – | 46,000 | 50,000 | 50,000
LB Concurrent Connections (L7 Proxy) | – | 8,000 | 60,000 | 60,000
LB Connections/s (L4 Mode) | – | 50,000 | 50,000 | 50,000
LB Concurrent Connections (L4 Mode) | – | 600,000 | 1,000,000 | 1,000,000
BGP Routes | 20,000 | 50,000 | 250,000 | 250,000
BGP Neighbors | 10 | 20 | 50 | 50
BGP Routes Redistributed | No Limit | No Limit | No Limit | No Limit
OSPF Routes | 20,000 | 50,000 | 100,000 | 100,000
OSPF Adjacencies | 10 | 20 | 40 | 40
OSPF Routes Redistributed | 2000 | 5000 | 20,000 | 20,000
Total Routes | 20,000 | 50,000 | 250,000 | 250,000

Of interest, the table above doesn’t list any Load Balancing performance numbers for the NSX Compact Edge…take that to mean that if you want to do any sort of load balancing you will need NSX Large and above. To finish up, below is a table describing each NSX Edge size use case.

NSX Edge Size | Use Case
NSX Edge (Compact) | Small Deployment, POCs and single service use
NSX Edge (Large) | Small/Medium DC or multi-tenant
NSX Edge (Quad-Large) | High Throughput ECMP or High Performance Firewall
NSX Edge (X-Large) | L7 Load Balancing, Dedicated Core
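
To make the matrix a little more actionable, here is a minimal Python sketch that picks the smallest Edge size satisfying a set of requirements. The figures are copied from the performance table above; the class, field and function names are my own illustration, and treating Compact as "no load balancing" simply follows the observation that no LB numbers are published for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EdgeSize:
    name: str
    vcpu: int
    ipsec_tunnels: int
    sslvpn_tunnels: int
    concurrent_sessions: int
    supports_lb: bool  # assumption: Compact has no published LB figures, so treat it as unsuitable


# Ordered smallest to largest; numbers taken from the matrix in this post.
EDGE_SIZES = [
    EdgeSize("Compact", 1, 512, 50, 64_000, False),
    EdgeSize("Large", 2, 1600, 100, 1_000_000, True),
    EdgeSize("Quad-Large", 4, 4096, 100, 1_000_000, True),
    EdgeSize("X-Large", 6, 6000, 1000, 1_000_000, True),
]


def smallest_edge(ipsec_tunnels: int = 0,
                  sslvpn_tunnels: int = 0,
                  sessions: int = 0,
                  load_balancing: bool = False) -> Optional[EdgeSize]:
    """Return the smallest size that meets every requirement, or None if none does."""
    for size in EDGE_SIZES:
        if (size.ipsec_tunnels >= ipsec_tunnels
                and size.sslvpn_tunnels >= sslvpn_tunnels
                and size.concurrent_sessions >= sessions
                and (size.supports_lb or not load_balancing)):
            return size
    return None


if __name__ == "__main__":
    choice = smallest_edge(ipsec_tunnels=2000, load_balancing=True)
    print(choice.name if choice else "No single Edge size fits")
```

This only covers a handful of rows; throughput and routing limits could be added to the same structure if they matter for your sizing.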

References:

https://www.vmware.com/files/pdf/products/nsx/vmw-nsx-network-virtualization-design-guide.pdf

https://pubs.vmware.com/NSX-6/index.jsp#com.vmware.nsx.admin.doc/GUID-3F96DECE-33FB-43EE-88D7-124A730830A4.html

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2042799

NSX Bytes: vCloud Director Can’t Deploy NSX Edges

Over the weekend I was tasked with the recovery of a #NestedESXi lab that had vCloud Director and NSX-v components as part of the lab platform. Rather than being a straightforward restore from the Veeam backup, I also needed to downgrade the NSX-v version from 6.2.4 to 6.1.4 for testing purposes. That process was relatively straightforward and involved essentially working backwards in terms of installing and configuring NSX and removing all the components from vCenter and the ESXi hosts.

To complete the NSX-v downgrade I deployed a new 6.1.4 appliance, connected it back up to vCenter, configured the hosts, set up VXLAN and the transport components, and tested NSX Edge deployments through the vCenter Web Client. However, when it came time to test Edge deployments from vCloud Director I kept on getting the error shown below.

Checking through the NSX Manager logs there was no reference to any API call hitting the endpoint as is suggested by the error detail above. Moving over to the vCloud Director Cells I was able to trace the error message in the log folder…eventually seeing the error generated below in the vcloud-container-info.log file.

As a test I hit the API endpoint referenced in the error message from a browser and got the same result.

This got me thinking that the error was either DNS related or permission related. After confirming that the vCloud Cells were resolving the NSX Manager host name correctly, I looked at permissions as the cause of the 403 error. vCloud Director was configured to use the service.vcloud service account to connect to the previous NSX/vShield Manager, and it dawned on me that I hadn’t set up user rights in the Web Client under Networking & Security. Under the Users section of the Manage Tab the service account used by vCloud Director wasn’t configured and needed to be added. After adding the user I retried the vCD job and the Edge deployed successfully.

While I was in this menu I thought I’d test what level of NSX User that service account needs in order to execute operations against vCloud Director and NSX. As shown below, anything but NSX Administrator or Enterprise Administrator triggered a “VSM response error (254). User is not authorized to access object” error.

At the very least to deploy edges, you require the service account to be NSX Administrator…The Auditor and Security Administrator levels are not enough to perform the operations required. More importantly don’t forget to add the service account as configured in vCloud Director to the NSX Manager instance otherwise you won’t be able to have vCloud Director deploy edges using NSX-v.
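
As a quick way to capture that finding, here is a small Python sketch that encodes which NSX user roles were sufficient in my test. The role names are the ones mentioned in this post; the function names and the idea of wrapping this in a pre-flight check are purely illustrative and not part of any VMware tooling.

```python
# Roles that allowed vCloud Director to deploy Edges in the test described above.
ROLES_ABLE_TO_DEPLOY = {"NSX Administrator", "Enterprise Administrator"}

# Roles that failed with the authorization error in the same test.
ROLES_TOO_RESTRICTED = {"Auditor", "Security Administrator"}

AUTH_ERROR = "VSM response error (254). User is not authorized to access object"


def can_deploy_edges(role: str) -> bool:
    """True if the given NSX role was sufficient for vCD-driven Edge deployment."""
    return role in ROLES_ABLE_TO_DEPLOY


def preflight(service_account: str, role: str) -> str:
    """Return a human-readable verdict for a service account and its NSX role."""
    if can_deploy_edges(role):
        return f"{service_account}: OK ({role})"
    return f"{service_account}: expect '{AUTH_ERROR}' with role {role}"


if __name__ == "__main__":
    print(preflight("service.vcloud", "Auditor"))
    print(preflight("service.vcloud", "NSX Administrator"))
```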

 

 

NSX Bytes: NSX 6.2.3 and vShield Endpoint Clarification

NSX-v 6.2.3 has been out for a couple of weeks now and besides the new features and bug fixes there was a significant change to the licensing structure for NSX. Previously there really wasn’t any concept of NSX editions…however 6.2.3 introduced new licensing tiers. As was announced in early May, NSX-v comes in Standard, Advanced and Enterprise editions. At the time there was still no public mention of what was to happen to existing vCloud Networking and Security customers utilizing vShield Endpoint…more so given that vCNS is to be end of lifed in September.

Looking through the release notes for NSX-v 6.2.3 there is a section that talks about the licensing and in addition to the three editions there is a default license which allows use of the vShield Endpoint feature…which is called Guest Introspection under NSX.

Change in default license & evaluation key distribution: default license upon install is “NSX for vShield Endpoint”, which enables use of NSX for deploying and managing vShield Endpoint for anti-virus offload capability only. Evaluation license keys can be requested through VMware sales.

Everyone who is entitled to the vSphere and vCloud suites will now download NSX instead of vCNS. Your use case will dictate which license you decide to apply, therefore unlocking different features of NSX…People will truly be running NSX everywhere…remembering that as of the current 6.1.x and 6.2.x releases the NSX Manager is a beefed up version of the vShield Manager. The good news is that people who are running vShield Endpoint services for Antivirus and other guest introspection tasks will be able to manage this through the Web Client.

In terms of what NSX parts need installing/upgrading from the vCNS bits, you only need to perform a Host Preparation and Guest Introspection install. There is no need to run NSX Controllers or configure VXLAN in order to run Endpoint services…if you want to be able to run those NSX features you will need to request specific NSX edition keys to suit your requirements.

For a complete rundown on NSX-v Licensing Edition features click here.

References:

http://pubs.vmware.com/Release_Notes/en/nsx/6.2.3/releasenotes_nsx_vsphere_623.html

NSX Bytes: Critical Update for NSX-v and vCNS

I generally don’t post about security releases, but after going through the notes on CVE-2016-2079 I thought it was important enough to dedicate a post to it, mainly because it could impact those running NSX Edge Services Gateways or vShield Edges with the SSL-VPN service enabled for clients.

Most vCloud Director based instances won’t have the SSL-VPN enabled due to it not being exposed through the vCD UI; however, some Service Providers may offer this as a managed service as it’s one of the strongest features of the Edge Gateways. The issue detailed in the CVE is summarized below.

VMware NSX and vCNS with SSL-VPN enabled contain a critical input validation vulnerability. This issue may allow a remote attacker to gain access to sensitive information.

In a nutshell you need to upgrade an existing version of NSX-v or vCNS to the version below. As per usual if you have the entitlements go ahead and download the updates from the links below.

  • NSX Edge: 6.2 -> 6.2.3
  • NSX Edge: 6.1 -> 6.1.7
  • vCNS Edge: 5.5 -> 5.5.4.3

NSX-v  Downloads: https://www.vmware.com/go/download-nsx-vsphere

vCNS Downloads: https://www.vmware.com/go/download-vcd-ns
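
If you have a number of Edges to check, the comparison against the fixed versions listed above is easy to script. The following Python sketch is only an illustration: the fixed version numbers come from the list in this post, while the product labels, function names and the way you obtain each Edge's version string are assumptions.

```python
# Minimum fixed version per product and release branch, as listed above.
FIXED_VERSIONS = {
    ("NSX Edge", (6, 2)): (6, 2, 3),
    ("NSX Edge", (6, 1)): (6, 1, 7),
    ("vCNS Edge", (5, 5)): (5, 5, 4, 3),
}


def parse_version(text: str) -> tuple:
    """Turn '5.5.4.3' into (5, 5, 4, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(product: str, version: str) -> bool:
    """True if the given Edge version is older than the fixed release for its branch."""
    parsed = parse_version(version)
    fixed = FIXED_VERSIONS.get((product, parsed[:2]))
    if fixed is None:
        raise ValueError(f"No fixed version listed for {product} {version}")
    return parsed < fixed


if __name__ == "__main__":
    for product, version in [("NSX Edge", "6.2.2"), ("NSX Edge", "6.1.7"), ("vCNS Edge", "5.5.4")]:
        verdict = "upgrade required" if needs_upgrade(product, version) else "already fixed"
        print(f"{product} {version}: {verdict}")
```

Branches not in the list (for example anything older than 5.5) raise an error rather than guessing, since the advisory summary above does not cover them.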

References:

http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-2079

Important – vCNS and NSX End of Availability and Support Notifications

For a while now we have known that vCloud Networking and Security’s days were numbered…with the release of NSX as a replacement (and then some) product, it had been communicated to current vCNS customers that an upgrade to NSX-v would be on the cards to ensure continued support and functionality. The date has now been set for the EOA of vCNS, and in somewhat of a surprise to me VMware also announced last week that NSX-v 6.1.x will reach end of availability later in the year.

VMware has announced the End of Availability (“EOA”) of the VMware vCloud Networking and Security 5.5.x which will commence on September 19, 2016

VMware has announced the End of Availability (“EOA”) of the VMware NSX for vSphere 6.1.x which will commence on October 15, 2016

In both cases the VMware KBs state that the products will continue to function. However, support will no longer be available, nor update releases or patches…so end of the day use at your own risk and don’t expect any help if the proverbial hits the fan.

The EOA and end of support for NSX-v 6.1.x, while a surprise, can be dealt with fairly easily by existing NSX-v customers. To get the most out of NSX-v in terms of the enhanced capabilities and features you should be running a version of 6.2.x, and there is a new major release just around the corner (to be announced later in the year possibly). The only current caveat is that upgrades from 6.1.5 to 6.2.0 are not supported…you must upgrade from 6.1.5 to NSX 6.2.1 or later to avoid a regression in functionality.
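
For anyone tracking this in a runbook, the dates and the one upgrade caveat can be captured in a few lines of Python. This is only a sketch: the EOA dates and the blocked 6.1.5 to 6.2.0 path come from this post, everything else (names, structure) is illustrative, and an empty warning list does not mean a path is officially supported.

```python
from datetime import date

# End of Availability dates quoted from the VMware announcements above.
EOA_DATES = {
    "vCNS 5.5.x": date(2016, 9, 19),
    "NSX-v 6.1.x": date(2016, 10, 15),
}

# Upgrade paths called out as unsupported in this post.
BLOCKED_UPGRADES = {("6.1.5", "6.2.0")}


def past_eoa(product_line: str, today: date) -> bool:
    """True once the EOA date for the product line has been reached."""
    return today >= EOA_DATES[product_line]


def upgrade_warnings(current: str, target: str) -> list:
    """Return warnings for a planned NSX-v upgrade; empty means nothing known from this post."""
    warnings = []
    if (current, target) in BLOCKED_UPGRADES:
        warnings.append(f"{current} -> {target} is not supported; go to 6.2.1 or later instead")
    return warnings


if __name__ == "__main__":
    print(past_eoa("vCNS 5.5.x", date.today()))
    print(upgrade_warnings("6.1.5", "6.2.0"))
```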

With regard to existing vCNS customers who are not Service Providers or have not gotten their hands on…let alone wrapped their heads around NSX-v this isn’t fantastic news. This Reddit post sums up some of the feeling out there in regards to the upgrade path for vCNS to NSX-v. To sum up the general feeling that I have come across…NSX-v is a lot more expensive than what vCNS was (in most cases it was part of the general vSphere/vCloud editions and bundles) and existing users of vCNS are finding it hard to justify that cost when considering the fact that some of the best NSX-v features are surplus to their requirements.

End of the day here there aren’t too many options for vCNS customers, but there is talk about VMware releasing an NSX-Lite version to satisfy the gap that exists between current customer requirements of vCNS features vs the all in nature of the NSX-v feature set…the clock is now ticking!

vCloud Director and vCNS:

Tom Fojta blogged earlier in the week that VMware have released an additional whitepaper for vCAT SP that goes through a vCNS upgrade to NSX in vCloud Director Environments. I’ve also covered that in my vCloud Director NSX Retrofit series here.

References:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144733

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144769

 

NSX Bytes: NSX Edge – High Availability Status Down

In doing some testing around NSX Edge deployment scenarios I came across a small quirk in the High Availability config for the NSX Edge Gateway, whereby after configuring HA from either the Web Client or through the REST API you will see the High Availability Status as Down in the Web Client even though it’s Enabled and you have two Edge Appliances deployed.

If you go to the CLI of either of the deployed Edge Appliances and run the show service highavailability command you will get the response shown below:

I did a search for Highavailability Healthcheck server is stopped and didn’t get any hits…hence me putting together this post to specifically tackle that message. However, looking back through my earlier post on Edge Gateway HA (http://anthonyspiteri.net/nsx-edge-vs-vshield-edge-part-2-high-availability/) I did make note of the fact you need at least one vNIC configured.

So, while this is not so much a quirk as a case of by design, the Edge High Availability Service will only kick in once the first Internal vNIC has been added and configured. If you have enabled HA after doing the initial interface configurations you won’t have this issue, as during the HA setup you are asked which vNIC to choose. If you enable HA without a vNIC configured the service won’t kick in until that vNIC is in play. Once this has been done the HA Service kicks in and configures both edges…if you run the show service command again you should now see the Highavailability Status as Running and details on the HA configuration of the NSX Edge pair.

Looking back at the Web Client you will now see the High Availability Service as Up.
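
The behaviour boils down to a simple rule, which the following Python sketch models. It is a simplified illustration of what I observed, not how the Edge implements it; the "Up" and "Down" strings mirror what the Web Client shows, while "Disabled" is my own placeholder for the case where HA has not been turned on.

```python
def expected_ha_status(ha_enabled: bool, internal_vnics_configured: int) -> str:
    """Predict the HA status shown for an Edge, based on the behaviour described above."""
    if not ha_enabled:
        return "Disabled"  # placeholder label, not taken from the Web Client
    if internal_vnics_configured < 1:
        # HA is enabled, but the service stays stopped until an Internal vNIC exists.
        return "Down"
    return "Up"


if __name__ == "__main__":
    print(expected_ha_status(ha_enabled=True, internal_vnics_configured=0))
    print(expected_ha_status(ha_enabled=True, internal_vnics_configured=1))
```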

For more info on understanding HA and some more troubleshooting steps @gabe_rosas has a great post here:

NSX Edge vs vShield Edge: Part 3 – IPsec and L2 VPN

Overview:

NSX and vShield Edges support site to site IPSec VPN between Edge instances and remote sites. Behind each remote VPN router, you can configure multiple subnets to connect to the internal network behind an Edge through IPSec tunnels. These subnets and the internal network behind the Edges must have address ranges that do not overlap. You can have a maximum of 64 tunnels across a maximum of 10 sites.

NSX Edges are also capable of L2 VPNs where you can stretch both VXLAN and VLAN across geographical sites…This allows VMs to remain on the same subnet when they are moved between sites with the IP addresses not changing. L2 VPN allows seamless migration of workloads backed by VXLAN or VLAN between physically separated locations. Specifically for Service Providers L2 VPN provides a mechanism to on-board tenants without modifying IP addresses for VM workloads.

In this post I am only going to go through IPsec VPN configuration…I feel there is a whole separate post required to do L2 VPN justice. The biggest difference between an NSX and vShield Edge when looking to configure VPNs is that when you are managing a vShield Edge you will not see the options to configure L2 VPN as shown in the configuration example below.

Configuring IPsec VPN From Web Client:

Configuration Items Required:

  • Local Endpoint
  • Local Subnets
  • Peer Endpoint
  • Peer Subnets
  • Encryption Algorithm and Authentication mechanism
  • Pre Shared Key
  • Diffie-Hellman Group
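
Before entering these items it can be worth checking them programmatically, in particular the requirement from the overview that local and peer subnets must not overlap. Below is a minimal Python sketch using the standard ipaddress module; the class and field names simply mirror the list above, and the sample values (documentation IP ranges, "aes256", "dh2") are placeholders rather than recommended settings.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IpsecSite:
    local_endpoint: str
    local_subnets: List[str]
    peer_endpoint: str
    peer_subnets: List[str]
    encryption_algorithm: str
    pre_shared_key: str
    dh_group: str


def overlapping_subnets(site: IpsecSite) -> List[Tuple[str, str]]:
    """Return every (local, peer) subnet pair whose address ranges overlap."""
    overlaps = []
    for local in site.local_subnets:
        for peer in site.peer_subnets:
            if ipaddress.ip_network(local).overlaps(ipaddress.ip_network(peer)):
                overlaps.append((local, peer))
    return overlaps


if __name__ == "__main__":
    site = IpsecSite(
        local_endpoint="192.0.2.1",          # placeholder public IP
        local_subnets=["10.0.1.0/24"],
        peer_endpoint="198.51.100.1",        # placeholder public IP
        peer_subnets=["10.0.2.0/24", "10.0.1.128/25"],
        encryption_algorithm="aes256",       # placeholder value
        pre_shared_key="change-me",
        dh_group="dh2",                      # placeholder value
    )
    print(overlapping_subnets(site))
```

An empty list means the subnets are safe to use together; any pairs returned need to be re-addressed or dropped from the tunnel definition.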

Double Click on the Edge under the NSX Edge Menu Option in Networking and Security. In the VPN Tab under Configuration click on Enable next to IPsec VPN Service Status and then hit Publish Changes.

To create a new Tunnel, click on + and enter in the details collected as per the items listed above.

Click ok and then Publish the Changes…from there the Status should show a green tick. Once the other side has been configured check to see that the Tunnel(s) are up by clicking on Show IPsec Statistics.

If both sides are happy you should be able to talk between the configured subnets. Shown below you see an example of a Site to Site with One Tunnel configured up…and one down.

Configuring IPsec VPN From vCloud Director UI:

For vShield Edges managed via vCloud Director, head to the vCD UI and go to Edge Gateways under Administration. Right Click on the Edge and select Configure Services. Under the VPN Tab you first want to Enable VPN and Configure the Public IPs.

Enter in the Public IP as shown above and click ok.

Click on Add and enter in the details collected. For Site to Site VPNs drop down the Establish VPN to: dropdown to a remote network and configure the rest of the settings.

Once done, you should see the Enabled and Status Column with green ticks.

A nice addition to the vCD UI (sometimes the UI team gets things right) is the Peer Settings Button which shows you the bits required to configure the other end of the connection.

Enabling/Disabling/Viewing IPsec With REST API:

Below are the key API commands to configure and manage IPsec VPN.
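
As a rough guide, the Python sketch below builds the requests for viewing, replacing and deleting the IPsec configuration and for pulling tunnel statistics. The `/api/4.0/edges/{edgeId}/ipsec/config` and `/ipsec/statistics` paths are from my recollection of the NSX-v API guide, so treat them as assumptions and verify them against the guide for your version; the manager hostname, edge ID and credentials are placeholders. The helper only constructs the request and does not send it.

```python
import base64
import urllib.request
from typing import Optional


def ipsec_request(manager: str, edge_id: str, user: str, password: str,
                  method: str = "GET", resource: str = "config",
                  body: Optional[bytes] = None) -> urllib.request.Request:
    """Build (but do not send) a request against an Edge's IPsec endpoint."""
    # Assumed path layout: /api/4.0/edges/<edgeId>/ipsec/<config|statistics>
    url = f"https://{manager}/api/4.0/edges/{edge_id}/ipsec/{resource}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, data=body, method=method)
    request.add_header("Authorization", f"Basic {token}")
    if body is not None:
        request.add_header("Content-Type", "application/xml")
    return request


if __name__ == "__main__":
    # Placeholder manager, edge ID and credentials.
    view = ipsec_request("nsxmanager.lab.local", "edge-1", "admin", "secret")
    stats = ipsec_request("nsxmanager.lab.local", "edge-1", "admin", "secret", resource="statistics")
    disable = ipsec_request("nsxmanager.lab.local", "edge-1", "admin", "secret", method="DELETE")
    for req in (view, stats, disable):
        print(req.get_method(), req.full_url)
    # To actually send one: urllib.request.urlopen(view) (certificate handling is up to you).
```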

NSX Edge vs vShield Edge: Part 2 – High Availability

Overview:

High Availability in both VSE and NSX Edges ensures Edge Network Services are always available by deploying a pair of Edge Appliances that work together in an active/passive HA cluster Pair. The primary appliance is in the active state and the secondary appliance is in the standby state. The configuration of the primary appliance is replicated to the standby appliance.

All Edge services run on the active appliance. The primary appliance maintains a heartbeat with the standby appliance and sends service updates through an internal interface. Declared Dead Time is used to work out via Heartbeating between both appliances when a HA event should take place. If the primary is declared dead the standby appliance moves to the active state and takes over the interface configuration of the primary.
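
The Declared Dead Time logic can be pictured with a tiny Python sketch. This is a simplified model of the idea described above, not the Edge's actual implementation; the function name is made up, and the 15 second default is the value the Web Client presents (some documentation quotes 6 seconds).

```python
def standby_should_take_over(last_heartbeat: float, now: float,
                             declare_dead_time: float = 15.0) -> bool:
    """True when no heartbeat has been seen for longer than the declare dead time.

    Timestamps are in seconds (for example from time.monotonic()).
    """
    return (now - last_heartbeat) > declare_dead_time


if __name__ == "__main__":
    # Heartbeat last seen at t=100s; check at t=110s and t=120s.
    print(standby_should_take_over(100.0, 110.0))  # still within the dead time
    print(standby_should_take_over(100.0, 120.0))  # dead time exceeded
```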

For both NSX and VSE managed via the NSX Manager, HA can be triggered by the vCenter Web Client or API. The VSE can also have HA triggered through the vCloud Director UI or API.

Configuring NSX/VSE HA From Web Client:

Double Click on the Edge under the NSX Edge Menu Option in Networking and Security. In the Settings Tab under Configuration click on Change in the HA Configuration Box.

Click on Enable and leave the rest of the settings as default. You do have the option to select the vNIC if multiple Interfaces exist. Leaving it as default is a safe option. Almost all documentation I have read on the default Declare Dead Time states that it is 6 seconds, however in the Web Client it defaults to 15. You also have the ability to configure specific IPs to use as Management or Cluster IPs for each HA Pair.

At this point a second Edge Appliance will be deployed into the vCenter and you will see an Edge appliance with -1 appended to the name. As shown below the NSX Manager will initiate the creation of a DRS Anti Affinity Rule to keep the Edges separate.

Shown above is an example of both an NSX and vShield Edge and their anti affinity rule configured.

NOTE: For the HA settings to be applied to both Appliances at least one Interface (excluding Uplink) needs to be configured. If you don’t have an Interface configured the HighAvailability Service status on the Edge will be set to not running.

Configuring VSE HA From vCloud Director UI:

Depending on your Level of access to External Networks, right click on the Edge in the vCD UI and click on the Enable High Availability Check Box as shown below.

Enabling/Disabling/Viewing NSX/VSE HA With REST API

Below are the key API commands to configure and manage HA.
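
As a hedged summary of the calls involved, the Python sketch below maps each HA operation to an HTTP method and URL. The `/api/4.0/edges/{edgeId}/highavailability/config` path is from my recollection of the NSX-v API guide and should be checked against the guide for your version; the operation names, manager hostname and edge ID are placeholders, and the XML body needed for the PUT is deliberately left out.

```python
# Operation name -> HTTP method against the (assumed) HA config endpoint.
HA_OPERATIONS = {
    "view": "GET",
    "enable_or_update": "PUT",   # requires an XML body describing the HA settings
    "disable": "DELETE",
}


def ha_call(manager: str, edge_id: str, operation: str) -> tuple:
    """Return the (method, url) pair for an HA operation on the given Edge."""
    method = HA_OPERATIONS[operation]
    url = f"https://{manager}/api/4.0/edges/{edge_id}/highavailability/config"
    return method, url


if __name__ == "__main__":
    for name in HA_OPERATIONS:
        method, url = ha_call("nsxmanager.lab.local", "edge-1", name)
        print(f"{name:>16}: {method} {url}")
```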





There is nothing fundamentally enhanced in the NSX HA vs VSE; it’s a simple…easy to enable feature that adds a level of availability to Edge Networking services.

Sources and More Reading:

http://blogs.vmware.com/vsphere/2013/03/vcloud-networking-and-security-5-1-edge-gateway-high-availability.html

https://pubs.vmware.com/NSX-6/index.jsp#com.vmware.nsx.admin.doc/GUID-6C4F0C33-C6DD-432B-AA91-10AD6B449125.html

http://nsxtech.net/2014/09/20/understanding-high-availability-on-the-nsx-edge-services-gateway/

http://lostdomain.org/2014/10/18/vmware-nsx-best-practices-from-vmworld/

NSX Edge vs vShield Edge: Part 1 – Feature and Performance Matrix

I was having a discussion internally about why we were looking to productize the NSX Edges for our vCloud Director Virtual Datacenter offering over the existing vCNS vShield Edges. A quick search online didn’t come up with anything concrete so I’ve decided to list out the differences as concisely as possible.

This post will go through a basic side by side comparison of the features and performance numbers…I’ll then extend the series to go into specific differences between the key features. As a reminder vCloud Director is not NSX aware just yet, but through some retrofitting you can have NSX Edges providing network services for vCD Datacenters.

Firstly…what is an Edge device?

The Edge Gateway (NSX-v or vCNS) connects isolated, stub networks to shared (uplink) networks by providing common gateway services such as DHCP, VPN, NAT, dynamic routing (NSX Only) , and Load Balancing. Common deployments of Edges include in the DMZ, VPN Extranets, and multi-tenant Cloud environments where the Edge creates virtual boundaries for each tenant.

Below is a list of services provided by each version. The + signifies an enhanced version of the service offered by the NSX Edge.

Service Description vShield Edge NSX Edge
Firewall Supported rules include IP 5-tuple configuration with IP and port ranges for stateful inspection for all protocols
NAT Separate controls for Source and Destination IP addresses, as well as port translation
DHCP Configuration of IP pools, gateways, DNS servers, and search domains ✔+
Site to Site VPN Uses standardized IPsec protocol settings to interoperate with all major VPN vendors
SSL VPN SSL VPN-Plus enables remote users to connect securely to private networks behind a NSX Edge gateway ✔+
Load Balancing Simple and dynamically configurable virtual IP addresses and server groups ✔+
High Availability High availability ensures an active NSX Edge on the network in case the primary NSX Edge virtual machine is unavailable ✔+
Syslog Syslog export for all services to remote servers
L2 VPN Provides the ability to stretch your L2 network.
Dynamic Routing Provides the necessary forwarding information between layer 2 broadcast domains, thereby allowing you to decrease layer 2 broadcast domains and improve network efficiency and scale. Provides North-South connectivity, thereby enabling tenants to access public networks.

Below is a table that shows the different sizes of each edge appliance and what (if any) impact that has to the performance of each service. As a disclaimer the below numbers have been cherry picked from different sources and are subject to change…I’ll keep them as up to date as possible.

vShield Edge (Compact) vShield Edge (Large) vShield Edge (X-Large) NSX Edge (Compact) NSX Edge (Large) NSX Edge (Quad-Large) NSX Edge (X-Large)
vCPU 1 2 2 1 2 4 6
Memory 256MB 1GB 8GB 512MB 1GB 1GB 8GB
Disk 320MB 320MB 4.4GB 512MB 512MB 512MB 4.5GB
Interfaces 10 10 10 10 10 10 10
Sub Interfaces (Trunk)  –  –  – 200 200 200 200
NAT Rules 2000 2000 2000 2000 2000 2000 2000
FW Rules 2000 2000 2000 2000 2000 2000 2000
DHCP Pools 10 10 10 20,000 20,000 20,000 20,000
Static Routes 100 100 100 2048 2048 2048 2048
LB Pools 64 64 64 64 64 64 64
LB Virtual Servers 64 64 64 64 64 64 64
LB Server / Pool 32 32 32 32 32 32 32
IPSec Tunnels 64 64 64 512 1600 4096 6000
SSLVPN Tunnels 25 100 50 100 100 1000
Concurrent Sessions 64,000 1,000,000  1,000,000 64,000 1,000,000 1,000,000 1,000,000
Sessions/Second 8,000 50,000
LB Connections/s (L7 Proxy) 46,000 50,000
LB Concurrent Connections (L7 Proxy) 8,000 60,000
LB Connections/s (L4 Mode) 50,000 50,000
LB Concurrent Connections (L4 Mode) 600,000 1,000,000
BGP Routes 20,000 50,000 250,000 250,000
BGP Neighbors 10 20 50 50
BGP Routes Redistributed No Limit No Limit No Limit No Limit
OSPF Routes 20,000 50,000 100,000 100,000
OSPF Adjacencies 10 20 40 40
OSPF Routes Redistributed 2000 5000 20,000 20,000
Total Routes 20,000 50,000 250,000 250,000

Note: I still have a few numbers to complete specifically around NSX Edge Load Balancing and I’m also trying to chase up throughput numbers for Firewall and LB.

From the table above it’s clear to see that the NSX Edge provides advanced networking services and higher levels of performance. Dynamic Routing is a huge part of the reason why, and an NSX Edge fronting a vCloud vDC opens up so many possibilities for true Hybrid Cloud.

vCNS’s future is a little cloudy, with vCNS 5.1 going EOL last September and 5.5 only available through the vCloud Suite with support ending on 19/09/2016. When you deploy edges with vCloud Director (or in vCloud Air On Demand) you deploy the 5.5.x version so short term understanding the differences is still important…however the future lies with the NSX Edge so don’t expect the VSE numbers to change or features to be added.

References:

https://www.vmware.com/files/pdf/products/nsx/vmw-nsx-network-virtualization-design-guide.pdf

https://pubs.vmware.com/NSX-6/index.jsp#com.vmware.nsx.admin.doc/GUID-3F96DECE-33FB-43EE-88D7-124A730830A4.html

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2042799

NSX vCloud Retrofit: Upgrade Issue – Edge Gateway Unmanageable in vCloud Director or Deployment Fails

We have been working with VMware GSS on an issue for a number of weeks whereby we were seeing some vShield Edge devices go into an unmanageable state from within the vCloud Director Portal. In a nutshell some VSEs (version 5.5.3) were stuck in a Configuring Loop upon the committal of a Service Config change. Subsequent reboots of the NSX Manager or vCloud Director Cells did not result in the VSE coming out of this state. While the VSE was not able to be managed from vCD the Edge Services were still functional…i.e. traffic was passing through and all existing rules and features were working as expected.

Looking at the vCD Logs the following entry was seen:

We also saw issues deploying some VSEs from vCloud Director where Deployment of edge gateways failed.

If the failed attempt was retried via a redeployment action the following was seen in the vCD logs with the vCD GUI stuck showing Redeploying Edge in Progress.

Heading over to the NSX Manager logs we came across the following error log entry being constantly written to the system manager logs…in fact we were seeing this message pop up approximately 25,000 times a day across three NSX instances.

The VIX API:

The NSX Manager…and vShield Manager before it…uses the VIX API, going through vCenter, the ESXi Host running the Edge VMs and VMTools, to query the status of the Edges. Tom Fojta has written a great article on the legacy VIX method and how it’s changed in NSX via a new Message Bus technique.

Searching for the VIX_E_FILE_NOT_FOUND error online, it would seem that the NSX Manager was having issues talking to a subset of VSE 5.5.3 edges. It was noted by GSS that this was not happening for all VSEs and there were no instances of this happening on the NSX Edge Gateways (ESG 6.1.x). Storage was first suspected as being the cause of the issue, so we spent a good deal of time working through ESXi logs and Storage vMotioning the VSEs and NSX Managers to rule out storage. Once that was done, GSS took the case to the NSX Engineering team for further analysis. Engineering took an Export of one of my NSX Edges (uploading 10GB worth of OVA is a challenge) to try and work out what was happening and why.

The Cause:

The VSE’s VM UUID as seen from the NSX Manager database somehow becomes different to that listed in the vCenter Inventory…causing the error messages.
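
Conceptually the check is a straight comparison of two inventories. The Python sketch below shows the idea, assuming you have somehow exported a mapping of VM identifier to UUID from each side; how you obtain those exports is outside this sketch, and the identifiers and UUIDs in the example are invented.

```python
def find_uuid_mismatches(nsx_db: dict, vcenter: dict) -> list:
    """Return the VM IDs whose UUID in the NSX Manager database differs from vCenter."""
    return sorted(
        vm_id
        for vm_id, uuid in nsx_db.items()
        if vm_id in vcenter and vcenter[vm_id].lower() != uuid.lower()
    )


if __name__ == "__main__":
    # Invented sample data for illustration only.
    nsx_db = {"vm-1001": "420a1111-aaaa", "vm-1002": "420a2222-bbbb"}
    vcenter = {"vm-1001": "420A1111-AAAA", "vm-1002": "420a9999-cccc"}
    print(find_uuid_mismatches(nsx_db, vcenter))
```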

The Fix:

There are a couple of options available to resolve the UUID Mismatch.

The self service workaround:
Attempt a redeployment of all VSEs that report the issue. You can get a list by grabbing logs from the NSX Manager and noting down the vm-xxxxxx identifiers as shown above. From there…head to vCD (Not the Networking & Security Edge section – this will redeploy NSX 6.1.2 Edges) and Click on Redeploy from the Edge Gateway Menu. The only risk with this is that the VSE might get stuck in a Redeploying state resulting in a time-out. Another thing to note with this option is that client services will be affected during the redeployment of the VSE while the new Edge is deployed and the config transferred across.
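
With around 25,000 entries a day, pulling the identifiers out by hand is not much fun. Here is a minimal Python sketch that collects the unique vm-xxxxxx identifiers from any log line mentioning VIX_E_FILE_NOT_FOUND. The sample lines in the example are invented to exercise the regex and do not reproduce the real NSX Manager log format.

```python
import re
from typing import Iterable, List

VM_ID = re.compile(r"\bvm-\d+\b")


def affected_vm_ids(log_lines: Iterable[str]) -> List[str]:
    """Return the sorted, unique vm-<number> IDs found on VIX_E_FILE_NOT_FOUND lines."""
    ids = set()
    for line in log_lines:
        if "VIX_E_FILE_NOT_FOUND" in line:
            ids.update(VM_ID.findall(line))
    return sorted(ids)


if __name__ == "__main__":
    # Invented sample lines; the real log layout will differ.
    sample = [
        "ERROR health check failed for vm-2002: VIX_E_FILE_NOT_FOUND",
        "INFO health check ok for vm-3003",
        "ERROR health check failed for vm-1001: VIX_E_FILE_NOT_FOUND",
        "ERROR health check failed for vm-2002: VIX_E_FILE_NOT_FOUND",
    ]
    print(affected_vm_ids(sample))
```

To run it against an exported log file, pass an open file handle, since iterating a file yields its lines.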

VMware GSS Database Fix:
If you are seeing these errors in your NSX Manager logs, raise an SR with VMware and they will execute a simple one line SQL Query to alter the UUID of the VMs that don’t match vCenter and update them. Once that’s done the errors go away and the potential for VSEs to go into this state is removed.

Further Info and RCA:

VMware GSS together with NSX Engineering are still investigating the cause of the issue, but this seems to be a symptom (though not confirmed) of an in-place vCNS to NSX upgrade, and there are no specific factors that seem to trigger this behaviour…the assumption is that this is a bug that comes into play after an upgrade from vCNS with existing VSE 5.5.3 instances. It’s also interesting that the worst symptoms of the issue (apart from the silly amount of logs generated), namely the VSE going into an unmanageable state or the deployment failing, happen intermittently. There is no scientific reason why…but the trigger seems to be any action in vCD on a VSE (new or existing) that executes a config change…if this is done during a health check by the NSX Manager it could leave the VSE in the undesired state.

For those interested, the version numbers where the issue was picked up are listed below.

Platform Versions:

  • vCenter 5.5 Update 2 Build 2001466
  • ESXi 5.5 Update 2 Build 2456374
  • vCloud Director 5.5.2 Builds 2000523 and 2233543
  • NSX-v 6.1.2 Build 2318232
  • VSE 5.5.3 Build 2175697