Late last year I was load testing against a new storage platform using both physical and nested ESXi hosts…at the time I noticed decreased network throughput when the Load Test VMs were hosted on the nested hosts. I wrote this post and reached out to William Lam, who responded with an explanation of what was happening and why promiscuous mode was required for nested ESXi installs.
http://www.virtuallyghetto.com/2013/11/why-is-promiscuous-mode-forged.html
Fast forward to VMworld 2014: in a discussion I had with William at The W Bar (where lots of great discussions are had) after the Official Party, he mentioned that a new Fling was about to be released that addresses the issues with nested ESXi hosts and promiscuous mode enabled on the virtual switches. As William explains in his new blog post, he took the problem to VMware Engineering, who were having similar issues in their R&D labs and came up with a workaround…that workaround is now an official Fling! Apart from feeling a little bit chuffed that I sparked interest in a problem that has now resulted in a fix, I decided to put it to the test in my lab.
I ran the same tests that I ran last year. Running one load test on an ESXi 5.5 host nested on a physical 5.5 host, I saw equal network utilization across all six nested hosts.
The Load VM was only able to push 15-17MBps on a random read test. As William saw in his post, ESXTOP shows you more about what's happening:
Roughly even network throughput across all NICs on all Hosts that are set for Promiscuous Mode…overall throughput is reduced
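If you want to pull the same comparison from PowerCLI rather than watching esxtop, the snippet below is a minimal sketch that reads the realtime network usage counter for each nested host VM…it assumes the nested hosts sit in a resource pool and uses the same NAME placeholder as the commands further down:

# Realtime network usage (KBps) per nested ESXi VM - with promiscuous mode the
# duplicated traffic shows up as similar numbers on every nested host
Get-ResourcePool NAME | Get-VM | where {$_.Name -match "ESX"} | ForEach-Object {
    $samples = Get-Stat -Entity $_ -Stat "net.usage.average" -Realtime -MaxSamples 3
    [pscustomobject]@{
        VM      = $_.Name
        NetKBps = [math]::Round(($samples | Measure-Object -Property Value -Average).Average)
    }
}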
After installing the VIB on the physical host, you have to add the Advanced Virtual Machine settings to each nested host to enable the MAC learning. Unless you do this via an API call you will need to shut down the VM to edit the VMX/config, so I worked through a set of PowerCLI commands to bulk add the Advanced Settings to running nested hosts…these are shown below (after a quick aside on pushing the VIB itself with PowerCLI) and work for any VM in a resource pool whose name matches "ESX" and that has two NICs.
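First the aside: the VIB install on the physical host can be driven from PowerCLI instead of over SSH. This is only a rough sketch…it assumes a PowerCLI version that supports Get-EsxCli -V2, and the host name and VIB path are placeholders (the actual VIB file comes from the Fling download):

# Placeholder host name and VIB path - substitute your own
$esxcli  = Get-EsxCli -VMHost (Get-VMHost "physicalhost.lab.local") -V2
$vibArgs = $esxcli.software.vib.install.CreateArgs()
$vibArgs.viburl = "/vmfs/volumes/datastore1/esx-dvfilter-maclearn.vib"   # copy the Fling VIB to a datastore first
$vibArgs.force  = $true                                                  # it is a community-supported VIB
$esxcli.software.vib.install.Invoke($vibArgs)

# Confirm it landed
$esxcli.software.vib.list.Invoke() | where {$_.Name -match "maclearn"}

With the dvFilter on the physical host, the per-VM Advanced Settings can then be bulk added and checked as follows: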
Get-ResourcePool NAME | get-vm | where {$_.Name -match "ESX"} | Get-AdvancedSetting | where {$_.Name -match "Ethernet"}

Name                 Value              Type  Description
----                 -----              ----  -----------
ethernet0.pciSlot... 32                 VM
ethernet1.pciSlot... 33                 VM
ethernet0.pciSlot... 32                 VM
ethernet1.pciSlot... 33                 VM
ethernet0.pciSlot... 32                 VM
ethernet1.pciSlot... 33                 VM
ethernet0.pciSlot... 32                 VM
ethernet1.pciSlot... 33                 VM
ethernet0.pciSlot... 32                 VM
ethernet1.pciSlot... 33                 VM
ethernet0.pciSlot... 32                 VM
ethernet1.pciSlot... 33                 VM

Get-ResourcePool NAME | get-vm | where {$_.Name -match "ESX"} | new-AdvancedSetting -name ethernet0.filter4.name -value dvfilter-maclearn
Get-ResourcePool NAME | get-vm | where {$_.Name -match "ESX"} | new-AdvancedSetting -name ethernet1.filter4.name -value dvfilter-maclearn
Get-ResourcePool NAME | get-vm | where {$_.Name -match "ESX"} | new-AdvancedSetting -name ethernet0.filter4.onFailure -value failOpen
Get-ResourcePool NAME | get-vm | where {$_.Name -match "ESX"} | new-AdvancedSetting -name ethernet1.filter4.onFailure -value failOpen

Get-ResourcePool NAME | get-vm | where {$_.Name -match "ESX"} | Get-AdvancedSetting | where {$_.Name -match "filter"} | ft

Name                 Value              Type  Description
----                 -----              ----  -----------
vmci.filter.enable   true               VM
ethernet0.filter4... dvfilter-maclearn  VM
ethernet1.filter4... dvfilter-maclearn  VM
ethernet0.filter4... failOpen           VM
ethernet1.filter4... failOpen           VM
...
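For what it's worth, the four new-AdvancedSetting calls can be collapsed into a single loop so the nested host VMs are only enumerated once…a minimal sketch using the same NAME placeholder and still assuming two NICs per nested host:

$nestedVMs = Get-ResourcePool NAME | Get-VM | where {$_.Name -match "ESX"}
foreach ($vm in $nestedVMs) {
    foreach ($nic in 0..1) {   # two NICs per nested host
        New-AdvancedSetting -Entity $vm -Name "ethernet$($nic).filter4.name" -Value "dvfilter-maclearn" -Confirm:$false
        New-AdvancedSetting -Entity $vm -Name "ethernet$($nic).filter4.onFailure" -Value "failOpen" -Confirm:$false
    }
}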
Checking back in on ESXTOP, the filter looks to have an instant effect: only the nested host generating the traffic shows significant network throughput…the other hosts are doing nothing and I am now seeing about 85-90MBps against the load test.
Taking a look at the network throughput graphs (below) you can see an example of two nested hosts in the group with the same throughput until the dvFilter was installed, at which point traffic dropped away on the host not running the load test, while throughput increased almost fivefold on the host running the test.
The effect on nested host CPU utilization is also dramatic. Only the host generating the load has significant CPU usage while the other hosts return to normal operations…meaning the physical host's CPUs are not working as hard overall.
As William mentions in his post, this is a no-brainer install for anyone using nested ESXi hosts for lab work…thinking about the further implications of this fix, I can see the possibility of supporting full nested environments within Virtual Data Centers without the fear of increased host CPU usage and decreased network throughput…for that to happen, though, VMware would need to change their stance on the supportability of nested ESXi environments…but this Fling, together with the VMware Tools Fling for nested ESXi, certainly makes nested hosts all the more viable.