
ESXi 5.x NFS IOPS Limit Bug – Latency and Performance Hit

There is another NFS bug hidden in the latest ESXi 5.x releases…while not as severe as the 5.5 Update 1 NFS Bug, it’s been the cause of increased Virtual Disk Latency and overall poor VM performance across a couple of the environments I manage.

The VMware KB article referencing the bug can be found here:

Symptoms

  • When virtual machines run on the same host and use the same NFS datastore, and an IOPS limit is set on at least one virtual machine, you experience high virtual disk latency and low virtual disk performance.
  • Even when different IOPS limits are specified for different virtual machines, the IOPS limit on all virtual machines is set to the lowest value assigned to any virtual machine on the NFS datastore.
  • You do not experience the virtual disk latency and low virtual disk performance issues when virtual machines:
    • Reside on a VMFS datastore.
    • Run on the same ESXi host, but are placed on an NFS datastore in which none of the virtual machines have an IOPS limit set.
    • Run on the same ESXi host, but are placed on different NFS datastores that do not share the client connection.
    • Run on different ESXi hosts, but are placed on the same NFS datastore.

So, in a nutshell: if you have a large NFS Datastore with many VMs that have IOPS Limits set to protect against Noisy Neighbours, you may experience what looks like unexplained VM Latency and overall bad performance.
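For anyone who scripts these limits rather than clicking through the vSphere Client, here is a rough pyVmomi sketch of how a per-virtual-disk IOPS limit gets applied. The vCenter address, credentials and the VM name “web01” are placeholders rather than anything from our environment, and the 1000 IOPS value simply matches the example VM further down.

```python
# Hedged sketch: set a 1000 IOPS limit on every virtual disk of a VM.
# "vcsa.lab.local" and "web01" are hypothetical placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcsa.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Find the VM by name
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "web01")
view.Destroy()

# Build a reconfigure spec that edits the IOPS limit on each virtual disk
changes = []
for dev in vm.config.hardware.device:
    if isinstance(dev, vim.vm.device.VirtualDisk):
        dev.storageIOAllocation = vim.StorageResourceManager.IOAllocationInfo(limit=1000)
        changes.append(vim.vm.device.VirtualDeviceSpec(
            operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
            device=dev))

vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=changes))
Disconnect(si)
```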

Some proof of this bug can be seen in the screenshot below, where a VM residing on an NFS datastore with IOPS limits applied was exhibiting high Disk Command Latency. It had an IOPS limit of 1000 and wasn’t being constrained by that setting…yet it had Disk Command Latency in the 100s. The red arrow represents the point at which we migrated the VM to another host. Straight away the latency dropped and the VM returned to expected performance levels. This matches a couple of the symptoms above…
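If you prefer to pull that same latency data with a script rather than the performance charts, a rough pyVmomi sketch is below. I’m assuming the “Disk Command Latency” chart maps to the disk.totalLatency.average counter; if that counter isn’t exposed for your VM, the virtualDisk read/write latency counters are the alternative. Connection details and the VM name are placeholders again.

```python
# Hedged sketch: pull recent real-time samples of disk.totalLatency.average
# (assumed to be the counter behind the "Disk Command Latency" chart) for a VM.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcsa.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "web01")   # hypothetical VM name
view.Destroy()

perf = content.perfManager
# Map "group.name.rollup" strings to counter IDs
counters = {"%s.%s.%s" % (c.groupInfo.key, c.nameInfo.key, c.rollupType): c.key
            for c in perf.perfCounter}

query = vim.PerformanceManager.QuerySpec(
    entity=vm,
    metricId=[vim.PerformanceManager.MetricId(
        counterId=counters["disk.totalLatency.average"], instance="*")],
    intervalId=20,      # 20-second real-time samples
    maxSample=15)       # roughly the last five minutes

for series in perf.QueryPerf(querySpec=[query])[0].value:
    # Latency counters are reported in milliseconds
    print(series.id.instance, series.value)

Disconnect(si)
```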

We also experimented by removing all IOPS Disk Limits on a subset of NFS datastores and looked at the effect that had on overall latency. The results were almost instant, as you can see below. Both peaks represent us removing and then adding back the IOPS Limits.
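The removal step can be scripted in the same way as setting the limit. Below is a hedged sketch that clears the limit on every disk backed by one datastore; “NFS-DS-01” is a hypothetical name, and a limit of -1 is what the vSphere API treats as unlimited.

```python
# Hedged sketch: set the IOPS limit to unlimited (-1) on every virtual disk
# that lives on one NFS datastore. Names and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcsa.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

ds_view = content.viewManager.CreateContainerView(content.rootFolder,
                                                  [vim.Datastore], True)
ds = next(d for d in ds_view.view if d.name == "NFS-DS-01")
ds_view.Destroy()

for vm in ds.vm:                      # every VM with storage on this datastore
    changes = []
    for dev in vm.config.hardware.device:
        if (isinstance(dev, vim.vm.device.VirtualDisk)
                and getattr(dev.backing, "datastore", None) == ds):
            dev.storageIOAllocation = vim.StorageResourceManager.IOAllocationInfo(limit=-1)
            changes.append(vim.vm.device.VirtualDeviceSpec(
                operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
                device=dev))
    if changes:
        vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=changes))

Disconnect(si)
```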

As we are running ESXi 5.1 hosts, we applied the latest patch release (ESXi510-201406001), which includes ESXi510-201404401-BG and addresses the bug in 5.1. After applying that we saw a noticeable drop in overall latency on the previously affected NFS datastores.

Annoyingly there is no patch available for ESXi 5.5, but I have been told by VMware Support that a fix is due as of Update 2 for 5.5…no time frame on that release though.

One thing I’m interested in comments on is VM Virtual Disk IOPS limits… They’re designed to lessen the impact of noisy neighbours…but what overall effect can they have, if any, on LUN-based latency? Or do they self-contain the latency to the VM that’s restricted? I assume it works differently to SIOC and doesn’t choke the disk queue depth? Does the IOPS limit simply put the brakes on any Virtual Disk IO?
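To make the distinction I’m asking about a bit more concrete, here is a purely conceptual sketch. It is not how ESXi implements either mechanism, just an illustration of the two models: a per-VM IOPS cap that only delays the capped VM’s own IO, versus an SIOC-style control that shrinks a queue shared by every VM on the datastore.

```python
# Purely conceptual -- NOT ESXi's implementation. Contrasts a per-VM IOPS cap
# (only the capped VM waits) with a shared-queue-depth control (everyone waits).

class PerVmIopsCap:
    """Simple pacing model: the capped VM's IOs get spaced out in time."""
    def __init__(self, iops_limit):
        self.interval = 1.0 / iops_limit   # seconds between permitted IOs
        self.next_slot = 0.0

    def delay_for_io(self, now):
        wait = max(0.0, self.next_slot - now)          # the delay hits this VM only
        self.next_slot = max(now, self.next_slot) + self.interval
        return wait


class SharedQueueDepth:
    """SIOC-style model: one queue depth for the whole datastore."""
    def __init__(self, depth):
        self.depth = depth
        self.outstanding = 0

    def try_issue(self):
        # Any VM's IO is blocked once the shared queue is full, noisy or not.
        if self.outstanding >= self.depth:
            return False
        self.outstanding += 1
        return True

    def complete(self):
        self.outstanding -= 1


# Example: a VM capped at 1000 IOPS sees at most ~1ms of pacing delay per IO,
# while shrinking a shared queue depth slows every VM on the datastore.
cap = PerVmIopsCap(1000)
print(cap.delay_for_io(now=0.0), cap.delay_for_io(now=0.0))   # 0.0 then 0.001
```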

Nested ESXi – Reduced Network Throughput with Promiscuous Mode PortGroups

We have been conducting performance and stress testing of a new NFS-connected storage platform over the past month or so, and through that testing we have seen some interesting behaviours in terms of what affects overall performance relative to total network throughput vs latency vs IOPS.

Using a couple of stress test scripts, we have been able to bottleneck the performance at the network layer…that is to say, all conditions being equal, we can max out a 10Gig Uplink across a couple of hosts, resulting in throughput of 2.2GB/s or ~20,000Mbits/s on the switch. Effectively the backend storage (at this stage) is only limited by the network.

To perform some “real world” testing I deployed a Nested ESXi Lab comprising our vCloud Platform into the environment. For the Nested ESXi hosts’ (ESXi 5.0 Update 2) networking I backed 2x vNICs with a Distributed Switch PortGroup configured as a trunk accepting all VLANs and accepting Promiscuous Mode traffic.
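For anyone wanting to script that portgroup configuration, a hedged pyVmomi sketch is below. The portgroup name “dvPG-NestedTrunk” and the vCenter details are placeholders; it just shows one way to set the all-VLAN trunk plus the Promiscuous Mode / Forged Transmits policy that nested ESXi relies on (see the link at the end of this post).

```python
# Hedged sketch: reconfigure an existing distributed portgroup to trunk all
# VLANs and allow Promiscuous Mode / Forged Transmits for nested ESXi vNICs.
# "vcsa.lab.local" and "dvPG-NestedTrunk" are hypothetical placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcsa.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
pg = next(p for p in view.view if p.name == "dvPG-NestedTrunk")
view.Destroy()

port_cfg = vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy()
# Trunk the full VLAN range through to the nested hosts' vNICs
port_cfg.vlan = vim.dvs.VmwareDistributedVirtualSwitch.TrunkVlanSpec(
    inherited=False,
    vlanId=[vim.NumericRange(start=0, end=4094)])
# Security policy: allow Promiscuous Mode and Forged Transmits
port_cfg.securityPolicy = vim.dvs.VmwareDistributedVirtualSwitch.SecurityPolicy(
    inherited=False,
    allowPromiscuous=vim.BoolPolicy(inherited=False, value=True),
    forgedTransmits=vim.BoolPolicy(inherited=False, value=True),
    macChanges=vim.BoolPolicy(inherited=False, value=False))

spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec(
    configVersion=pg.config.configVersion,   # required for a reconfigure
    defaultPortConfig=port_cfg)
pg.ReconfigureDVPortgroup_Task(spec)
Disconnect(si)
```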

While continuing with the load testing in the parent environment (not within the nested ESXi) we started to see throughput issues while running the test scripts. Where we had been getting 1-1.2GB/s of throughput per host, all of a sudden that had dropped to 200-300MB/s. Nothing else had changed on the switching side, so we were left scratching our heads.

From the above performance graph, you can see that performance was 1/5 of what we had seen previously and it looked like there was some invisible rate limiting going on. After accusing the network guy (@sjdix0n), I started to switch off VMs to see if there was any change to the throughput. As soon as all VMs were switched off, throughput returned to the expected Uplink saturation levels.

As soon as I powered on one of the nested ESXi hosts, performance dropped from 1.2GB/s to about 450MB/s; when I turned on the other nested host it dropped to about 250MB/s. In reverse, you can see that performance stepped back up to 1.2GB/s once both nested hosts were offline.

So…what’s going on here? The assumption we made was that Promiscuous Mode was causing saturation within the Distributed vSwitch, which caused the performance test to record much lower values because it couldn’t send out traffic at the expected speeds.

Looking at one of the nested hosts’ Network Performance graphs, you can see that it was pushing traffic on both vNICs roughly equal to the total throughput that the test was generating…the periods of no activity above were when we were starting and stopping the load tests.
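The same per-vNIC view can be pulled programmatically if you want to watch it during a test run. This is a hedged sketch using the net.transmitted.average counter (KBps); “nested-esxi-01” is a hypothetical name for one of the nested host VMs, and the connection details are placeholders.

```python
# Hedged sketch: per-vNIC transmit rate for a nested-ESXi VM via the
# net.transmitted.average counter (reported in KBps).
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcsa.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
nested = next(v for v in view.view if v.name == "nested-esxi-01")
view.Destroy()

perf = content.perfManager
counters = {"%s.%s.%s" % (c.groupInfo.key, c.nameInfo.key, c.rollupType): c.key
            for c in perf.perfCounter}

query = vim.PerformanceManager.QuerySpec(
    entity=nested,
    metricId=[vim.PerformanceManager.MetricId(
        counterId=counters["net.transmitted.average"], instance="*")],
    intervalId=20, maxSample=15)

for series in perf.QueryPerf(querySpec=[query])[0].value:
    print(series.id.instance, series.value)   # KBps per vNIC instance

Disconnect(si)
```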

In a very broad sense we understand how and why this is happening, but it would be handy to get an explanation as to the specifics of why a setup like this can have such a huge effect on overall performance.

UPDATE WITH EXPLANATION: After posting this initially I reached out to William Lam (@lamw) via Twitter and he was kind enough to post an article explaining why Promiscuous Mode and Forged Transmits are required in Nested ESXi environments.

http://www.virtuallyghetto.com/2013/11/why-is-promiscuous-mode-forged.html

Word to the wise about having Nested Hosts…it can impact your environment in ways you may not expect! Is it any wonder why nested hosts are not officially supported?? 🙂