
vSphere 6.5 – What's in it for Service Providers Part 1

Last week, after an extended period of development and beta testing, VMware released vSphere 6.5. This is a lot more than a point release and is a major upgrade from vSphere 6.0. In fact, there is so much packed into this new release that there is an official whitepaper listing all the features and enhancements, linked from the release notes. I thought I would go through some of the key features and enhancements included in the latest versions of vCenter and ESXi and, as per usual, focus on the improvements that relate back to the Service Providers that use vSphere as the foundation of their Managed or Infrastructure as a Service offerings.

Generally the "what's new" would fit into one post, however having gotten through just the vCenter features it became apparent that this would have to be a multi-post series…this is great news for vCloud Air Network Service Providers out there, as it means there is a lot packed in for IaaS and MSPs to take advantage of.

With that, this post will cover the following:

  • vCenter 6.5 New Features
  • vCD and NSX Compatibility
  • Current Known Issues

vCenter 6.5 New Features:

Without question the enhancements to the VCSA stand out as one of the biggest features of 6.5. As mentioned in the whitepaper, the installer process has been overhauled and is a much smoother, more streamlined experience than with previous versions. It's also supported across more operating systems, and the 6.5 VCSA now surpasses the Windows version by offering the migration tool, native high availability and built-in backup and restore. One interesting side note to the new VCSA is that the HTML5 vSphere Client has shipped, though it's still very much a work in progress, with a lot of unsupported functionality mentioned in the release notes…there is plenty of work to do to bring it up to parity with the Flex Web Client.

In terms of the inbuilt PostgreSQL database, I think it's time that Service Providers feel confident in making the switch away from MSSQL (which was the norm with Windows based vCenters), as the enhanced VCSA Management Interface (found on port 5480) has a new monitoring screen showing information relating to disk space usage and also provides a way to gracefully start and stop the database engine.
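If you prefer poking at the appliance from the shell rather than the VAMI, the same start/stop control is available there too. A minimal sketch, assuming SSH access to the VCSA and that the embedded database still runs as the vmware-vpostgres service (the service name is my assumption):

  # List the state of all appliance services, including the embedded database
  service-control --status --all

  # Gracefully stop and start the embedded PostgreSQL engine
  # (assumes the service is named vmware-vpostgres)
  service-control --stop vmware-vpostgres
  service-control --start vmware-vpostgres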

Another vCenter enhancement that Service Providers will make use of is the native high availability feature, which is something a lot of people have been asking for for a long time. For me, I always dealt with the lack of HA by accepting that vCenter might become unavailable for 5-10 minutes during maintenance, or at worst suffer an extended outage while recovering from a VM or OS level failure. Knowing that hosts and VMs keep working and responding with vCenter down, leaving only core management functionality unavailable, it was a risk myself and others were willing to take. However, in this day of the always-on datacenter it's expected that management functionality be as available as the IaaS services themselves…so this HA feature is very welcome for Service Providers.

This native HA solution is available exclusively for the VCSA and the solution consists of active, passive, and witness nodes that are cloned from the existing vCenter Server instance. The HA cluster can be enabled, disabled, or destroyed at any time. There is also a maintenance mode that prevents planned maintenance from causing an unwanted failover.

The VCSA Migration Tool that was previously released in 6.0 Update 2m ships on the VCSA ISO and can be used to migrate from Windows based 5.5 vCenters to the 6.5 VCSA. Again, this is something that more and more Service Providers will take advantage of as the reliance on Windows based vCenters and MSSQL becomes increasingly unwanted from a manageability and cost point of view. Throw in the enhanced features that have only been released for the VCSA and this is a migration that all Service Providers should be planning.

To complete the move away from any Windows based dependencies, vSphere Update Manager has also been fully integrated into the VCSA. VUM is now fully integrated into the Web Client UI and is enabled by default. For larger environments with a large number of hosts, Auto Deploy is now fully manageable from the VCSA UI and doesn't require PowerCLI to manage or configure its options. There is a new Image Builder included in the UI that can hit local or public repositories to pull images or drivers, and there are performance enhancements when deploying ESXi images to hosts.
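For a sense of what moving this into the UI saves, below is a rough sketch of the kind of Image Builder work that previously had to be scripted in PowerCLI; the depot path, profile names and package name are all placeholders:

  # Add an offline bundle as a software depot (path is a placeholder)
  Add-EsxSoftwareDepot C:\Depots\ESXi-offline-bundle.zip

  # Clone an existing image profile and inject an extra driver VIB (names are placeholders)
  New-EsxImageProfile -CloneProfile "ESXi-standard" -Name "ESXi-custom" -Vendor "Lab"
  Add-EsxSoftwarePackage -ImageProfile "ESXi-custom" -SoftwarePackage "net-driver-example"

  # Export the customised profile as an installable ISO
  Export-EsxImageProfile -ImageProfile "ESXi-custom" -ExportToIso -FilePath C:\ISO\ESXi-custom.iso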

vCD and NSX Compatibility:

Shifting from new features and enhancements to an important subject when talking service provider platforms…VMware product compatibility. vCAN Service Providers running a Hybrid Cloud should be running a combination of vCloud Director SP and/or NSX-v, and at the moment there is no support for either with vSphere 6.5. No compatible versions of NSX are available for vSphere 6.5. If you attempt to prepare your vSphere 6.5 hosts with NSX 6.2.x, you receive an error message and cannot proceed.
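If you're not sure whether a host already carries NSX host components, a quick check from the ESXi shell is to list the installed VIBs and filter for the NSX ones; the esx-v* names below are what NSX 6.2.x has typically used, so treat them as an assumption:

  # List installed VIBs and filter for NSX host components
  # (NSX 6.2.x VIBs are typically named esx-vxlan, esx-vsip and esx-dvfilter-switch-security)
  esxcli software vib list | grep esx-v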

I haven't tested whether vCloud Director SP will connect to and interact with vCenter 6.5 or ESXi 6.5, however as it's not supported I wouldn't suggest upgrading production IaaS platforms until the interoperability matrices are updated.

At this stage there is no word on when either product will support vSphere 6.5, but I suspect we will see NSX-v come out with a supported build shortly…though I'm expecting vCloud Director SP not to support 6.5 until the next major version release, which is looking like the new year.

Installation and Upgrade Known Issues:

Having read through the release notes, there are also a number of known issues you should be aware of. I've gone through them and pulled out the ones I consider most likely to impact IaaS platforms.

  • After upgrading to vCenter Server 6.5, the ESXi hosts in High Availability clusters appear as Not Ready in the VMware NSX UI
    If your vSphere environment includes NSX and clusters configured with vSphere High Availability, after you upgrade to vCenter Server 6.5, both NSX and vSphere High Availability start installing VIBs on all hosts in the clusters. This might cause installation of NSX VIBs on some hosts to fail, and you see the hosts as Not Ready in the NSX UI.
    Workaround: Use the NSX UI to reinstall the VIBs.
  • Error 400 during attempt to log in to vCenter Server from the vSphere Web Client
    You log in to vCenter Server from the vSphere Web Client and log out. If, after 8 hours or more, you attempt to log in from the same browser tab, the following error results.
    400 An Error occurred from SSO. urn:oasis:names:tc:SAML:2.0:status:Requester, sub status:null
    Workaround: Close the browser or the browser tab and log in again.
  • Using storage rescan in environments with a large number of LUNs might cause unpredictable problems
    Storage rescan is an IO intensive operation. If you run it while performing other datastore management operations, such as creating or extending a datastore, you might experience delays and other problems. Problems are likely to occur in environments with a large number of LUNs, up to the 1024 that are supported in the vSphere 6.5 release.
    Workaround: Typically, the storage rescans that your hosts periodically perform are sufficient. You are not required to rescan storage when you perform general datastore management tasks. Run storage rescans only when absolutely necessary, especially when your deployments include a large set of LUNs.
  • In vSphere 6.5, the name assigned to the iSCSI software adapter is different from the earlier releases
    After you upgrade to the vSphere 6.5 release, the name of the existing software iSCSI adapter, vmhbaXX, changes. This change affects any scripts that use hard-coded values for the name of the adapter. Because VMware does not guarantee that the adapter name remains the same across releases, you should not hard code the name in the scripts (see the sketch after this list for one way to look the name up at runtime). The name change does not affect the behavior of the iSCSI software adapter.
    Workaround: None.
  • The bnx2x inbox driver that supports the QLogic NetXtreme II Network/iSCSI/FCoE adapter might cause problems in your ESXi environment
    Problems and errors occur when you disable or enable VMkernel ports and change the failover order of NICs for your iSCSI network setup.
    Workaround: Replace the bnx2x driver with an asynchronous driver. For information, see the VMware Web site.
  • When you use the Dell lsi_mr3 driver version 6.903.85.00-1OEM.600.0.0.2768847, you might encounter errors
    If you use the Dell lsi_mr3 asynchronous driver version 6.903.85.00-1OEM.600.0.0.2768847, the VMkernel logs might display the following message: ScsiCore: 1806: Invalid sense buffer.
    Workaround: Replace the driver with the vSphere 6.5 inbox driver or an asynchronous driver from Broadcom.
  • Storage I/O Control settings are not honored per VMDK
    Storage I/O Control settings are not honored on a per VMDK basis. The VMDK settings are honored at the virtual machine level.
    Workaround: None.
  • Cannot create or clone a virtual machine on a SDRS-disabled datastore cluster
    This issue occurs when you select a datastore that is part of a SDRS-disabled datastore cluster in any of the New Virtual Machine, Clone Virtual Machine (to virtual machine or to template), or Deploy From Template wizards. When you arrive at the Ready to Complete page and click Finish, the wizard remains open and nothing appears to happen. The Datastore value status for the virtual machine might display "Getting data…" and does not change.
    Workaround: Use the vSphere Web Client for placing virtual machines on SDRS-disabled datastore clusters.
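On the iSCSI adapter renaming issue above, rather than hard-coding vmhbaXX it's safer to resolve the software adapter name at runtime. A small PowerCLI sketch (the host name is a placeholder):

  # Resolve the software iSCSI adapter name instead of hard-coding vmhbaXX
  $vmhost = Get-VMHost -Name "esxi01.lab.local"
  $swIscsi = Get-VMHostHba -VMHost $vmhost -Type IScsi | Where-Object { $_.Model -match "Software" }
  $swIscsi.Device   # e.g. vmhba64 - use this value in the rest of the script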

These are just a few that I have singled out…it's worth reading through all the known issues just in case there are any specific issues that might impact you.

In the next post in this vSphere 6.5 for Service Providers series I will cover more vCenter features as well as ESXi enhancements and what's new in Core Storage.

References:

http://pubs.vmware.com/Release_Notes/en/vsphere/65/vsphere-esxi-vcenter-server-65-release-notes.html

http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/whitepaper/vsphere/vmw-white-paper-vsphr-whats-new-6-5.pdf

http://pubs.vmware.com/Release_Notes/en/vsphere/65/vsphere-client-65-html5-functionality-support.html

vSphere 6 Update 2 – What's In It for Service Providers

It's been just over a week since VMware released vSphere 6 Update 2 and I thought I would go through some of the key features and fixes that are included in the latest versions of vCenter and ESXi. As usual I generally keep an eye out for improvements that relate back to Service Providers who use vSphere as the foundation of their Managed or Infrastructure as a Service offerings.

New Features:

Without question the biggest new feature is the release of VSAN 6.2. I've covered this release in previous blog posts, and when you upgrade to ESXi 6.0 Update 2 the VSAN 6.2 bits are present in the kernel. Some VSAN services are actually in play regardless of whether you have it configured or not…which is interesting. With the new pricing for VSAN through the vCAN program, Service Providers can now seriously think about deploying VSAN for their main IaaS platforms.

The addition of support for high speed Ethernet links is significant, not only because the addition of 25G and 50G link speeds means increased throughput for converged network cards, allowing more network traffic to flow through hosts and switches for Fault Tolerance, vMotion, Storage vMotion and storage traffic, but also because it allows SPs to think about building Edge Clusters for networking services such as NSX, with line speeds that can take advantage of even faster backends.

From a manageability point of view the Host Client HTML5 user interface is a welcome addition and hopefully paves the way for more HTML5 management goodness from VMware, not only for hosts…but also for vCenter itself. There is a fair bit of power already in the Host Client and I can bet that admins will start to use it more and more as it continues to evolve.
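If you want to confirm which Host Client build a host is actually running, the UI is delivered as a VIB and served at https://<host>/ui; assuming the VIB keeps its esx-ui name, a quick check from the shell is:

  # The HTML5 Host Client ships as the esx-ui VIB
  esxcli software vib list | grep esx-ui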

For vCenter, the addition of two-factor authentication using RSA or smart card technology is an important feature for SPs to use if they are considering any sort of certification for their services. For example, many government certifications such as IRAP require it.

Resolved Issues:

There are a bunch of resolved issues in this build and I’ve gone through the rather extensive list to pull out the biggest fixes that relate to my experience in service provider operations.

vCenter:

  • Upgrading vCenter Server from 5.5 Update 3b to 6.0 Update 1b might fail if SSLv3 is disabled on port 7444 of vCenter Server 5.5 Update 3b. An upgrade from vCenter Server 5.5 Update 3b to 6.0 Update 2 works fine if SSLv3 is disabled by default on port 7444 of vCenter Server 5.5 Update 3b.
  • Deploying a vApp on vCloud Director through the vApp template fails with a Profile-Driven storage error. When you refresh the storage policy, an error message similar to the following is displayed: The entity vCenter Server is busy completing an operation.
  • Delta disk names of the source VM are retained in the disk names of the cloned VM. When you create a hot clone of a VM that has one or more snapshots, the delta disk names of the source VM are retained in the cloned VM.
  • vCenter Server service (vpxd) might fail during a virtual machine power on operation in a Distributed Resource Scheduler (DRS) cluster.

ESXi:

  • Hostd might stop responding when you execute esxcli commands using PowerCLI resulting in memory leaks and memory consumption exceeding the hard limit.
  • ESXi mClock I/O scheduler does not work as expected. The ESXi mClock I/O scheduler does not limit the I/Os with a lesser load even after you change the IOPS of the VM using the vSphere Client.
  • After you upgrade Virtual SAN environment to ESXi 6.0 Update 1b, the vCenter Server reports a false warning similar to the following in the Summary tab in the vSphere Web Client and the ESXi host shows a notification triangle
  • Attempts to perform vMotion might fail after you upgrade from ESXi 5.0 or 5.1 to 6.0 Update 1 or later releases. An error message similar to the following is written to the vmware.log file.
  • Virtual machine performance metrics are not displayed correctly as the performance counter cpu.system.summation for a VM is always displayed as 0
  • Attempts to perform vMotion with ESXi 6.0 virtual machines that have two 2 TB virtual disks created on ESXi 5.0 fail with error messages similar to the following logged in the vpxd.log file: 2015-09-28T10:00:28.721+09:00 info vpxd[xxxxx] [Originator@6876 sub=vpxLro opID=xxxxxxxx-xxxxxxxx-xx] [VpxLRO] — BEGIN task-919 — vm-281 — vim.VirtualMachine.relocate — xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx(xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

The mClock fix highlighted above is a significant one for those who were looking to use IOPS limiting. It's basically been broken since 5.5 Update 2 and also impacts/changes the way in which IOPS are interpreted through the VM-to-storage stack. For service providers looking to introduce IOPS limits to control the impact of noisy neighbours, the fix is welcomed.
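For anyone revisiting IOPS limits now the scheduler behaves itself, per-disk limits can be applied from PowerCLI along these lines; the VM name and the 250 IOPS value are purely illustrative:

  # Apply a 250 IOPS limit to each virtual disk on a VM (values are illustrative)
  $vm = Get-VM -Name "tenant-vm01"
  foreach ($disk in Get-HardDisk -VM $vm) {
      Get-VMResourceConfiguration -VM $vm |
          Set-VMResourceConfiguration -Disk $disk -DiskLimitIOPerSecond 250
  }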

As usual there are still a lot of known issues, and some have been added or updated in the release notes since release date. Overall the early noise coming out from the community is that this Update 2 release is relatively solid, and there have been improvements in network performance and general overall stability. Hopefully we don't see a repeat of the 5.5 Update 2 issues or the more recent bug problems that have plagued previous releases…and hopefully not more CBT issues!

vSphere 6.0 Update 2 has a lot of goodness for Service Providers and continues to offer the number one virtualization platform on which to build managed and hosted services. Go grab it now and put it through its paces before pushing to production!

References:

http://pubs.vmware.com/Release_Notes/en/vsphere/60/vsphere-esxi-60u2-release-notes.html

http://pubs.vmware.com/Release_Notes/en/vsphere/60/vsphere-vcenter-server-60u2-release-notes.html

http://pubs.vmware.com/Release_Notes/en/vsan/62/vmware-virtual-san-62-release-notes.html

http://pubs.vmware.com/Release_Notes/en/vsphere/60/vmware-host-client-10-release-notes.html

Heads Up: Heavy VXLAN Traffic Causing Broadcom 10GB NICs to Drop

For the last couple of weeks we have had some intermittent issues whereby ESXi network adapters have gone into a disconnected state, requiring a host reboot to bring the link back online. Generally it was only one NIC at a time, but in some circumstances both NICs went offline, resulting in host failure and VM HA events being triggered. From the console ESXi appeared to be up, but each NIC was listed as disconnected, and when we checked the switch ports there was no indication of a loss of link.

In the vmkernel logs the following entries are observed:

After some time working with VMware Support, our Ops Engineer @santinidaniel came across this VMware KB which described the situation we were seeing. Interestingly enough we only saw this happening after recent host updates to ESXi 5.5 Update 3 builds, but as the issue is listed as being present in ESXi 5.0, 5.5 and 6.0 that might just be a side note.

The cause as listed in the KB is:

This issue occurs when the guest virtual machine sends invalid metadata for TSO packets. The packet length is less than Maximum Segment Size (MSS), but the TSO bit is set. This causes the adapter and driver to go into a non-operational state.

Note: This issue occurs only with VXLAN configured and when there is heavy VXLAN traffic.

It just so happened that we did indeed have a large customer with heavily used Citrix Terminal Servers using our NSX Advanced Networking…and they were sitting on a VXLAN virtual wire. The symptoms got worse today, which coincided with the first official day of work for the new year.

There is a simple workaround:

That command has been described in blog posts relating to the Broadcom drivers (which now present as QLogic drivers), and where previously there was no resolution, there is now a fix in place by upgrading to the latest drivers here. Without upgrading to the latest certified drivers, the quickest way to avoid the issue is to apply the workaround and reboot the host.
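Before and after applying the updated driver it's worth confirming what the hosts are actually running; a quick check from the ESXi shell (vmnic numbers will obviously vary):

  # Check the installed bnx2x driver VIB version
  esxcli software vib list | grep bnx2x

  # Confirm which driver (and version) a given NIC is using
  esxcli network nic get -n vmnic0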

There has been a recent outcry bemoaning the lack of QA in some of VMware's latest releases, but the reality is the more bits you add, the more likelihood there is for issues to pop up…this is becoming more the case with ESXi as the base virtualization platform continues to add to its feature set, which now includes VSAN baked in. Host extensions further add to the chance of things going wrong due to situations that are hard to cover as part of the QA process.

Deal, fix…and move on!

References:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2114957

https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI55-QLOGIC-BNX2X-271250V556&productId=353


vSphere 5.5 Update 3 Released: Features and Top Fixes

vSphere 5.5 Update 3 was released earlier today and there are a bunch of bug fixes and feature improvements in this update release for both vCenter and ESXi. For most Service Providers updating to vSphere 6.0 is still a while away so it’s good to have continued support and improvement for the 5.5 platform. I’ve scanned through the release notes and picked out what I consider some of the more important bug fixes and resolved issues as they pertain to my deployments of vSphere.

Note: It still appears that there is no resolution to the vMotion errors I reported on earlier in the year, or to the bugs around the mClock scheduler and IOPS limits on NFS.

ESXi 5.5 Update 3:

  • Status of some disks might be displayed as UNCONFIGURED GOOD instead of ONLINE
    Status of some disks on an ESXi 5.5 host might be displayed as UNCONFIGURED GOOD instead of ONLINE. This issue occurs for LSI controllers using the LSI CIM provider.
  • Cloning CBT-enabled virtual machine templates from ESXi hosts might fail
    Attempts to clone CBT-enabled virtual machine templates simultaneously from two different ESXi 5.5 hosts might fail. An error message similar to the following is displayed:
    Failed to open 'VM_template.vmdk': Could not open/create change tracking file (2108).
  • ESXi hosts with virtual machines using the e1000 or e1000e vNIC driver might fail with a purple screen
    ESXi hosts with virtual machines using the e1000 or e1000e vNIC driver might fail with a purple screen when you enable TCP Segmentation Offload (TSO). Error messages with a VMkernel backtrace are written to the log files.
  • Attempts to reboot Windows 8 and Windows Server 2012 virtual machines on ESXi hosts might fail
    After you reboot, Windows 8 and Windows Server 2012 virtual machines might become unresponsive when the Microsoft Windows boot splash screen appears. For more information, refer to Knowledge Base article 2092807.
  • Attempts to install or upgrade VMware Tools on a Solaris 10 Update 3 virtual machine might fail
    Attempts to install or upgrade VMware Tools on a Solaris 10 Update 3 virtual machine might fail with the following error message:
    Detected X version 6.9
    Could not read /usr/lib/vmware-tools/configurator/XOrg/7.0/vmwlegacy_drv.so Execution aborted.
    This issue occurs if the vmware-config-tools.pl script copies the vmwlegacy_drv.so file, which should not be used in Xorg 6.9.

In going through the remaining known issues you come across a lot of Flash Read Cache related problems…maybe VMware should call it a day with this feature…not sure if anyone has the balls to actually use it in production…I'd be interested to hear? There are also a lot of VSAN issues still reported as known with workarounds in place…all the more reason to start a VSAN journey with vSphere 6.0.

For a look at what’s new and for the release notes in full…click on the links below:

VMware ESXi™ 5.5 Update 3 | 16 SEP 2015 | Build 3029944

vCenter Server 5.5 Update 3 | 16 SEP 2015 | Build 3000241

Leap Second Bug: Worth a Double Check…

In 2008 I vividly remember the impact that leap years/days/seconds can have on systems that are not prepared to handle the changes in time or date. It was the 29th of February and at the time I was working for a Service Provider offering Hosted Exchange services based on Exchange 2007. All of a sudden my provisioning scripts stopped working and we could not add, remove or modify Exchange mailboxes.

After a day of frustration working with MS Support and dreading a full system rebuild, the problem seemed to disappear the following day…the 1st of March. At the end of the day, and after a couple of days of Microsoft scratching their heads, the Exchange engineering team realised that they hadn't allowed for the leap day somewhere deep in the bowels of their code, which resulted in all account modifications failing during the 24 hours of the leap day.

Fast forward to 2015, and as the Earth's rotation continues to slow we have a situation where system administrators and operations teams need to be aware of another out-of-the-norm situation that could affect systems and platforms. This time it's due to a leap second adjustment, which is scheduled for the 30th of June 2015 at 23:59:60 UTC and may cause issues for devices and operating systems that are NTP synchronised. Older Linux kernels seem to be the most affected by the leap second, with most vendors releasing KB articles regarding the impact and how to work around it.

While this is not something that will bring down the internet it’s still something that all infrastructure IT professionals should be aware of and be double checking all systems to ensure there are no embarrassing time related incidents come the 30th of June.

ESXi and Other VMware Products:

As per this KB, ESXi is not impacted by the leap second bug…but other appliance-based solutions (mostly SUSE based) look to require enabling slew mode for NTP.

ESX/ESXi utilizes the RFC-1589 clock model, appropriately handling leap seconds.

It is not necessary to enable Slew Mode for NTP in ESX/ESXi’s NTP client, or to otherwise work around leap seconds by disabling and re-enabling the NTP client before and after the leap second’s occurrence. For more information, see Enabling Slew Mode for NTP (2121016).

However, while ESX/ESXi server is not expected to experience negative impact from a leap second taking place, it remains possible for Guest Operating Systems and/or running applications to experience an impact, independent of ESX/ESXi, if it is not designed to handle one. VMware recommends customers to test their complete solutions.

This KB lists all the affected platforms and the suggested fixes for them. For vCloud SPs running vCloud Director…as most cells run on Red Hat Enterprise Linux there should be no impact, however it's worth double checking, as time skew is the number one enemy of vCloud Director IaaS platforms.
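For the Linux guests and appliances you do manage, it's easy enough to sanity-check how NTP will handle the event. A rough sketch for ntpd-based systems (paths and flags vary by distro, so treat these as examples):

  # Show the NTP leap indicator (leap=01 means a leap second insertion is pending)
  ntpq -crv | grep leap

  # Slew mode is normally enabled by starting ntpd with the -x flag,
  # typically via the OPTIONS line in /etc/sysconfig/ntpd on RHEL-style systems
  grep -- '-x' /etc/sysconfig/ntpd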
Service Providers:
While most Cloud providers don't manage client operating systems directly, it would be a good move to put out some form of advisory so that clients protect their VMs before the leap second hits…if not, there could be a lot of angry service desk calls relating to increased and unexplained CPU usage, application slowdowns, application crashes, and failures on startup.


Quick Post: One Stop Shop for ESXi Driver Downloads

Today I needed to update an Emulex NIC driver for a new host that I installed using the VMware ESXi 5.5 Update 2 base image, which meant chasing up the latest OEM update bundle for the elxnet driver. Generally sourcing these driver bundles can be a bit of a pain, but I remembered a conversation I had last week with @dmanconi where he gave me a hot tip on a location "hidden in plain sight" where you have access to all the latest driver bundle updates for what appears to be the most common network and storage adapters for ESXi.

This is located under the Horizon View 5.x Download Page on the MyVMware Website under the Drivers & Tools Tab.

https://my.vmware.com/web/vmware/info/slug/desktop_end_user_computing/vmware_horizon_view/5_3#drivers_tools

As you can see above, the release dates for the drivers are as recent as a couple of days ago and the list is extensive. As usual the recommendation, where possible, is to use the specific vendor-released drivers for your production systems, but otherwise…they are all here at your fingertips!
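Once you've grabbed the right offline bundle, the install itself is a one-liner from the host shell followed by a reboot; the datastore path and bundle name below are placeholders:

  # Install/update a driver from an offline bundle, then reboot the host
  esxcli software vib update -d /vmfs/volumes/datastore1/elxnet-driver-offline-bundle.zip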

Enjoy.

January vCenter and ESXi 5.5 Patch Releases: Critical CBT Bug Fixed

VMware released new builds of vCenter and ESXi 5.5 today. The builds contain mostly bug fixes, but I wanted to point out one fix that affects those who use CBT as part of their VM backup strategy. Veeam users were initially affected by the bug…and though Veeam released a workaround in subsequent Veeam 8 builds, this ESXi patch officially fixes the issue.

When you use backup software that uses the Virtual Disk Development Kit (VDDK) API call QueryChangedDiskAreas(), the list of allocated disk sectors returned might be incorrect and incremental backups might appear to be corrupt or missing. A message similar to the following is written to vmware.log:

DISKLIB-CTK: Resized change tracking block size from XXX to YYY

For more information, see KB 2090639.
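If you want to see which VMs are actually relying on CBT before you patch, it's exposed on the VM config and easy to pull with PowerCLI; a small sketch:

  # List VMs that have Changed Block Tracking enabled
  Get-VM | Where-Object { $_.ExtensionData.Config.ChangeTrackingEnabled } | Select-Object Name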

There are also a couple of other interesting fixes in the build:

  • Storage vMotion of a thin provisioned virtual machine disk (VMDK) takes longer than that of a Lazy Zeroed Thick (LZT) disk.
  • When a quiesced snapshot is created on a Windows Server 2003, Windows Server 2003 R2, Windows Server 2008, Windows Server 2008 R2, or a Windows Server 2012 virtual machine, duplicate disks might be created in the virtual machine.
  • On an ESXi 5.5 host, the NFS volumes might not restore after the host reboots. This issue occurs if there is a delay in resolving the NFS server host name to IP address.

For those who back up VMs with disks larger than 128GB, I would be looking to deploy this patch ASAP.

ESXi 5.5: IOPS Limit and mClock scheduler

It's fair to say that it's not very often the difference between a 1 and a 0 can have such a massive impact on performance…I had read a couple of posts from Duncan Epping regarding a new disk I/O scheduler introduced in ESXi 5.5 and what it meant for VM disk limits compared to how things were done in previous 5.x versions of ESXi.

As mentioned in the posts linked above, the ESXi host advanced setting Disk.SchedulerWithReservation is activated by default in ESXi 5.5, so without warning, in an upgrade scenario where IOPS limits are used, the rules change…and from what I found out last night the results are not pretty.

The graphic above shows the total latency of a VM (on an NFS datastore with an IOPS limit set) running firstly on ESXi 5.1, up until the first arrow, when the VM was vMotioned to an in-place upgraded host now running ESXi 5.5 Update 2. As soon as the VM lands on the 5.5 host latency skyrockets and remains high until there is another vMotion to another 5.5 host. Interestingly enough, the overall latency on the other host isn't as high as on the first, but it grows in a linear fashion.

The second arrow represents the removal of the IOPS limits, at which point latency falls to its lowest levels in the past 24 hours. The final arrow represents a test where the IOPS limit was reapplied and the advanced setting Disk.SchedulerWithReservation was set to 0 on the host.

To make sense of the above: it seems that the mClock scheduler being on caused the applied IOPS limit to generate artificial latency against the VM, rendering it pretty much useless. The reasons for this are unknown at this point, but reading Duncan's blog on the IOPS limit caveat it would seem that, because the new 5.5 mClock scheduler looks at application-level block sizes to work out the IO, the applied limit actually crippled the VM. In this case the limit was set to 250 IOPS, so I am wondering if there is some unexpected behaviour happening here even if larger block sizes are being used in the VM.

Suffice to say, it looks like the smart thing to do is set Disk.SchedulerWithReservation to 0 and revert back to the 5.0/5.1 behaviour with IOPS limits in place. If you want to do that in bulk, a PowerCLI one-liner along the lines of the sketch below will do the trick.
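A sketch of that bulk change (assuming an existing Connect-VIServer session):

  # Set Disk.SchedulerWithReservation to 0 on every host to revert to 5.0/5.1 scheduler behaviour
  Get-VMHost | ForEach-Object {
      Get-AdvancedSetting -Entity $_ -Name "Disk.SchedulerWithReservation" |
          Set-AdvancedSetting -Value 0 -Confirm:$false
  }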

One more interesting observation I made is that it appears VMs with IOPS limits on iSCSI datastores are not, or are less, affected…however most large NFS datastores with a large number of VMs were. You can see below what happens to datastore latency when I switched off the mClock scheduler…latency dropped instantly.

I'm not sure if this indicates more general NFS issues with ESXi 5.5…there seem to have been more than a few since Update 1 came out. I've reached out to see if this behaviour can be explained, so hopefully I can provide an update when that information comes to hand…again, I'm not too sure what the benefit of this change in behaviour is, so I'm hoping someone who has managed to digest the brain-hurting academic paper can explain why it was introduced.

In layman's terms…an IO is not an IO in the world of VM IOPS limits anymore.


ESXi 5.5 Update 2: vMotion Fails at 14% with Stale Admission Control and VM Reservations

We are currently in the process of upgrading all of our vCenter Clusters from ESXi 5.1 to 5.5 Update 2 and have come across a bug whereby the vMotion of VMs from the 5.1 hosts to the 5.5 hosts fails at 14% with the following error:

[UPDATE 6/10] –

Observed Conditions:

  • vCenter 5.5 U1/U2 (U2 resulted in fewer 14% stalls, but they still occur)
  • Mixed Cluster of ESXi 5.1 and 5.5 Hosts (U1 or U2)
  • Has been observed happening in fully upgraded 5.5 U2 Cluster
  • VMs have various vCPU and vRAM configuration
  • VMs have vRAM reservations rather than unlimited vRAM/vCPU
  • VMs are vCD Managed

Observed Workarounds:

  • Restart Management Agents on vMotion Destination Host (hit and miss; see the commands below this list)
  • vMotion VM to 5.1 Host if available
  • Remove vRAM Reservation and Change to Unlimited vCPU/vRAM
  • Stop and start VM on different host (not ideal)
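For the first workaround, restarting the management agents on the destination host can be done from the DCUI or straight from the ESXi shell:

  # Restart hostd and vpxa on the destination host
  /etc/init.d/hostd restart
  /etc/init.d/vpxa restart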

We are running vCenter 5.5 Update 1 with a number of clusters that were on ESXi 5.1, some of which act as Provider vDCs for vCloud Director. Upgrading the clusters which are not vCloud providers (meaning VMs aren't vCD managed and don't have vCD reservations applied) didn't result in the issue and we were able to upgrade all hosts to ESXi 5.5 Update 2 without issue.

There seemed to be no specific setting or configuration of the VMs that ultimately got stuck during a vMotion from a 5.1 to a 5.5 host, however they all have memory reservations of various sizes based on our vCloud Allocation Pool settings.

Looking through the hostd logs on the 5.5 host acting as the destination for the vMotion we see the following entry:

Some of the key entries were:

and
After trying to scour the web for guidance we came across this VMware KB which listed the error specifically…but it refers to VMs that are already powered off, not ones powered on and under vMotion…and in any case the suggested resolutions are not viable in our environment…namely the stopping and starting of VMs and the rejigging of memory reservation settings.

We engaged VMware Support and raised a Support Request…after a little time working with support it was discovered that this may be an internal bug which first appeared as fixed in ESXi 5.5 Express Patch 4 but appears to have slipped past the Update 2 release. The bug relates to stale Admission Control constraints, and the workaround suggested was to restart the management agents on the destination 5.5 host…in addition to that, if this is occurring in a vCloud Provider cluster it was suggested that the host be disabled from within vCloud Director and the Redeploy All VMs action triggered as shown below.

This effectively triggers a Maintenance Mode task in vCenter against the host, which to my mind is no different to triggering Maintenance Mode from vCenter directly…however the specific workaround in vCD environments was to use this method in conjunction with the destination host management agent restart.

Results have been mixed up to this point and we have faced VMs that simply won't vMotion at all, even after trying every combination stated above. Our own workaround has been to shuffle those VMs to other 5.1 hosts in the cluster and hope that they will vMotion to a 5.5 host as we roll through the rest of the upgrades. Interestingly enough we have also seen random behaviour where if, for example, 6 VMs are stuck at 14%, after an agent restart only 4 of them might still be affected, and after subsequent restarts it might only be 2…this just tells me that the bug is fairly hit and miss and needs some more explaining as to the specific circumstances and conditions that trigger it.

We are going to continue to work with VMware Support to try and get a more scientific workaround until the bug fix is released, and I will update this post with the outcomes of both of those action items when they become available…it's a rather inconvenient bug…but still, VM uptime is maintained and that's all-important.

If anyone has had similar issues feel free to comment:

ESXi 5.x NFS IOPS Limit Bug – Latency and Performance Hit

There is another NFS bug hidden in the latest ESXi 5.x releases…while not as severe as the 5.5 Update 1 NFS bug, it's been the cause of increased virtual disk latency and overall poor VM performance across a couple of the environments I manage.

The VMware KB article referencing the bug can be found here:

Symptoms

  • When virtual machines run on the same host and use the same NFS datastore and the IOPS limit is set on at least one virtual machine, you experience high virtual disks latency and low virtual disk performance.
  • Even when a different IOPS limit is specified for different virtual machines, IOPS limit on all virtual machines is set to the lowest value assigned to a virtual machine in the NFS datastore.
  • You do not experience the virtual disks latency and low virtual disk performance issues when virtual machines:
    • Reside on the VMFS datastore.
    • Run on the same ESXi host, but are placed on the NFS datastore in which none of the virtual machines have an IOPS limit set.
    • Run on the same ESXi host, but are placed on different NFS datastores that do not share the client connection
    • Run on different ESXi hosts, but are placed on the same NFS datastore

So in a nutshell, if you have a large NFS datastore with many VMs with IOPS limits to protect against noisy neighbours, you may experience what looks like unexplained VM latency and overall bad performance.
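To work out how exposed you are, it's worth enumerating which virtual disks on a given NFS datastore actually have an IOPS limit applied; a quick PowerCLI sketch (the datastore name is a placeholder):

  # Find virtual disks with an IOPS limit set on a specific NFS datastore
  Get-VM -Datastore (Get-Datastore -Name "NFS-DS01") | ForEach-Object {
      $vm = $_
      (Get-VMResourceConfiguration -VM $vm).DiskResourceConfiguration |
          Where-Object { $_.DiskLimitIOPerSecond -gt 0 } |
          Select-Object @{N="VM";E={$vm.Name}}, Key, DiskLimitIOPerSecond
  }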

Some proof of this bug can be seen in the screenshot below, where a VM residing on an NFS datastore with IOPS limits applied was exhibiting high disk command latency. It had an IOPS limit of 1000 and wasn't constrained by that setting…yet it had disk command latency in the 100s. The red arrow represents the point at which we migrated the VM to another host. Straight away the latency dropped and the VM returned to expected performance levels. This matches a couple of the symptoms above…

We also experimented by removing all disk IOPS limits on a subset of NFS datastores and looked at the effect that had on overall latency. The results were almost instant, as you can see below. The two peaks represent us removing and then adding back the IOPS limits.

As we are running ESXi 5.1 hosts we applied the latest patch release (ESXi510-201406001), which includes ESXi510-201404401-BG, the patch that addresses the bug in 5.1. After applying it we did see a noticeable drop in overall latency on the previously affected NFS datastores.
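If you need to confirm that a host has actually picked up that patch level, the build number is the quickest tell:

  # Confirm the running ESXi version and build after patching
  vmware -vl
  esxcli system version get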

Annoyingly there is no patch available for ESXi 5.5, but I have been told by VMware Support that a fix is due as of Update 2 for 5.5…no time frame on that release though.

One thing I'm interested in comments on is VM virtual disk IOPS limits… They're designed to lessen the impact of noisy neighbours, but what overall effect do they have on LUN-based latency? Or is the latency self-contained to the VM that's restricted? I assume it works differently to SIOC and doesn't choke the disk queue depth? Does the IOPS limit simply put the brakes on virtual disk IO?
