Tag Archives: esxi

ESXi 6.5 Storage Performance Issues and Fix

[NOTE]: I decided to republish this post with a new heading and skip right to the meat of the issue, as I’ve had a lot of people reach out saying that it helped them with their performance issues on ESXi 6.5. Hopefully people can find the content more easily and have a fix in place sooner.

The issue I came across was with storage performance and the native driver that comes bundled with ESXi 6.5. With the release of vSphere 6.5 yesterday, the timing was perfect to install ESXi 6.5 and start building my management VMs. I first noticed something was wrong when uploading the Windows 2016 ISO to the datastore, with the ISO taking about 30 minutes to upload. From there I created a new VM and installed Windows…this took about two hours to complete, which was nowhere near what I expected…especially with the datastore being a decent class of SSD.

I created a new VM and kicked off a new install, but this time I opened ESXTOP to see what was going on. As you can see from the screenshots below, the kernel and disk write latencies were off the charts, topping 2000ms and 700-1000ms respectively…in throughput terms I was getting about 10-20MB/s when I should have been getting 400-500MB/s.

ESXTOP was showing the VM with even worse write latency.
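If you want to run the same kind of check, a quick way (assuming SSH or the ESXi Shell is enabled on the host) is to launch esxtop and flick between the disk views:

    # From an SSH session or the ESXi Shell
    esxtop
    #   'd' = disk adapter view, 'u' = disk device view, 'v' = per-VM disk view
    #   DAVG/cmd = device latency, KAVG/cmd = kernel latency, GAVG/cmd = total guest latency (ms)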

I wondered whether I had bought a lemon of a storage controller and checked the queue depth of the card. It’s listed with a QD of 31, which isn’t horrible for a homelab, so my attention turned to the driver. Referencing the VMware Compatibility Guide again, the listed device driver for the controller is ahci version 3.0.22vmw.

I searched the installed device driver modules and found that the driver listed above was present; however, there was also a native VMware device driver.

I confirmed that the storage controller was using the native VMware driver and went about disabling it as per this VMware KB (thanks to @fbuechsel, who pointed me in the right direction in the vExpert Slack homelab channel), as shown below.
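For reference, this is roughly what that process looks like from the command line. It’s a sketch based on the KB; the native module name vmw_ahci is an assumption for this controller, so confirm it against your own output before disabling anything:

    # List storage adapters and the driver each vmhba is currently using
    esxcli storage core adapter list

    # Show the AHCI-related driver modules present on the host
    esxcli system module list | grep -i ahci

    # Disable the native module so the host falls back to the legacy ahci driver,
    # then reboot for the change to take effect
    esxcli system module set --enabled=false --module=vmw_ahci
    reboot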

After the host rebooted I checked to see if the storage controller was using the device driver listed in the compatibility guide. As you can see below not only was it using that driver, but it was now showing the six HBA ports as opposed to just the one seen in the first snippet above.

I once again created a new VM and installed Windows, and this time the install completed in a little under five minutes! Quite a difference! Running CrystalDiskMark, I was now getting the expected speeds from the SSDs and things are moving along quite nicely.

Hopefully this post saves anyone else who might buy this, or other SuperMicro SuperServers, some time and helps them avoid getting caught out by poor storage performance caused by the native VMware driver packaged with ESXi 6.5.


References:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2044993

Released: vCenter and ESXi 6.0 Update 3 – What’s in It for Service Providers

Last month I wrote a blog post on upgrading vCenter 5.5 to 6.0 Update 2, and during the course of writing it I ran a survey on which version of vSphere most people were seeing out in the wild…overwhelmingly, vSphere 6.0 was the most popular version, with 5.5 second and 6.5 lagging in adoption for the moment. It’s safe to assume that vCenter 6.0 and ESXi 6.0 will be common deployments for some time in brownfield sites, so with the release of Update 3 for vCenter and ESXi I thought it would be good to again highlight some of the best features and enhancements, as I see them, from a Service Provider point of view.

vCenter 6.0 Update 3 (Build 5112506)

This is actually the eighth build release of vCenter 6.0 and includes updated TLS support for v1.0, 1.1 and 1.2, which is worth a look in terms of what it means for other VMware products, as it could impact connectivity…I know that vCloud Director SP now expects TLS v1.1 by default, as an example. Other things listed in the What’s New include support for MSSQL 2012 SP3, updated M2VCSA support, time zone updates and some changes to the resource allocation for the Platform Services Controller.

Looking through the Resolved Issues, there are a number of networking-related fixes in this release, plus fixes for a few annoying problems relating to vMotion. The ones below are the main items that could impact Service Provider operations.

  • Upgrading vCenter Server from version 6.0.0b to 6.0.x might fail.
    Attempts to upgrade vCenter Server from version 6.0.0b to 6.0.x might fail. This issue occurs while the services are starting, and an error message similar to the following is displayed in the run-updateboot-scripts.log file:
    “Installation of component VCSServiceManager failed with error code ‘1603’”
  • Managing legacy ESXi from a vCenter Server with TLSv1.0 disabled is impacted.
    vCenter Server with TLSv1.0 disabled supports management of legacy ESXi versions 5.5.x and 6.0.x. ESXi 5.5 P08 and ESXi 6.0 P02 onwards are supported for 5.5.x and 6.0.x respectively.
  • x-VC operations involving legacy ESXi 5.5 hosts now succeed.
    Cold relocate and clone operations have been implicitly allowed for ESXi 5.5 hosts.
  • Unable to use the End VMware Tools install option in the vSphere Client.
    The End VMware Tools install option could not be used while installing VMware Tools with the vSphere Client. This issue occurs after upgrading to vCenter Server 6.0 Update 1.
  • Enhanced vMotion fails to move the vApp.VmConfigInfo property to the destination vCenter Server.
    Enhanced vMotion fails to move the vApp.VmConfigInfo property to the destination vCenter Server, although the virtual machine migration itself is successful.
  • Storage vMotion fails if the VM has a CD ISO file connected.
    If the VM has a CD/DVD drive connected to an ISO file, Storage vMotion fails.
  • Unregistering an extension does not delete agencies created by a solution plug-in.
    The agencies or agents created by a solution such as NSX, or any other solution that uses EAM, are not deleted from the database when the solution is unregistered as an extension in vCenter Server.

ESXi 6.0 Update 3 (Build 5050593)

The What’s New in ESXi is a lot more exciting than what’s new with vCenter, highlighted by a new Host Client and fairly significant improvements in vSAN performance, along with TLS changes similar to those included in vCenter Update 3. With regards to the Host Client, the version is now 1.14.0 and includes bug fixes that bring it closer to the functionality provided by the vSphere Client. It’s also worth mentioning that new versions of the Host Client continue to be released through the VMware Labs Flings site, but those versions are not officially supported and not recommended for production environments.

For vSAN, multiple fixes have been introduced to optimize the I/O path for improved performance in All-Flash and Hybrid configurations, and there is a separate VMware KB that addresses the fixes here.

  • More logs, much less space. vSAN now has efficient log management strategies that allow more logging to be packed per byte of storage. This prevents the log from reaching its assigned limit too quickly and too frequently. It also gives vSAN enough time to process log entries before the limit is reached, avoiding unnecessary I/O operations.
  • Pre-emptive de-staging. vSAN has built-in algorithms that de-stage data on a periodic basis. The de-staging operations, coupled with efficient log management, significantly improve performance for large file deletes as well as for write-intensive workloads.
  • Checksum improvements. vSAN has several enhancements that make the checksum code path more efficient. These changes are expected to be especially beneficial on all-flash configurations, as there is no additional read cache lookup, and should provide significant performance benefits for both sequential and random workloads.

As with vCenter, I’ve gone through and picked out the most significant bug fixes as they relate to Service Providers. The first one listed below is important to think about, as it should significantly reduce the number of failures people have been seeing with ESXi installed on SD/flash cards, and not just in the VDI environments the release notes suggest.

  • High read load of VMware Tools ISO images might cause corruption of flash media.
    In a VDI environment, the high read load of the VMware Tools images can result in corruption of the flash media. You can now copy all the VMware Tools data into its own ramdisk, so the data is read from the flash media only once per boot. All other reads go to the ramdisk. vCenter Server Agent (vpxa) accesses this data through the /vmimages directory, which has symlinks that point to productLocker.
  • ESXi 6.x hosts stop responding after running for 85 days.
    When this problem occurs, corresponding entries are written to the /var/log/vmkernel log file.
  • ARP request packets might drop.
    ARP request packets between two VMs might be dropped if one VM is configured with guest VLAN tagging and the other VM is configured with virtual switch VLAN tagging, and VLAN offload is turned off on the VMs.
  • Physical switch flooded with RARP packets when using Citrix VDI PXE boot.
    When you boot a virtual machine for Citrix VDI, the physical switch is flooded with RARP packets (over 1000), which might cause network connections to drop and a momentary outage. This release provides an advanced option, /Net/NetSendRARPOnPortEnablement. You need to set the value of /Net/NetSendRARPOnPortEnablement to 0 to resolve this issue (see the sketch after this list for the esxcli syntax).
  • Snapshot creation task cancellation for Virtual Volumes might result in data loss
    Attempts to cancel snapshot creation for a VM whose VMDKs are on Virtual Volumes datastores might result in virtual disks not getting rolled back properly and consequent data loss. This situation occurs when a VM has multiple VMDKs with the same name and these come from different Virtual Volumes datastores.
  • VMDK does not roll back properly when snapshot creation fails for Virtual Volumes VMs
    When snapshot creation attempts for a Virtual Volumes VM fail, the VMDK is tied to an incorrect data Virtual Volume. The issue occurs only when the VMDK for the Virtual Volumes VM comes from multiple Virtual Volumes datastores.
  • ESXi host fails with a purple diagnostic screen due to path claiming conflicts
    An ESXi host displays a purple diagnostic screen when it encounters a device that is registered but whose paths are claimed by two multipathing plugins, for example EMC PowerPath and the Native Multipathing Plugin (NMP). This type of conflict occurs when a plugin claim rule fails to claim the path and NMP claims it by default. NMP then tries to register the device, but because the device is already registered by the other plugin, a race condition occurs and triggers an ESXi host failure.
  • ESXi host fails to rejoin VMware Virtual SAN cluster after a reboot
    Attempts to rejoin the VMware Virtual SAN cluster manually after a reboot might fail with the following error:
    Failed to join the host in VSAN cluster (Failed to start vsantraced (return code 2)
  • Virtual SAN Disk Rebalance task halts at 5% for more than 24 hours
    The Virtual SAN Health Service reports Virtual SAN Disk Balance warnings in the vSphere Web Client. When you click Rebalance disks, the task appears to halt at 5% for more than 24 hours.
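For the RARP flooding fix above, the advanced option called out in the release notes can be checked and set from the command line; a quick sketch using standard esxcli syntax:

    # Check the current value, then set /Net/NetSendRARPOnPortEnablement to 0
    esxcli system settings advanced list -o /Net/NetSendRARPOnPortEnablement
    esxcli system settings advanced set -o /Net/NetSendRARPOnPortEnablement -i 0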

It’s also worth reading through the Known Issues section as there is a fair bit to be aware of, especially if you are running NFS 4.1, and the general storage issues are worth a look as well.

Happy upgrading!

References:

http://pubs.vmware.com/Release_Notes/en/vsphere/60/vsphere-vcenter-server-60u3-release-notes.html

http://pubs.vmware.com/Release_Notes/en/vsphere/60/vsphere-esxi-60u3-release-notes.html

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2149127

vSphere 6.5 – What’s in it for Service Providers Part 1

Last week, after an extended period of development and beta testing, VMware released vSphere 6.5. This is a lot more than a point release; it’s a major upgrade from vSphere 6.0. In fact, there is so much packed into this new release that there is an official whitepaper listing all the features and enhancements, linked from the release notes. I thought I would go through some of the key features and enhancements included in the latest versions of vCenter and ESXi, and as per usual I’ll focus on those improvements that relate back to the Service Providers that use vSphere as the foundation of their Managed or Infrastructure as a Service offerings.

Generally the “what’s new” would fit into one post; however, having gotten through just the vCenter features it became apparent that this would have to be a multi-post series…which is great news for vCloud Air Network Service Providers out there, as it means there is a lot packed in for IaaS and MSPs to take advantage of.

With that, this post will cover the following:

  • vCenter 6.5 New Features
  • vCD and NSX Compatibility
  • Current Known Issues

vCenter 6.5 New Features:

Without question the enhancements to the VCSA stand out as some of the biggest features of 6.5, and as mentioned in the whitepaper, the installer process has been overhauled and is a much smoother, more streamlined experience than in previous versions. It’s also supported across more operating systems, and the 6.5 VCSA now surpasses the Windows version of vCenter, offering the migration tool, native high availability and built-in backup and restore. One interesting side note to the new VCSA is that the HTML5 vSphere Client has shipped with it, though it’s still very much a work in progress, with a lot of unsupported functionality mentioned in the release notes…there is plenty of work to do before it reaches parity with the Flex Web Client.

In terms of the built-in PostgreSQL database, I think it’s time Service Providers felt confident making the switch away from MSSQL (which was the norm with Windows-based vCenters). The enhanced VCSA Management Interface (found on port 5480) has a new monitoring screen showing disk space usage and also provides a way to gracefully start and stop the database engine.
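As a rough sketch of what that looks like in practice (the VAMI URL pattern is standard; the service name below is my assumption, so check service-control --status --all on your appliance first):

    # VCSA appliance management interface (VAMI) in a browser:
    #   https://<vcsa-fqdn>:5480
    # From the appliance Bash shell, check and control the embedded database service
    # (assumed service name vmware-vpostgres; stopping it takes vCenter services down with it)
    service-control --status --all | grep -i postgres
    service-control --stop vmware-vpostgres
    service-control --start vmware-vpostgres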

Other vCenter enhancements that Service Providers will make use of include the native high availability feature, which is something a lot of people have been asking about for a long time. For me, I always dealt with the lack of HA by accepting that vCenter might become unavailable for 5-10 minutes during maintenance, or at worst suffer an extended outage while recovering from a VM or OS-level failure. Knowing that hosts and VMs keep working and responding with vCenter down, leaving only core management functionality unavailable, it was a risk myself and others were willing to take. However, in this day of the always-on datacenter it’s expected that management functionality be as available as the IaaS services themselves…so this HA feature is very welcome for Service Providers.

This native HA solution is available exclusively for the VCSA and the solution consists of active, passive, and witness nodes that are cloned from the existing vCenter Server instance. The HA cluster can be enabled, disabled, or destroyed at any time. There is also a maintenance mode that prevents planned maintenance from causing an unwanted failover.

The VCSA Migration Tool that was previously released in 6.0 Update 2m now ships on the VCSA ISO and can be used to migrate from Windows-based 5.5 vCenters to the 6.5 VCSA. Again, this is something more and more service providers will take advantage of, as the reliance on Windows-based vCenters and MSSQL becomes increasingly unwanted from a manageability and cost point of view. Throw in the enhanced features that have only been released for the VCSA, and this is a migration all service providers should be planning.

To complete the move away from Windows-based dependencies, vSphere Update Manager has also been fully integrated into the VCSA. VUM is now part of the Web Client UI and is enabled by default. For larger environments with a large number of hosts, Auto Deploy is now fully manageable from the VCSA UI and no longer requires PowerCLI to manage or configure its options. There is a new Image Builder included in the UI that can hit local or public repositories to pull images or drivers, and there are performance enhancements when deploying ESXi images to hosts.

vCD and NSX Compatibility:

Shifting from new features and enhancements to an important subject when talking about a service provider platform…VMware product compatibility. vCAN Service Providers running a Hybrid Cloud will be running a combination of vCloud Director SP and/or NSX-v, and at the moment there is no vSphere 6.5 support for either. No compatible versions of NSX are available for vSphere 6.5; if you attempt to prepare vSphere 6.5 hosts with NSX 6.2.x, you receive an error message and cannot proceed.

I haven’t tested whether vCloud Director SP will connect to and interact with vCenter 6.5 or ESXi 6.5, but as it’s not supported I wouldn’t suggest upgrading production IaaS platforms until the interoperability matrices are updated.

At this stage there is no word on when either product will support vSphere 6.5, but I suspect we will see NSX-v come out with a supported build shortly…though I’m expecting vCloud Director SP not to support 6.5 until the next major version release, which is looking like the new year.

Installation and Upgrade Known Issues:

Having read through the release notes, there are also a number of known issues you should be aware of. I’ve gone through those and pulled out the ones I consider most likely to impact IaaS platforms.

  • After upgrading to vCenter Server 6.5, the ESXi hosts in High Availability clusters appear as Not Ready in the VMware NSX UI
    If your vSphere environment includes NSX and clusters configured with vSphere High Availability, after you upgrade to vCenter Server 6.5, both NSX and vSphere High Availability start installing VIBs on all hosts in the clusters. This might cause installation of NSX VIBs on some hosts to fail, and you see the hosts as Not Ready in the NSX UI.
    Workaround: Use the NSX UI to reinstall the VIBs.
  • Error 400 during attempt to log in to vCenter Server from the vSphere Web Client
    You log in to vCenter Server from the vSphere Web Client and log out. If, after 8 hours or more, you attempt to log in from the same browser tab, the following error results:
    400 An Error occurred from SSO. urn:oasis:names:tc:SAML:2.0:status:Requester, sub status:null
    Workaround: Close the browser or the browser tab and log in again.
  • Using storage rescan in environments with a large number of LUNs might cause unpredictable problems
    Storage rescan is an I/O intensive operation. If you run it while performing other datastore management operations, such as creating or extending a datastore, you might experience delays and other problems. Problems are likely to occur in environments with a large number of LUNs, up to the 1024 that are supported in the vSphere 6.5 release.
    Workaround: Typically, the storage rescans that your hosts periodically perform are sufficient. You are not required to rescan storage when you perform general datastore management tasks. Run storage rescans only when absolutely necessary, especially when your deployments include a large set of LUNs.
  • In vSphere 6.5, the name assigned to the iSCSI software adapter is different from earlier releases
    After you upgrade to the vSphere 6.5 release, the name of the existing software iSCSI adapter, vmhbaXX, changes. This change affects any scripts that use hard-coded values for the name of the adapter. Because VMware does not guarantee that the adapter name remains the same across releases, you should not hard code the name in scripts. The name change does not affect the behavior of the iSCSI software adapter (see the sketch after this list for one way to look the name up at runtime).
    Workaround: None.
  • The bnx2x inbox driver that supports the QLogic NetXtreme II Network/iSCSI/FCoE adapter might cause problems in your ESXi environment
    Problems and errors occur when you disable or enable VMkernel ports and change the failover order of NICs for your iSCSI network setup.
    Workaround: Replace the bnx2x driver with an asynchronous driver. For information, see the VMware Web site.
  • When you use the Dell lsi_mr3 driver version 6.903.85.00-1OEM.600.0.0.2768847, you might encounter errors
    If you use the Dell lsi_mr3 asynchronous driver version 6.903.85.00-1OEM.600.0.0.2768847, the VMkernel logs might display the following message: ScsiCore: 1806: Invalid sense buffer.
    Workaround: Replace the driver with the vSphere 6.5 inbox driver or an asynchronous driver from Broadcom.
  • Storage I/O Control settings are not honored per VMDK
    Storage I/O Control settings are not honored on a per-VMDK basis. The VMDK settings are honored at the virtual machine level.
    Workaround: None.
  • Cannot create or clone a virtual machine on a SDRS-disabled datastore cluster
    This issue occurs when you select a datastore that is part of a SDRS-disabled datastore cluster in any of the New Virtual Machine, Clone Virtual Machine (to virtual machine or to template), or Deploy From Template wizards. When you arrive at the Ready to Complete page and click Finish, the wizard remains open and nothing appears to occur. The Datastore value status for the virtual machine might display “Getting data…” and does not change.
    Workaround: Use the vSphere Web Client for placing virtual machines on SDRS-disabled datastore clusters.
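On the iSCSI adapter rename issue above, the safer pattern is to look the adapter name up at runtime rather than hard-coding vmhbaXX. A hedged sketch, assuming the software iSCSI adapter reports the usual iscsi_vmk driver:

    # Discover the software iSCSI adapter name instead of hard-coding vmhbaXX
    esxcli iscsi adapter list
    ADAPTER=$(esxcli iscsi adapter list | awk '/iscsi_vmk/ {print $1}')
    echo "Software iSCSI adapter is ${ADAPTER}"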

These are just a few that I have singled out…it’s worth reading through all the known issues just in case there are any specific issues that might impact you.

In the next post in this vSphere 6.5 for Service Providers series I will cover more vCenter features, as well as ESXi enhancements and what’s new in Core Storage.

References:

http://pubs.vmware.com/Release_Notes/en/vsphere/65/vsphere-esxi-vcenter-server-65-release-notes.html

http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/whitepaper/vsphere/vmw-white-paper-vsphr-whats-new-6-5.pdf

http://pubs.vmware.com/Release_Notes/en/vsphere/65/vsphere-client-65-html5-functionality-support.html

vSphere 6 Update 2 – What’s In It for Service Providers

It’s been just over a week since VMware released vSphere 6 Update 2, and I thought I would go through some of the key features and fixes included in the latest versions of vCenter and ESXi. As usual, I keep an eye out for improvements that relate back to Service Providers who use vSphere as the foundation of their Managed or Infrastructure as a Service offerings.

New Features:

Without question the biggest new feature is the release of VSAN 6.2. I’ve covered this release in previous blog posts, and when you upgrade to ESXi 6.0 Update 2 the VSAN 6.2 bits are present within the kernel; some VSAN services are actually in play regardless of whether you have it configured or not…which is interesting. With the new pricing for VSAN through the vCAN program, Service Providers can now seriously think about deploying VSAN for their main IaaS platforms.

The addition of support for high speed Ethernet links is significant, not only because the addition of 25G and 50G link speeds means increased throughput for converged network cards, allowing more network traffic to flow through hosts and switches for Fault Tolerance, vMotion, Storage vMotion and storage traffic, but also because it allows SPs to think about building Edge Clusters for networking services such as NSX, with line speeds that can take advantage of even faster backends.

From a manageability point of view, the Host Client HTML5 user interface is a welcome addition and hopefully paves the way for more HTML5 management goodness from VMware, not only for hosts but also for vCenter itself. There is a fair bit of power already in the Host Client, and I’d bet admins will use it more and more as it continues to evolve.

For vCenter, the addition of two-factor authentication using RSA or smart card technology is an important feature for SPs to use if they are considering any sort of certification for their services. For example, many government-based certifications such as IRAP require it.

Resolved Issues:

There are a bunch of resolved issues in this build and I’ve gone through the rather extensive list to pull out the biggest fixes that relate to my experience in service provider operations.

vCenter:

  • Upgrading vCenter Server from 5.5 Update 3b to 6.0 Update 1b might fail if SSLv3 is disabled on port 7444 of vCenter Server 5.5 Update 3b. An upgrade from vCenter Server 5.5 Update 3b to 6.0 Update 2 works fine if SSLv3 is disabled by default on 7444 port of vCenter Server 5.5 Update 3b.
  • Deploying a vApp on vCloud Director through the vApp template fails with a Profile-Driven storage error. When you refresh the storage policy, an error message similar to the following is displayed: The entity vCenter Server is busy completing an operation.
  • Delta disk names of the source VM are retained in the disk names of the cloned VM. When you create a hot clone of a VM that has one or more snapshots, the delta disk names of the source VM are retained in the cloned VM.
  • vCenter Server service (vpxd) might fail during a virtual machine power on operation in a Distributed Resource Scheduler (DRS) cluster.

ESXi:

  • Hostd might stop responding when you execute esxcli commands using PowerCLI, resulting in memory leaks and memory consumption exceeding the hard limit.
  • ESXi mClock I/O scheduler does not work as expected. The ESXi mClock I/O scheduler does not limit the I/Os with a lesser load even after you change the IOPS of the VM using the vSphere Client.
  • After you upgrade a Virtual SAN environment to ESXi 6.0 Update 1b, the vCenter Server reports a false warning in the Summary tab of the vSphere Web Client, and the ESXi host shows a notification triangle.
  • Attempts to perform vMotion might fail after you upgrade from ESXi 5.0 or 5.1 to 6.0 Update 1 or later releases. An error message similar to the following is written to the vmware.log file.
  • Virtual machine performance metrics are not displayed correctly, as the performance counter cpu.system.summation for a VM is always displayed as 0.
  • Attempts to perform vMotion with ESXi 6.0 virtual machines that have two 2 TB virtual disks created on ESXi 5.0 fail, with an error message similar to the following logged in the vpxd.log file:
    2015-09-28T10:00:28.721+09:00 info vpxd[xxxxx] [Originator@6876 sub=vpxLro opID=xxxxxxxx-xxxxxxxx-xx] [VpxLRO] -- BEGIN task-919 -- vm-281 -- vim.VirtualMachine.relocate -- xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx(xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

The mClock fix highlighted above is a significant one for those who were looking to use IOPS limiting. It’s basically been broken since 5.5 Update 2, and it also changes the way IOPS are interpreted through the VM-to-storage stack. For service providers looking to introduce IOPS limits to control the impact of noisy neighbors, the fix is welcome.

As usual there are still a lot of known issues, and some have been added or updated in the release notes since release day. Overall, the early noise coming out of the community is that this Update 2 release is relatively solid, and there have been improvements in network performance and general overall stability. Hopefully we don’t see a repeat of the 5.5 Update 2 issues or the more recent bugs that have plagued previous releases…and hopefully no more CBT issues!

vSphere 6.0 Update 2 has a lot of goodness for Service Providers and continues to offer the number one virtualization platform on which to build managed and hosted services. Go grab it now and put it through its paces before pushing to production!

References:

http://pubs.vmware.com/Release_Notes/en/vsphere/60/vsphere-esxi-60u2-release-notes.html

http://pubs.vmware.com/Release_Notes/en/vsphere/60/vsphere-vcenter-server-60u2-release-notes.html

http://pubs.vmware.com/Release_Notes/en/vsan/62/vmware-virtual-san-62-release-notes.html

http://pubs.vmware.com/Release_Notes/en/vsphere/60/vmware-host-client-10-release-notes.html

Heads Up: Heavy VXLAN Traffic Causing Broadcom 10GB NICs to Drop

For the last couple of weeks we have had some intermittent issues whereby ESXi network adapters have gone into a disconnected state, requiring a host reboot to bring the link back online. Generally it was only one NIC at a time, but in some circumstances both NICs went offline, resulting in host failure and VM HA events being triggered. From the console ESXi appeared to be up, but each NIC was listed as disconnected, and when we checked the switch ports there was no indication of a loss of link.

In the vmkernel logs the following entries are observed:

After some time working with VMware Support, our Ops Engineer @santinidaniel came across this VMware KB, which described the situation we were seeing. Interestingly enough, we only saw this happening after recent host updates to ESXi 5.5 Update 3 builds, but as the issue is listed as being present in ESXi 5.0, 5.5 and 6.0 that might just be a side note.

The cause as listed in the KB is:

This issue occurs when the guest virtual machine sends invalid metadata for TSO packets. The packet length is less than Maximum Segment Size (MSS), but the TSO bit is set. This causes the adapter and driver to go into a non-operational state.

Note: This issue occurs only with VXLAN configured and when there is heavy VXLAN traffic.

It just so happened that we did indeed have a large customer with heavily used Citrix terminal servers on our NSX Advanced Networking offering…and they were sitting on a VXLAN VirtualWire. The symptoms got worse today, which coincided with the first official day of work for the new year.

There is a simple workaround:

That command has been described in blog posts relating to the Broadcom (now presenting as QLogic) drivers, and where previously there was no resolution, there is now a fix in place by upgrading to the latest drivers here. Without upgrading to the latest certified drivers, the quickest way to avoid the issue is to apply the workaround and reboot the host.
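For illustration only, a mitigation of that general shape is to disable hardware TSO at the host level; the option names below are my assumption rather than a quote of the KB, so follow the KB for the authoritative workaround:

    # Illustrative sketch only: disable hardware TSO (IPv4 and IPv6 variants), then reboot
    # (option names assumed; confirm the exact workaround against the VMware KB)
    esxcli system settings advanced set -o /Net/UseHwTSO -i 0
    esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0
    reboot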

There has been a recent outcry bemoaning the lack of QA in some of VMware’s latest releases, but the reality is that the more bits you add, the more likely it is for issues to pop up…this is becoming more the case with ESXi as the base virtualization platform continues to add to its feature set, which now includes VSAN baked in. Host extensions further add to the chance of things going wrong in situations that are hard to cover as part of the QA process.

Deal, fix…and move on!

References:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2114957

https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI55-QLOGIC-BNX2X-271250V556&productId=353

 

vSphere 5.5 Update 3 Released: Features and Top Fixes

vSphere 5.5 Update 3 was released earlier today, and there are a bunch of bug fixes and feature improvements in this update for both vCenter and ESXi. For most Service Providers, updating to vSphere 6.0 is still a while away, so it’s good to have continued support and improvement for the 5.5 platform. I’ve scanned through the release notes and picked out what I consider some of the more important bug fixes and resolved issues as they pertain to my deployments of vSphere.

Note: It still appears that there is no resolution to the vMotion errors I reported on earlier in the year, or to the bugs around the mClock scheduler and IOPS limiter on NFS.

ESXi 5.5 Update 3:

  • Status of some disks might be displayed as UNCONFIGURED GOOD instead of ONLINE.
    The status of some disks on an ESXi 5.5 host might be displayed as UNCONFIGURED GOOD instead of ONLINE. This issue occurs for LSI controllers using the LSI CIM provider.
  • Cloning CBT-enabled virtual machine templates from ESXi hosts might fail.
    Attempts to clone CBT-enabled virtual machine templates simultaneously from two different ESXi 5.5 hosts might fail. An error message similar to the following is displayed:
    Failed to open VM_template.vmdk': Could not open/create change tracking file (2108).
  • ESXi hosts with virtual machines using the e1000 or e1000e vNIC driver might fail with a purple screen.
    ESXi hosts with virtual machines using the e1000 or e1000e vNIC driver might fail with a purple diagnostic screen when you enable TCP Segmentation Offload (TSO). Error messages similar to the following might be written to the log files, followed by a VMkernel backtrace:
    cpu7:nnnnnn)Code start: 0xnnnnnnnnnnnn VMK uptime: 9:21:12:17.991
  • Attempts to reboot Windows 8 and Windows Server 2012 virtual machines on an ESXi host might fail.
    After you reboot, Windows 8 and Windows Server 2012 virtual machines might become unresponsive when the Microsoft Windows boot splash screen appears. For more information, refer to Knowledge Base article 2092807.
  • Attempts to install or upgrade VMware Tools on a Solaris 10 Update 3 virtual machine might fail.
    Attempts to install or upgrade VMware Tools on a Solaris 10 Update 3 virtual machine might fail with the following error message:
    Detected X version 6.9
    Could not read /usr/lib/vmware-tools/configurator/XOrg/7.0/vmwlegacy_drv.so Execution aborted.
    This issue occurs if the vmware-config-tools.pl script copies the vmwlegacy_drv.so file, which should not be used in Xorg 6.9.

In going through the remaining Known Issues you come across a lot of Flash Read Cache related problems…maybe VMware should call it a day with this feature…I’m not sure if anyone has the balls to actually use it in production, but I’d be interested to hear. There are also a lot of VSAN issues still listed as known, with workarounds in place…all the more reason to start a VSAN journey on vSphere 6.0.

For a look at what’s new and for the release notes in full…click on the links below:

VMware ESXi™ 5.5 Update 3 | 16 SEP 2015 | Build 3029944

vCenter Server 5.5 Update 3 | 16 SEP 2015 | Build 3000241

Leap Second Bug: Worth a Double Check…

In 2008 I vividly remember the impact that leap years, days and seconds can have on systems that are not prepared to handle the change in time or date. It was the 29th of February, and at the time I was working for a Service Provider offering Hosted Exchange services based on Exchange 2007. All of a sudden my provisioning scripts stopped working and we could not add, remove or modify Exchange mailboxes.

After a day of frustration working with MS Support and dreading a full system rebuild, the problem seemed to disappear the following day…the 1st of March. At the end of it all, and after a couple of days of Microsoft scratching their heads, the Exchange Engineering team realised that they hadn’t allowed for the leap day somewhere deep in the bowels of their code, which resulted in all account modifications failing during the 24 hours of the leap day.

Fast forward seven years: the Earth’s rotation continues to slow, and we have a situation where system administrators and operations teams need to be aware of another out-of-the-norm event that could affect systems and platforms. This time it’s a leap second adjustment, scheduled for the 30th of June 2015 at 23:59:60 UTC, which may cause issues for devices and operating systems that are NTP synchronised. Older Linux kernels seem to be the most affected, with most vendors releasing KB articles covering the leap second impact and how to work around it.

While this is not something that will bring down the internet, it’s still something all infrastructure IT professionals should be aware of, and it’s worth double checking all systems to ensure there are no embarrassing time-related incidents come the 30th of June.

ESXi and Other VMware Products:

As per this KB, ESXi is not impacted by the leap second bug…but other appliance-based solutions (mostly SUSE based) look to require enabling Slew Mode for NTP.

ESX/ESXi utilizes the RFC-1589 clock model, appropriately handling leap seconds.

It is not necessary to enable Slew Mode for NTP in ESX/ESXi’s NTP client, or to otherwise work around leap seconds by disabling and re-enabling the NTP client before and after the leap second’s occurrence. For more information, see Enabling Slew Mode for NTP (2121016).

However, while ESX/ESXi server is not expected to experience negative impact from a leap second taking place, it remains possible for Guest Operating Systems and/or running applications to experience an impact, independent of ESX/ESXi, if it is not designed to handle one. VMware recommends customers to test their complete solutions.
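For the SUSE-based appliances that do need it, enabling slew mode generally means running ntpd with the -x flag; a rough sketch for a SLES-style guest (file location and variable name assumed here, so defer to the product-specific KB):

    # /etc/sysconfig/ntp on a SLES-based appliance (assumed path): add -x so ntpd
    # slews the clock gradually instead of stepping it
    NTPD_OPTIONS="-x -g -u ntp:ntp"
    # then restart the NTP service for the change to take effect, e.g.
    #   service ntp restart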

This KB lists all the affected platforms and the suggested fixes for them. For vCloud SPs running vCloud Director…as most cells run on Red Hat Enterprise Linux there should be no impact, however it’s worth double checking, as time skew is the number one enemy of vCloud Director IaaS platforms.

Service Providers:

While most cloud providers don’t manage client operating systems directly, it would be a good move to put out some form of advisory so that clients protect their VMs before the leap second hits…if not, there could be a lot of angry service desk calls relating to increased and unexplained CPU usage, application slowdowns, application crashes, and failures on startup.


Quick Post: One Stop Shop for ESXi Driver Downloads

Today I needed to update an Emulex NIC driver for a new host that I installed using the VMware ESXi 5.5 Update 2 base image, and I had to chase up the latest OEM update bundle for the elxnet drivers…Generally, sourcing these driver bundles can be a bit of a pain, but I remembered a conversation I had last week with @dmanconi where he gave me a hot tip on a location “hidden in plain sight” that has all the latest driver bundle updates for what appear to be the most common network and storage adapters for ESXi.

This is located under the Horizon View 5.x Download Page on the MyVMware Website under the Drivers & Tools Tab.

https://my.vmware.com/web/vmware/info/slug/desktop_end_user_computing/vmware_horizon_view/5_3#drivers_tools

As you can see above, the release dates for the drivers are as recent as a couple of days ago and the list is extensive. As usual, the recommendation where possible is to use the vendor-released drivers specific to your production systems, but otherwise…they are all here at your fingertips!
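Once you have the right bundle downloaded, applying it is straightforward; a hedged sketch (the datastore path and bundle filename below are placeholders, and the host should be in maintenance mode first):

    # Check the currently installed driver VIB version
    esxcli software vib list | grep -i elx

    # Update from the downloaded offline bundle (placeholder path/filename), then reboot
    esxcli software vib update -d /vmfs/volumes/datastore1/elxnet-offline-bundle.zip
    reboot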

Enjoy.

January vCenter and ESXi 5.5 Patch Releases: Critical CBT Bug Fixed

VMware released new builds of vCenter and ESXi 5.5 today. The builds contain mostly bug fixes, but I wanted to point out one fix that affects those who use CBT as part of their VM backup strategy. Veeam users were initially affected by the bug…and though Veeam released a workaround in subsequent Veeam 8 builds, this ESXi patch officially fixes the issue.

When you use backup software that uses the Virtual Disk Development Kit (VDDK) API call QueryChangedDiskAreas(), the list of allocated disk sectors returned might be incorrect and incremental backups might appear to be corrupt or missing. A message similar to the following is written to vmware.log:

DISKLIB-CTK: Resized change tracking block size from XXX to YYY

For more information, see KB 2090639.

There are also a couple of other interesting fixes in the build:

  • Storage vMotion of thin provisioned virtual machine disk (VMDK) takes longer than the Lazy Zeroed Thick (LZT) disk.
  • When a quiesced snapshot is created on a Windows Server 2003, Windows Server 2003 R2, Windows Server 2008, Windows Server 2008 R2, or a Windows Server 2012 virtual machine, duplicate disks might be created in the virtual machine.
  • On an ESXi 5.5 host, the NFS volumes might not restore after the host reboots. This issue occurs if there is a delay in resolving the NFS server host name to IP address.

For those who back up VMs with disks larger than 128GB, I would be looking to deploy this patch ASAP.
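A quick way to confirm a host has actually picked up the new build after patching (the expected build number comes from the patch release notes):

    # Report the ESXi version and build number on the host
    vmware -vl
    esxcli system version get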

ESXi 5.5: IOPS Limit and mClock scheduler

It’s fair to say that it’s not very often the difference between a 1 and a 0 can have such a massive impact on performance…I had read a couple of posts from Duncan Epping about the new disk I/O scheduler introduced in ESXi 5.5 and what it meant for VM disk limits compared to how things were done in previous 5.x versions of ESXi.

As mentioned in the posts linked above, the ESXi host advanced setting Disk.SchedulerWithReservation is enabled by default in ESXi 5.5, so without warning, in an upgrade scenario where IOPS limits are used, the rules change…and from what I found out last night, the results are not pretty.

The graphic above shows the total latency of a VM (on an NFS datastore with an IOPS limit set) running first on ESXi 5.1, up until the first arrow, when the VM was vMotioned to an in-place upgraded host now running ESXi 5.5 Update 2. As soon as the VM lands on the 5.5 host, latency skyrockets and remains high until there is another vMotion to another 5.5 host. Interestingly, the overall latency on the other host isn’t as high as on the first, but it grows in a linear fashion.

The second arrow represents the removal of the IOPS limit, at which point latency falls to its lowest levels in the past 24 hours. The final arrow represents a test where the IOPS limit was reapplied and the advanced setting Disk.SchedulerWithReservation was set to 0 on the host.

To make sense of the above: it seems that having the mClock scheduler on caused the applied IOPS limit to generate artificial latency against the VM, rendering the limit pretty much useless. The reasons for this are unknown at this point, but reading Duncan’s blog on the IOPS limit caveat, it would seem that because the new 5.5 mClock scheduler looks at application-level block sizes to work out the IO, the applied limit actually crippled the VM. In this case the limit was set to 250 IOPS, so I am wondering if there is some unexpected behaviour happening here even when larger block sizes are being used in the VM.

Suffice to say, it looks like the smart thing to do is set Disk.SchedulerWithReservation to 0 and revert to the 5.0/5.1 behaviour with IOPS limits in place. If you want to do that in bulk, the following PowerCLI command will do the trick.
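This is a sketch of that bulk change rather than a guaranteed one-liner; connect to vCenter first and test it against a single host before rolling it out everywhere:

    # PowerCLI: set Disk.SchedulerWithReservation to 0 on every host in the connected vCenter
    Get-VMHost |
        Get-AdvancedSetting -Name "Disk.SchedulerWithReservation" |
        Set-AdvancedSetting -Value 0 -Confirm:$false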

One more interesting observation I made is that VMs with IOPS limits on iSCSI datastores appear to be less affected, if at all…however most large NFS datastores with a large number of VMs were. You can see below what happened to datastore latency when I switched off the mClock scheduler…latency dropped instantly.

I’m not sure if this indicates more general NFS issues with ESXi 5.5…there seem to have been more than a few since Update 1 came out. I’ve reached out to see if this behaviour can be explained, so hopefully I can provide an update when that information comes to hand…again, I’m not too sure what the benefit of this change in behaviour is, so I’m hoping someone who has managed to digest the brain-hurting academic paper can explain why it was introduced.

In layman’s terms…an IO is not an IO in the world of VM IOPS limits anymore.

 
