Tag Archives: Bug

Important ESXi 6.0 Patch – Upgrade Now!

Last week VMware released a new patch (ESXi 6.0 Build 5572656) that addresses a number of serious bugs with Snapshot operations. Usually I wouldn't blog about a patch release, but when I looked through the rest of the fixes in the VMwareKB it was apparent that this is more than your average VMware patch. It addresses a number of issues around storage and, again, a lot around Snapshot operations, which are critical to most VM backup operations.

Here are some of the key resolutions that I’ve picked out from the patch release:

  • When you take a snapshot of a virtual machine, the virtual machine might become unresponsive
  • After you create a virtual machine snapshot in SEsparse format, you might hit a rare race condition if there are significant but varying write IOPS to the snapshot. This race condition might make the ESXi host stop responding
  • Because of a memory leak, the hostd process might crash with the following error: Memory exceeds hard limit. Panic. The hostd logs report numerous errors such as Unable to build Durable Name. This kind of memory leak causes the host to get disconnected from vCenter Server
  • Using SEsparse for both creating snapshots and cloning virtual machines might cause a corrupted Guest OS file system
  • During snapshot consolidation a precise calculation might be performed to determine the storage space required to perform the consolidation. This precise calculation can cause the virtual machine to stop responding, because it takes a long time to complete
  • Virtual Machines with SEsparse based snapshots might stop responding, during I/O operations with a specific type of I/O workload in multiple threads
  • When you reboot the ESXi host under the following conditions, the host might fail with a purple diagnostic screen and a PCPU xxx: no heartbeat error.
    • You use the vSphere Network Appliance (DVFilter) in an NSX environment
    • You migrate a virtual machine with vMotion under DVFilter control
  • Windows 2012 domain controllers support SMBv2, whereas the Likewise stack on ESXi supports only SMBv1. With this release, the Likewise stack on ESXi is enabled to support SMBv2
  • When the unmap commands fail, the ESXi host might stop responding due to a memory leak in the failure path. You might receive the following error message in the vmkernel.log file: FSDisk: 300: Issue of delete blocks failed [sync:0] and the host becomes unresponsive
  • If you use SEsparse with the unmap operation enabled to create snapshots and clones of virtual machines, the guest OS file system might be corrupted after the wipe operation (the storage unmap) completes. Full clones of the virtual machine are not affected

There are also a number of vSAN-related fixes in the patch, so overall it's worth applying this patch as soon as possible.
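
If you want a quick way to see which hosts are still running below the patched build before scheduling the update, a small pyVmomi sketch along the lines of the one below will do the job. This is just my rough example and not from the KB; it assumes pyVmomi is installed, and the vCenter hostname and credentials are placeholders:

```python
# Sketch: list ESXi host build numbers via pyVmomi and flag anything below
# the patched build (5572656). Hostname/credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

PATCHED_BUILD = 5572656

ctx = ssl._create_unverified_context()  # lab only; use proper certs in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        build = int(host.config.product.build)
        status = "OK" if build >= PATCHED_BUILD else "NEEDS PATCH"
        print(f"{host.name}: build {build} - {status}")
    view.DestroyView()
finally:
    Disconnect(si)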

References:

https://kb.vmware.com/kb/2149955

NSX Bytes: NSX-v 6.3 Host Preparation Fails with Agent VIB module not installed

NSX-v 6.3 was released last week with an impressive list of new enhancements, and I wasted no time in looking to upgrade my NestedESXi lab instance from 6.2.5 to 6.3. However, I ran into an issue that at first I thought was related to a previous VIB upgrade issue caused by VMware Update Manager not being available during NSX Host upgrades. In this case it presented with the same error message in the vCenter Events view:

VIB module for agent is not installed on host <hostname> (_VCNS_xxx_Cluster_VMware Network Fabri)

After ensuring that my Update Manager was in a good state I was left scratching my head…that was until some back and forth in the vExpert Slack #NSX channel pointed to a new VMwareKB released the same day as NSX-v 6.3.

https://kb.vmware.com/kb/2053782

This issue occurs if vSphere Update Manager (VUM) is unavailable. EAM depends on VUM to approve the installation or uninstallation of VIBs to and from the ESXi host.

Even though my Update Manager was available, I was not able to upgrade through Host Preparation. It seems like vSphere 6.x instances might be impacted by this bug, but the good news is that there is a relatively easy workaround, as mentioned in the VMwareKB, that bypasses the VUM install mechanism. To enable the workaround you need to enter the Managed Object Browser of the vCenter EAM by going to the following URL and entering vCenter admin credentials.

https://vCenter_Server_IP/eam/mob/ 

Once logged in you are presented with an agency (or a list of agencies). In my case I had more than one, but I selected the first one in the list, which was agency-11.

The value that needs to be changed is the bypassVumEnabled boolean value as shown below.

To set that flag to True enter in the following URL:

https://vCenter_Server_IP/eam/mob/?moid=agency-x&method=Update

Make sure that the agency number matches your vCenter EAM instance. From there you need to change the existing configuration for that value by removing all the text in the value box and invoking the bypassVumEnabled config value from the VMwareKB (a scripted sketch of the same call is shown below):
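
As a side note, the same method invocation can be scripted rather than done through the browser. The sketch below is purely illustrative of the MOB pattern (grab the session nonce from the form, then POST the method); the parameter name and the XML payload are my assumptions rather than something taken from the KB, so treat the VMwareKB as the source of truth for the exact value:

```python
# Sketch only: invoking the EAM MOB Update method from a script instead of the
# browser. The "config" form field name and the XML payload are assumptions -
# the authoritative value to use is in VMware KB 2053782.
import re
import requests

VCENTER = "vCenter_Server_IP"           # placeholder
AGENCY = "agency-11"                    # match your own agency ID from /eam/mob/
URL = f"https://{VCENTER}/eam/mob/?moid={AGENCY}&method=Update"

session = requests.Session()
session.auth = ("administrator@vsphere.local", "VMware1!")   # placeholders
session.verify = False                  # lab only

# The MOB protects method invocation with a per-session nonce embedded in the form.
form = session.get(URL)
nonce = re.search(r'name="vmware-session-nonce"[^>]*value="([^"]+)"', form.text).group(1)

# Assumed payload shape - replace with the exact value from the KB.
config_xml = "<config><bypassVumEnabled>true</bypassVumEnabled></config>"

resp = session.post(URL, data={"vmware-session-nonce": nonce, "config": config_xml})
print(resp.status_code)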

Once invoked you should be able to go back into the Web Client and click on Resolve under the Cluster name in the Host Preparation Tab of the NSX Installation window.

Once done I was in an all-green state and all hosts were upgraded to 6.3.0.5007049. Once all hosts have been upgraded it might be worth reversing the workaround and waiting for an official fix from VMware.

References:

https://kb.vmware.com/kb/2053782

NSX Bytes: Important Bug in 6.2.4 to be Aware of

[UPDATE] In light of this post being quoted on The Register I wanted to clarify a couple of things. First off, as mentioned there is a fix for this issue (the KB should be rewritten to clearly state that) and secondly, if you read below, you will see that I did not state that just about anyone running NSX-v 6.2.4 will be impacted. Greenfield deployments are not impacted.

Here we go again…I thought maybe we were over these, but it looks like NSX-v 6.2.4 contains a fairly serious bug impacting VMs after vMotion operations. I had intended to write about this earlier in the week when I first became aware of the issue, but the last couple of days have gotten away from me. That said, please be aware of this issue as it will impact those who have upgraded NSX-v from 6.1.x to 6.2.4.

As the KB states, the issue appears if you have the Distributed Firewall enabled (it's enabled and inline by default) and you have upgraded NSX-v from 6.1.x to 6.2.3 and above, though for most this will apply to 6.2.4 upgrades due to all the issues in 6.2.3. If VMs are migrated between upgraded hosts they will lose network connectivity and require a reboot to bring connectivity back.

If you check the vmkernel.log file you will see entries similar to those below.

Cause

This issue occurs when the VSIP module at the kernel level does not handle the export_version deployed in NSX for vSphere 6.1.x correctly during the upgrade process.
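
If you want to quickly check whether an upgraded host is logging related entries, a rough sketch like the one below can pull matching lines from the vmkernel log over SSH. The search strings are my guesses based on the cause above rather than the exact messages from the KB, and the host and credentials are placeholders:

```python
# Sketch: grep a host's vmkernel.log for VSIP/export_version related entries
# over SSH. Search strings are assumptions; see KB 2146171 for the exact text.
import paramiko

HOST = "esxi01.lab.local"               # placeholder
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOST, username="root", password="VMware1!")   # placeholders

stdin, stdout, stderr = ssh.exec_command(
    "grep -iE 'vsip|export_version' /var/log/vmkernel.log | tail -n 20")
print(stdout.read().decode())
ssh.close()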

There is no current resolution to the issue apart from the VM reboot, but there is a workaround in the form of a script that can be obtained via GSS if you reference KB2146171. Hopefully there will be a proper fix in a future NSX release.

<RANT>

I can't believe something as serious as this was missed by QA for what is VMware's flagship product. It's beyond me that this sort of error wasn't picked up in testing before it was released. It's simply not good enough that a major release goes out with this sort of bug, and I don't know how it keeps on happening. This one specifically impacted customers, and for service providers or enterprises that upgraded in good faith, it puts egg on the faces of those who approve, plan and execute the upgrades, resulting in unhappy customers or internal users.

Most organisations can't fully replicate production situations when testing upgrades due to a lack of resources or a lack of real-world situation testing…VMware could and should have the resources to stop these bugs leaking into release builds. For now, if possible, I would suggest that people add more stringent vMotion tests as part of NSX-v lab testing before promoting into production.

VMware customers shouldn’t have to be the ones discovering these bugs!

</RANT>

[UPDATE] While I am obviously not happy about this issue coming in the wake of previous issues, I still believe in NSX and would recommend all shops looking to automate networking keep faith in what the platform offers. Bugs will happen…I get that, but I know in the long run there is huge benefit in running NSX.

References:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2146171

NSX Bytes: NSX-v 6.2.4 Released …Important Upgrade!

NSX-v 6.2.4 was released the week before VMworld US, so it might have gotten somewhat lost in the VMworld noise…For those who were fortunate enough not to upgrade to or deploy a greenfield 6.2.3 site, you can now safely do so without the nasty bugs that existed in the 6.2.3 build. In a nutshell, this new build delivers all the significant features and enhancements announced in 6.2.3 without the dFW or Edge Gateway bugs that forced the build to be pulled from distribution a few weeks back.

In terms of how and when to upgrade from previous versions the following table gives a great overview of the pathways required to get to 6.2.4.

The takeaway from the table above is that, if possible, you need to get onto NSX-v 6.2.4 as soon as possible, and with good reason:

  • VMware NSX 6.2.4 provides critical bug fixes identified in NSX 6.2.3, and 6.2.4 delivers a security patch for CVE-2016-2079, which is a critical input validation vulnerability for sites that use NSX SSL VPN.
  • For customers who use SSL VPN, VMware strongly recommends a review of CVE-2016-2079 and an upgrade to NSX 6.2.4.
  • For customers who have installed NSX 6.2.3 or 6.2.3a, VMware recommends installing NSX 6.2.4 to address critical bug fixes.

Prior to this release, if you had upgraded to NSX-v 6.1.7 you were stuck and not able to upgrade to 6.2.3. The upgrade matrix is now reporting that you can upgrade 6.1.7 to 6.2.4, as shown below.

I was able to validate this in my lab going from 6.1.7 to 6.2.4 without any issues.

NSX-v 6.2.4 is also fully supported by vCloud Director SP 8.0.1 and 8.10.

References:

http://pubs.vmware.com/Release_Notes/en/nsx/6.2.4/releasenotes_nsx_vsphere_624.html

http://www.theregister.co.uk/2016/07/22/please_dont_upgrade_nsx_just_now_says_vmware/

Quick Post: ESXi 6.0 Patch Breaks Veeam Instant VM Recovery

This is a quick post to alert Veeam users to an issue that was raised in the Veeam Community Forums yesterday. Firstly, if you are a Veeam customer and are not registered for the Veeam Community Forum Digest that Anton Gostev releases every Sunday night, then stop reading this and go register here! There is some awesome content that Anton covers, and it's not just limited to backups but covers general industry news and trends as well.

Once you have done that, I thought I would bring to everyone's attention an important note that Gostev mentioned in his last update relating to an issue with Veeam Instant Recovery, and all dependent features, when ESXi 6.0 Patch 6 (Build 3825889) is installed.

This patch was released on the 12th of May, so chances are some people have deployed it and are being impacted if they use or rely on Instant Recovery. As Gostev mentions, Veeam have an ongoing support case with VMware, but as usual Veeam have gone ahead and put a workaround in place in the form of a hotfix, which is applicable to Veeam 9.0 Update 1.

If you have deployed this ESXi 6.0 build and run Veeam, contact their support to grab the hotfix. Again, well done to the Veeam development teams for working around issues so efficiently.

References:

https://forums.veeam.com/ucp.php?mode=register&sid=1a1ab7f2950f864f9bd3a4e4d2f0dcce

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2136186

ESXi Bugs – VMware Can’t Keep Letting This Happen!

VMware is at an interesting place at this point in time…there is still no doubting that ESXi and vCenter are the market leaders in terms of hypervisor platform, and that the vCloud Suite offers a strong portfolio of management, automation and monitoring tools. However, VMware has become the hunted and is suffering what most massively successful tech companies go through after a sustained period of uninterrupted success…there are those that want to see it burn!

There are a number of competing vendors (and industry watchers) waiting to capitalize on any weakness shown in the VMware stack, and with the recent run of QA issues leading to significant bugs showing no sign of abating, I wonder how much longer VMware can afford to continue to slip up before it genuinely hurts its standing.

The latest couple to watch out for have become common repeat offenders since the 5.5 release…problems with vMotion, pathing leading to PDLs/APDs, and CBT issues have seemed to be on repeat if you search through the VMwareKBs over the past twelve to eighteen months.

KB2143943 – vMotion Fails After Upgrading from a number of builds
KB2144657 – ESXi 6 may not fail over correctly after encountering a PDL

As a Service Provider the CBT bugs are the most worrying, because they fundamentally threaten the integrity of backup data, which is not something that IT operations staff, or end users whose data is put at risk, should have to worry about. Veeam have done a great job circumventing the issue, though these issues are being fixed with drastic measures like full CBT resets…On an IaaS platform where machines are not easily scheduled for downtime, this is a massive issue.

I know that VMware are not purposely going out of their way to produce these errors, and I am sure that there are individuals and teams getting an ass whipping internally. But it has to stop…the quality of what is released to the public for consumption can't continue to suffer from these issues. Their lead is secure for the moment and VMware have an extremely passionate and committed supporter base, and even though their hypervisor competitors are not free of devastating bugs themselves (in fact ESXi was still the least patched hypervisor platform in the last 12 months), it's not a lead VMware can afford to let slip any more…especially with ESXi and vCenter still at the heart of what VMware is trying to achieve through new focus products like NSX and VSAN.

To be fair, the VMware team do a great job and keep everyone up to date with issues as they arise, and they are generally fixed in quick time…but VMware can't afford to have many more of these:

Resolution:
This is a known issue affecting ESXi 6.0.
Currently, there is no resolution.

Especially if they are repeat bugs!

http://blogs.vmware.com/kbdigest/ 

Quick Fix: vCenter 5.5 Update 3x Phone Home Warning and VPXD Service not Starting

This week I've been upgrading vCenter in a couple of our labs and came across this issue during and after the upgrade of vCenter from 5.5 Update 2 to 5.5 Update 3a or 3b. During the upgrade the error below is thrown.

It's an easy one to ignore, as it only relates to the Phone Home service…which, to be honest, I didn't think was important at the time. When you click OK the installer finishes as successful; however, the vCenter Service is not brought up automatically, and when you go to start the service you get the following error from the services manager.

I'm not sure why Googling this particular error wasn't straightforward, but if you search for Error 1053 or Error 1053 + VMware you get referenced to some generic forum threads and this VMware KB, which is a red herring in relation to this error. With that I went back and searched against the Phone Home Warning 32014 and got a hit against this VMware KB, which contains the exact error and a reference to the deployPkg.dll that you see in the Windows Application Event Logs when you try to start the vCenter Service.

The KB title is a little misleading in that it states

Updating vCenter Server 5.5 to Update 3 fails with the error: Warning 32014

However, the fix is the right fix, and after working through the workaround in the KB the upgrades went through without issue and vCenter was at 5.5 Update 3b.

References:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2134141

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2069296

NSX vCloud Retrofit: Controller Deployment Internal Server Error Has Occurred

During my initial work with NSX-v I was running various 6.0.x Builds together with vCD 5.5.2 and vCenter/ESXi 5.5 without issue. When NSX 6.1 was released I decided to clone off my base Production Environment to test a fresh deployment of 6.1.2 into a mature vCloud Director 5.5.2 instance that had vCNS running a VSM version of 5.5.3. When it came time to deploy the first Controller I received the following error from the Networking & Security section of the Web Client:

Looking at the vCenter Tasks and Events, the NSX Manager did not even try to kick off the OVF deployment of the Controller VM…the error itself wasn't giving too much away, and suspecting a GUI bug I attempted to deploy the Controller directly via the REST APIs…this failed as well, however the error returned was a little more detailed.
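
For reference, driving the controller deployment through the API looks something like the sketch below. This is only an illustration using the standard NSX-v controller endpoint; the IDs are placeholders and the spec element names should be checked against the NSX for vSphere API guide for your build:

```python
# Sketch: deploying an NSX-v controller via the REST API with the requests
# library. IDs and element names are placeholders/assumptions - check the
# NSX for vSphere API guide for your build before using.
import requests

NSX_MANAGER = "nsxmanager.lab.local"    # placeholder
URL = f"https://{NSX_MANAGER}/api/2.0/vdn/controller"

controller_spec = """
<controllerSpec>
  <name>controller-1</name>
  <ipPoolId>ipaddresspool-1</ipPoolId>
  <resourcePoolId>resgroup-1</resourcePoolId>
  <datastoreId>datastore-1</datastoreId>
  <connectedSwitchOrPortGroupId>dvportgroup-1</connectedSwitchOrPortGroupId>
  <password>Str0ngP@ssw0rd!</password>
</controllerSpec>
"""

resp = requests.post(
    URL,
    data=controller_spec,
    headers={"Content-Type": "application/xml"},
    auth=("admin", "default"),          # NSX Manager credentials (placeholders)
    verify=False)                       # lab only
print(resp.status_code, resp.text)      # a failure returns the detailed error body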

Looking through the NSX Manager Logs the corresponding System Log Message looked like this:

The Issue:

While the error itself is fairly straightforward to understand, in that a value being entered into the database table was not of the right type…the reasons why it had shown up in the 6.1.1 and 6.1.2 releases, after having no such issue working with the 6.0.x releases, stumped everyone involved in the ensuing Support Case. In fact it seemed like this was the only/first instance (at the time) of this error in all global NSX-v installs.

The Fix:

NOTE: This can only be performed by VMware Support via an SR.

The fix is a simple SQL query to alter the KEY_VALUE_STORE table referenced in the error…however this can only be done by VMware Support, as it requires special access to the NSX Manager operating system to commit the changes. A word of warning…if you happen to have the secret password to get in the back door of the NSX Manager and apply these changes yourself, your support for NSX could become null and void!

Once that’s been committed and the NSX Manager Service restarted Controllers can be successfully deployed…again, the fix needs to be applied by VMware Support.

The RCA:

In regards to the RCA of this issue: customers with a long upgrade history (5.0 onwards) will hit the issue, since during the database migration that happens as part of the 5.0 to 5.1.x upgrade, the alter table script for KEY_VALUE_STORE was missing. As per VMware engineering, a new upgrade of NSX Manager is not going to override the DB schema change, since there is no such script to alter the table on the upgrade path.

There was no indication of this being fixed in subsequent NSX Releases and no real explanation as to why it didn’t happen in 6.0.x but that aside the fix works and can be actioned in 5 minutes.

This was a fairly unique situation that contributed to this bug being exposed…my environment was a vCNS 5.1 -> 5.5 -> 5.5.2 -> 5.5.3.1 -> NSX 6.1 -> 6.1.1 -> 6.1.2 replica of one of our Production vCloud Zones that sits in our lab. Previously I'd been able to fully deploy NSX end to end using the same base systems, which sat side by side in working order in a separate #NestedESXi lab…but that was vCNS 5.5.2 upgraded to NSX 6.0.5, which was then upgraded to 6.1.2.

So, with not too many deployments of NSX-v out there, the issue has only manifested in mature vCNS environments that were upgraded to 6.1.1 or 6.1.2. Something to look out for if you are looking at doing an NSX vCloud Retrofit. If you have a greenfield site you will not come across the same issue.

Further Reading:

http://anthonyspiteri.net/nsx-bites-nsx-controller-deployment-issues/

This blog series extends my NSX Bytes blog posts to include a more detailed look at how to deploy NSX 6.1.x into an existing vCloud Director environment. Initially we will be working with vCD 5.5.x, which is the non-SP fork of vCD, but as soon as an upgrade path for 5.5.2 -> 5.6.x is released I'll include the NSX-related improvements in that release.

ESXi 5.5 Update 2: vMotion Fails at 14% with Stale Admission Control and VM Reservations

We are currently in the process of upgrading all of our vCenter Clusters from ESXi 5.1 to 5.5 Update 2 and have come across a bug whereby the vMotion of VMs from the 5.1 hosts to the 5.5 hosts fails at 14% with the following error:

[UPDATE 6/10] –

Observed Conditions:

  • vCenter 5.5 U1/2 (U2 resulted in fewer 14% stalls, but they still occur)
  • Mixed Cluster of ESXi 5.1 and 5.5 Hosts (U1 or U2)
  • Has been observed happening in fully upgraded 5.5 U2 Cluster
  • VMs have various vCPU and vRAM configuration
  • VMs have vRAM Reservations and Unlimited vRAM/vCPU
  • VMs are vCD Managed

Observed Workarounds:

  • Restart Management Agents on vMotion Destination Host (hit + miss)
  • vMotion VM to 5.1 Host if available
  • Remove vRAM Reservation and Change to Unlimited vCPU/vRAM
  • Stop and start VM on different host (not ideal)

We are running vCenter 5.5 Update 1 with a number of clusters that were on ESXi 5.1, some of which act as Provider vDCs for vCloud Director. Upgrading the clusters which are not vCloud Providers (meaning VMs aren't vCD managed and don't have vCD reservations applied) didn't result in the issue, and we were able to upgrade all hosts to ESXi 5.5 Update 2 without issue.

There seemed to be no specific setting or configuration of the VMs that ultimately got stuck during a vMotion from a 5.1 to a 5.5 host; however, they all have memory reservations of various sizes based on our vCloud Allocation Pool settings.

Looking through the hostd logs on the 5.5 host acting as the destination for the vMotion we see the following entry:

Some of the key entries were:

After trying to scour the web for guidance, we came across this VMwareKB, which lists the error specifically but refers to VMs that are already powered off rather than ones powered on and under vMotion. In any case, the resolutions it suggests are not viable in our environment, namely the stopping and starting of VMs and rejigging of the Memory Reservation settings.

We engaged VMware Support and raised a Support Request…after a little time working with support it was discovered that this may be an internal bug which first appeared as fixed in ESXi 5.5 Express Patch 4 but appears to have slipped past the Update 2 release. The bug relates to stale Admission Control constraints, and the workaround suggested was to restart the management agents of the destination 5.5 host. In addition to that, if this was occurring in a vCloud Provider cluster, it was suggested that the host be disabled from within vCloud Director and the Redeploy All VMs action triggered, as shown below.

This effectively triggers a Maintenance Mode task in vCenter against the host, which to my mind is no different to triggering Maintenance Mode from vCenter directly…however the specific workaround in vCD environments was to use this method in conjunction with the management agent restart on the destination host.
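
For what it's worth, the management agent restart on the destination host can be scripted if SSH is enabled. The sketch below is just one way to do it (hostname and credentials are placeholders); the same result can be achieved from the DCUI or the host console:

```python
# Sketch: restart hostd and vpxa on the destination ESXi host over SSH.
# Assumes SSH is enabled on the host; hostname/credentials are placeholders.
import paramiko

HOST = "esxi-dest01.lab.local"          # the vMotion destination host
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOST, username="root", password="VMware1!")

for cmd in ("/etc/init.d/hostd restart", "/etc/init.d/vpxa restart"):
    stdin, stdout, stderr = ssh.exec_command(cmd)
    print(cmd, "->", stdout.read().decode().strip())
ssh.close()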

Results have been mixed up to this point, and we have faced VMs that simply won't vMotion at all even after trying every combination stated above. Our own workaround has been to shuffle those VMs to other 5.1 hosts in the cluster and hope that they will vMotion to a 5.5 host as we roll through the rest of the upgrades. Interestingly enough, we have also seen random behaviour where if, for example, 6 VMs are stuck at 14%, after an agent restart only 4 of them might be affected, and after subsequent restarts it might only be 2…this just tells me that the bug is fairly hit and miss and needs some more explaining as to the specific circumstances and reasoning that trigger the condition.

We are going to continue to work with VMware Support to try and get a more scientific workaround until the bug fix is released, and I will update this post with the outcomes of both of those action items when they become available…it's more of a rather inconvenient bug, but still, VM uptime is maintained and that's what's important.

If anyone has had similar issues feel free to comment.

Follow Up: vCloud 5.x IP Sub Allocation Pool Error …Fix Coming

A few months ago I wrote a quick post on a bug that existed in vCloud Director 5.1 in regards to IP Sub Allocation Pools and IPs being marked as in use when they should be available to allocate. What this leads to is a bunch of unusable IPs…meaning that they go to waste and pools can be exhausted quicker…

  • Unused external IP addresses from sub-allocated IP pools of the gateway failed after upgrading from vCloud Director 1.5.1 to vCloud Director 5.1.2
    After upgrading vCloud Director from version 1.5.1 to version 5.1.2, attempting to remove unused external IP addresses from sub-allocated IP pools of a gateway failed saying that IPs are in use. This issue is resolved in vCloud Director 5.1.3.

This condition also presents itself in vCloud 5.5 environments that have 1.5 lineage. Greenfield deployments don't seem affected…vCD 5.1.3 was supposed to contain the fix, but the release notes were released in error…we were then told that the fix would come in vCD 5.5…but when we upgraded our zones we still had the issue.

We engaged VMware Support again recently, and they finally have a fix for the bug due in vCD 5.5.2 (no word on those still running 5.1.x). My suggestion for those that can't wait for the next point release and are affected badly enough by the bug is to raise an SR and ask for the hotfix, which is an advanced build of the 5.5.2 release.

Thanks to the vCloud SP Development team for their continued support of vCD #longlivevCD
