ESXi 5.5 Update 2: vMotion Fails at 14% with Stale Admission Control and VM Reservations

We are currently in the process of upgrading all of our vCenter Clusters from ESXi 5.1 to 5.5 Update 2 and have come across a bug whereby the vMotion of VMs from the 5.1 hosts to the 5.5 hosts fails at 14% with an error on the destination host.

[UPDATE 6/10] –

Observed Conditions:

  • vCenter 5.5 U1/U2 (U2 resulted in fewer 14% stalls, but they still occur)
  • Mixed Cluster of ESXi 5.1 and 5.5 Hosts (U1 or U2)
  • Has been observed happening in fully upgraded 5.5 U2 Cluster
  • VMs have various vCPU and vRAM configurations
  • VMs have vRAM Reservations with Unlimited vRAM/vCPU limits
  • VMs are vCD Managed

Observed Workarounds:

  • Restart Management Agents on the vMotion destination host (hit and miss)
  • vMotion the VM to a 5.1 host if available
  • Remove the vRAM Reservation and change to Unlimited vCPU/vRAM (see the sketch after this list)
  • Stop and start the VM on a different host (not ideal)
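For anyone needing to apply the reservation workaround to more than a handful of VMs, the change can also be scripted against the vSphere API rather than clicked through VM by VM. Below is a minimal pyVmomi sketch, where the vCenter address, credentials and VM name are placeholders rather than values from this environment; it clears a VM's memory reservation and sets its CPU and memory limits to unlimited. Keep in mind that vCloud Director may reapply the Allocation Pool reservation the next time the VM is redeployed.

    # Minimal sketch: clear a VM's memory reservation and set CPU/memory limits
    # to unlimited via pyVmomi. All names and credentials below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab use only; validate certificates in production
    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="VMware1!", sslContext=ctx)
    content = si.RetrieveContent()

    # Locate the stuck VM by name (hypothetical name).
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "stuck-vm-01")

    # Reservation 0 and limit -1 (unlimited) for both memory and CPU.
    spec = vim.vm.ConfigSpec()
    spec.memoryAllocation = vim.ResourceAllocationInfo(reservation=0, limit=-1)
    spec.cpuAllocation = vim.ResourceAllocationInfo(reservation=0, limit=-1)
    WaitForTask(vm.ReconfigVM_Task(spec=spec))

    Disconnect(si)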

We are running vCenter 5.5 Update 1 with a number of Clusters that were on ESXi 5.1, some of which act as Provider vDCs for vCloud Director. Upgrading the Clusters which are not vCloud Providers (meaning VMs aren’t vCD managed and don’t have vCD reservations applied) didn’t result in the issue, and we were able to upgrade all hosts to ESXi 5.5 Update 2 without issue.

There seemed to be no specific setting or configuration of the VMs that ultimately got stuck during a vMotion from a 5.1 to a 5.5 host; however, they all have memory reservations of various sizes based on our vCloud Allocation Pool settings.
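To get a feel for which VMs are likely candidates, the reservations can be inventoried the same way. The following is a minimal pyVmomi sketch, again with placeholder vCenter and cluster names, that lists every VM in a cluster carrying a non-zero memory reservation along with its limit.

    # Minimal sketch: list memory reservations for VMs in a cluster via pyVmomi.
    # The vCenter address, credentials and cluster name are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="VMware1!", sslContext=ctx)
    content = si.RetrieveContent()

    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Provider-Cluster-01")

    for host in cluster.host:
        for vm in host.vm:
            alloc = vm.config.memoryAllocation      # vim.ResourceAllocationInfo
            if alloc.reservation:                   # reservation is in MB; 0 means none
                print("{} on {}: reservation {} MB, limit {}".format(
                    vm.name, host.name, alloc.reservation, alloc.limit))

    Disconnect(si)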

Looking through the hostd logs on the 5.5 Host acting as the destination for the vMotion, we found several key entries.
After scouring the web for guidance we came across this VMware KB, which lists the error specifically…but it covers VMs that are already powered off, not ones powered on and mid-vMotion. In any case the suggested resolutions are not viable in our environment…namely the stopping and starting of VMs and the rejigging of the Memory Reservation settings.
We engaged VMware Support and raised a Support Request…after a little time working with support it was discovered that this may be an internal bug which first appeared as fixed in ESXi 5.5 Express Patch 4 but appears to have slipped past the Update 2 release. The bug relates to stale Admission Control constraints, and the suggested workaround was to restart the management agents on the destination 5.5 host…in addition to that, if this occurs in a vCloud Provider Cluster it was suggested that the host be disabled from within vCloud Director and the Redeploy All VMs action triggered.
This effectively triggers a Maintenance Mode task in vCenter against the host, which to my mind is no different to triggering Maintenance Mode from vCenter directly…however the specific workaround in vCD environments was to use this method in conjunction with the management agent restart on the destination host.
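Since, as noted above, the Redeploy All VMs action effectively drives a Maintenance Mode task in vCenter, the same evacuation can be kicked off directly through the API. A minimal sketch, again with placeholder host and vCenter names:

    # Minimal sketch: put a host into maintenance mode via pyVmomi, which drives
    # the same evacuation vMotions where the 14% stall shows up. Placeholders only.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="VMware1!", sslContext=ctx)
    content = si.RetrieveContent()

    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.HostSystem], True)
    host = next(h for h in view.view if h.name == "esx51-01.example.com")

    # timeout=0 means wait indefinitely; with DRS in fully automated mode the
    # running VMs are vMotioned off the host as part of this task.
    WaitForTask(host.EnterMaintenanceMode_Task(timeout=0))

    Disconnect(si)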
Results have been mixed up to this point and we have faced VMs that simply won’t vMotion at all, even after trying every combination stated above. Our own workaround has been to shuffle those VMs to other 5.1 hosts in the cluster and hope that they will vMotion to a 5.5 host as we roll through the rest of the upgrades. Interestingly, we have also seen random behaviour where, if for example 6 VMs are stuck at 14%, after an agent restart only 4 of them might be affected, and after subsequent restarts it might only be 2…this tells me that the bug is fairly hit and miss and needs some more explanation as to the specific circumstances that trigger the condition.
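Where a stuck VM needs to be shuffled to a specific remaining 5.1 host, the targeted vMotion can also be issued through the API. A minimal sketch, with hypothetical VM and host names:

    # Minimal sketch: vMotion a VM to a specific host via pyVmomi (compute-only
    # relocation; storage stays where it is). All names are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="VMware1!", sslContext=ctx)
    content = si.RetrieveContent()

    def find(vimtype, name):
        """Return the first managed object of the given type with the given name."""
        view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
        return next(obj for obj in view.view if obj.name == name)

    vm = find(vim.VirtualMachine, "stuck-vm-01")
    target = find(vim.HostSystem, "esx51-02.example.com")

    # A RelocateSpec with only a host set performs a host (compute) vMotion.
    WaitForTask(vm.RelocateVM_Task(spec=vim.vm.RelocateSpec(host=target)))

    Disconnect(si)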
We are going to continue to work with VMware Support to try and get a more scientific workaround until the bug fix is released, and I will update this post with the outcomes of both of those action items when they become available…it’s a rather inconvenient bug…but still, VM uptime is maintained and that’s what matters most.
If anyone has had similar issues feel free to comment:

13 comments

  • We have the same issue. Currently on 5.5U1 including vCloud Director. The workaround of restarting the management agents and “Redeploy all VMs” made no difference to the end result. I look forward to your updates!

  • Power off the VM once and try again…it worked for me

  • Koen Verhelst

    I have the same issue. Any news from your support ticket?

    thx
    K

    • Hey there…no news on it, no. GSS wanted us to turn on ridiculous amounts of logging on VC and on the Hosts…but given we run 1000s of VMs in each zone that was not feasible…sometimes they don’t get real world operating conditions… At the end of the day we managed to work around it and have upgraded all hosts now…

      • Hi Anthony – we have the same issue – just wondering what version/patch you ended up upgrading to please.

        • Hey there…we actually didn’t get to resolve it all the way…having said that, I haven’t seen it happen across our clusters with DRS vMotions. Running 5.5 U2 Build 2068190

          • Ah I see – we see it with DRS fully enabled. Are you running on UCS? If so, what version?

  • Also have this issue, using vCloud Director. Entire environment is 5.5, only seems to be 1 VM affected, and the destination hosts have all been rebooted recently. Tried increasing the resource allocation for the vDC, but didn’t help. Did this ever have a resolution?

  • Did anyone find a working solution other than powering down the affected VMs?

    • Hey there…not specifically, but we have seen fewer and fewer instances of this happening. We are a couple of builds beyond what was listed in this post. About to go to U3a.

  • Careful:
    After upgrading from ESXi 5.0/5.1 to ESXi 5.5 Update 3 or later, vMotion fails with: failed to add memory page 0x201 to VM: Bad parameter (2143943)

    http://kb.vmware.com/kb/2143943