Monthly Archives: October 2014

vCloud Director: vCD SP 5.6.3 Release …with a catch!

“With regards to the upgrade path for SPs looking to deploy vCD SP there will be a build released approx. Q1 15 that will allow upgrades from 5.5.2”

On the 8th of October, VMware released the first vCloud Director build forked for the Service Provider community. vCD for SP 5.6.3 (Build 2116763) represents a milestone for vCloud Director as a product. While Enterprises will start to move away from vCD and look to vRealize Automation (previously vCloud Automation Center), for Service Providers VMware’s re-commitment to its partner network through the launch of the vCloud Air Network means that VMware was not about to kill off vCD. Instead it will continue to develop the platform for its own vCloud Air services and in turn feed feature upgrades through in the form of regular vCD SP build updates.

As of the vCD SP 5.6.3 release, all development on the GUI has stopped and all new features are made available only through the API. As part of this release there is an extended list of SDKs, which makes developing against the platform more streamlined for SP dev teams and offers more choice. Before and during VMworld 2014 I had a chance to speak to the vCD Product Team and get insight into the vision for the platform moving forward. There is no doubt in my mind that VMware is serious about continuing to release features to Service Providers, such as the ones mentioned at VMworld like Object Storage, DRaaS and DBaaS, and there is also the possibility for 3rd Party ISV Solutions to be seamlessly plugged into an existing deployment.

Updates in the SP 5.6.3 GA include:

  • Virtual machine monitoring – Expose current and historical VM performance metrics to tenants through this tenant-visible, multi-tenant-safe API. Using the API, tenants can troubleshoot application performance problems, auto-scale their applications, and perform capacity planning.
  • VM Disk Level Storage Profiles – Allows a single virtual machine (VM) to access different tiers of storage such as storage area network (SAN), network-attached storage (NAS), and local storage to help balance storage cost vs. storage performance. VMware vCloud Director 5.6 also supports VMware Virtual SAN.
  • VMware NSX Support through API Compatibility – Simple, secure software-defined networking using the NSX network virtualization platform in vCNS compatibility mode, in addition to the current VMware vCloud Networking and Security (vCNS) support.
  • Updated Software Development Kits (SDKs) to the vCloud API – New set of Java, PHP, and .NET SDKs with documentation and samples available.

The first two represent significant enhancements to the platform and give tenants more visibility and choice if the Service Provider chooses to expose the features. Note that VM Disk Level Storage Profiles come with a caveat: the feature is not exposed via the GUI, meaning SPs will need to write their own mechanisms for placing disks on separate vCenter Storage Policies, or allow tenants to make the change via the PowerCLI cmdlets for vCD. The downside is that the dropdown menu for the VM in the GUI will not show multiple policies, and using the dropdown configuration action could trigger a move back to a single policy.

The biggest feature in my opinion is the full support for NSX-v, which effectively means an existing vCNS setup can be upgraded in place to NSX, paving the way for hybrid solutions to be implemented. If you deploy a vCloud VSE once NSX-v has been deployed into an existing environment, the Edge version will be 5.5.3, and while there is no benefit in this Edge there is nothing stopping SPs from offering NSX’s enhanced 6.x Edges via VLAN-backed PortGroup Direct Connect configurations.

If you are a Service Provider, vCD SP 5.6.3 is now available for Download.

So that’s the good news…the bad news is that at the moment, if your vCD deployment is currently running 5.5.2 and you upgrade to 5.6.3, vCD will break!! While that’s about all the detail I have at the moment, there is apparently an issue where some patches that were included in 5.5.2 are not compatible with 5.6.3. The frustrating part for SPs is that there were some key bug fixes in 5.5.2, so I would dare say that most providers have deployed this build. I’m currently chasing up more details and will update this post as they come to hand.

In Vendors We Trust! Storage is God…

I tweeted recently that IT Professionals at some level must love pain…All through my working life I’ve dealt with outages and issues that cause me and others around me significant amounts of agita and pain. The IT industry is unique in that we, as professionals, must deal with issues that to a certain extent are caused by factors outside our control. This post is a vent about storage more than anything else…but after the past couple of weeks I felt I needed to release 🙂

With the advent of Hosting and Cloud technologies the pain once reserved for the enterprise or small business is now amplified on a much larger scale (dare I say webscale). Those of us working in the Service Provider industry now have much more responsibility to ensure continuity of an end-to-end, sometimes converged platform on which outages and service disruption mean multiple businesses are affected at once.

Those that know me and follow my posts know that I am a fan of the saying…

Outages are like Assholes…Everybody has them!

I’ve worked in the Service Provider industry all my working life, so I have certainly had my fair share of outages…in my experience a hosting platform suffers a significant outage once every 18-24 months with the common causes of most extended outages and service degradation being network or storage.

Certainly where most pain is felt is at the storage layer. Both Storage Area Networks and Backup Applications in my experience have been the cause for most outages and pain in my career and if you search through Google for Service Provider outages over the last 10 or so years the majority are traced back to storage issues.

For years we were at the mercy of the big Storage Vendors and had been dealing with the issues of monolithic SANs with a bunch of spinning disks and some cache drives, all sitting behind a couple of heads…more recently we have seen the emergence of new storage vendors who all seem to promise a world where IOPS are in the millions, latency is non-existent and (non)disruptive seems to be the general message. Platform Architects around the world are to a certain extent putting trust in technology that…for the most part, has always had issues and seems to work “most” of the time.

The problem with putting our collective faith in what the vendors promise is that it takes one bad outage to really screw things up. Generally when storage has issues…it really has issues. My experience is that for all the vendor guarantees, you only get to truly validate one’s storage decision during the first bad outage. Typically things work well initially but struggle at scale, and this is the single biggest issue with storage today…reliability of performance from the first consumed block of data through to the last.

One thing that frustrates me is that we effectively fork out the cost of a small house (depending on where you live), yet the industry seems to accept storage issues as a way of life…there is no doubt we are at the mercy of the vendors. They promise the world, take the cash and wait for the inevitable support call. This is where IT Professionals who are responsible for the management and operations of these platforms are on a hiding to nothing…no matter which way it’s spun, an outage falls on the Service Provider…and even though it’s 100% an issue within the storage vendor’s system, the end users couldn’t care less…and unless there is public shaming, storage vendors will move on to the next hundred 200K sales.

To be fair, we are currently going through an exciting time in the storage industry, with more players coming into the market trying to solve the issues that have plagued platforms for years…a lot of my industry peers and mates work for the new breed of Scale Out, AFA, hyperconverged and Flash Cache vendors and there is certainly more choice than ever…but the fact still remains that there is risk in every purchase decision and ultimately the proof is in the pudding.

Even though at some level we love the pain…surely there is a future where outages based on storage are less commonplace.

ESXi 5.5: IOPS Limit and mClock scheduler

It’s fair to say that it’s not very often the difference between a 1 and 0 can have such a massive impact on performance…I had read a couple of posts from Duncan Epping regarding a new Disk IO Scheduler introduced in ESXi 5.5 and what it meant for VM Disk Limits compared to how things were done in previous 5.x versions of ESXi.

As mentioned in the posts linked above, the ESXi Host Advanced Setting Disk.SchedulerWithReservation is activated by default in ESXi 5.5, so without warning in an upgrade scenario where IOPS limits are used the rules change…and from what I found out last night the results are not pretty.

The graphic above shows the total latency of a VM (on an NFS datastore with an IOPS limit set) running firstly on ESXi 5.1 up until the first arrow, when the VM was vMotioned to an in-place upgraded host now running ESXi 5.5 Update 2. As soon as the VM lands on the 5.5 host, latency skyrockets and remains high until there is another vMotion to another 5.5 host. Interestingly enough the overall latency on the other host isn’t as high as on the first, but grows in a linear fashion.

The second arrow represents the removal of the IOPS Limit, at which point latency falls to its lowest levels in the past 24 hours. The final arrow represents a test where the IOPS Limit was reapplied and the advanced setting Disk.SchedulerWithReservation was set to 0 on the host.

To make sense of the above, it seems that having the mClock Scheduler enabled caused the applied IOPS Limit to generate artificial latency against the VM, rendering it pretty much useless. The reasons for this are unknown at this point, but in reading Duncan’s blog on the IOPS Limit Caveat it would seem that because the new 5.5 mClock Scheduler looks at application-level block sizes to work out the IO, the applied limit actually crippled the VM. In this case the limit was set to 250 IOPS, so I am wondering if there is some unexpected behaviour happening here even if larger block sizes are being used in the VM.
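To put some rough numbers on the block size point, here is a minimal sketch. It assumes the scheduler charges IO against the limit in 32KB units, which is my reading of the caveat and not something confirmed here, and the rounding up of partial units is purely my own simplification. The function names are made up for illustration.

```python
import math

# Assumption: IO is charged against the limit in 32 KB units.
COST_UNIT_KB = 32


def charged_ios(io_size_kb, cost_unit_kb=COST_UNIT_KB):
    """How many 'limit credits' a single IO of this size would consume."""
    # Assumption: partial units round up, and every IO costs at least 1.
    return max(1, math.ceil(io_size_kb / cost_unit_kb))


def effective_iops(limit, io_size_kb):
    """Approximate real IOs per second the guest gets under the limit."""
    return limit / charged_ios(io_size_kb)


# The 250 IOPS limit used on the VM in this post.
for size_kb in (4, 32, 64, 256):
    real = effective_iops(250, size_kb)
    print(f"{size_kb:>3} KB IO -> roughly {real:.2f} real IOPS")
```

Under those assumptions a guest issuing 256KB IOs against a 250 IOPS limit would see only a small fraction of that in real IOs, which would at least be consistent with the latency jump, though it doesn’t prove that is what happened here.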

Suffice to say it looks like the smart thing to do is set Disk.SchedulerWithReservation to 0 and revert back to 5.0/5.1 behaviour with IOPS Limits in place. If you want to do that in bulk the following PowerCLI command will do the trick.
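The same change can also be made per host with esxcli. As a rough sketch of scripting that, the snippet below only builds the command lines and does not execute anything. The host names and the use of root over SSH are hypothetical placeholders, and it assumes SSH is enabled on the hosts.

```python
import shlex

# Advanced option path in the form esxcli expects.
OPTION = "/Disk/SchedulerWithReservation"


def build_commands(hosts, value=0, user="root"):
    """Return one ssh + esxcli command line per host (nothing is run)."""
    esxcli = f"esxcli system settings advanced set -o {OPTION} -i {int(value)}"
    return [f"ssh {user}@{shlex.quote(h)} {shlex.quote(esxcli)}" for h in hosts]


# Hypothetical host names, for illustration only.
for cmd in build_commands(["esx01.example.local", "esx02.example.local"]):
    print(cmd)
```

Printing the commands first makes it easy to sanity check the list of hosts before anything is actually changed.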

One more interesting observation I made is that VMs with IOPS limits on iSCSI datastores appear to be unaffected or less affected…however most large NFS datastores with a large number of VMs were. You can see below what happens to datastore latency when I switched off the mClock Scheduler…latency dropped instantly.

I’m not sure if this indicates more general NFS issues with ESXi 5.5…there seem to have been more than a few since Update 1 came out. I’ve reached out to see if this behaviour can be explained, so hopefully I can provide an update when that information comes to hand…again, I’m not too sure what the benefit of this change in behaviour is, so I’m hoping someone who has managed to digest the Brain Hurting Academic Paper can explain why this was introduced.

In layman’s terms…an IO is not an IO in the world of VM IOPS Limits anymore.


ESXi 5.5 Update 2: vMotion Fails at 14% with Stale Admission Control and VM Reservations

We are currently in the process of upgrading all of our vCenter Clusters from ESXi 5.1 to 5.5 Update 2 and have come across a bug whereby the vMotion of VMs from the 5.1 hosts to the 5.5 hosts fails at 14% with the following error:

[UPDATE 6/10] –

Observed Conditions:

  • vCenter 5.5 U1/2 (U2 resulted in fewer 14% stalls, but they still occur)
  • Mixed Cluster of ESXi 5.1 and 5.5 Hosts (U1 or U2)
  • Has been observed happening in fully upgraded 5.5 U2 Cluster
  • VMs have various vCPU and vRAM configuration
  • VMs have vRAM Reservations Unlimited vRAM/vCPU
  • VMs are vCD Managed

Observed Workarounds:

  • Restart Management Agents on vMotion Destination Host (hit and miss)
  • vMotion VM to 5.1 Host if available
  • Remove vRAM Reservation and Change to Unlimited vCPU/vRAM
  • Stop and start VM on different host (not ideal)

We are running vCenter 5.5 Update 1 with a number of Clusters that were on ESXi 5.1, some of which act as Provider vDCs for vCloud Director. Upgrading the Clusters which are not vCloud Providers (meaning VMs aren’t vCD managed and don’t have vCD reservations applied) didn’t result in the issue and we were able to upgrade all hosts to ESXi 5.5 Update 2 without issue.

There seemed to be no specific setting or configuration of the VMs that ultimately got stuck during a vMotion from a 5.1 to a 5.5 host; however, they all have memory reservations of various sizes based on our vCloud Allocation Pool settings.

Looking through the hostd logs on the 5.5 Host acting as the destination for the vMotion we see the following entry:

Some of the key entries were:

After scouring the web for guidance we came across this VMware KB which lists the error specifically…but it refers to VMs that are already powered off, not ones powered on and under vMotion…in any case the resolutions that are suggested are not viable in our environment…namely the stopping and starting of VMs and rejigging of the Memory Reservation settings.

We engaged VMware Support and raised a Support Request…after a little time working with support it was discovered that this may be an internal bug which first appeared as fixed in ESXi 5.5 Extras Patch 4 but appears to have slipped past the Update 2 release. The bug relates to stale Admission Control Constraints and the workaround suggested was to restart the management agents of the destination 5.5 host…in addition to that, if this was occurring in a vCloud Provider Cluster it was suggested that the Host be disabled from within vCloud Director and the Redeploy All VMs action then be triggered, as shown below.

This effectively triggers a Maintenance Mode task in vCenter against the host, which for mine is no different to triggering Maintenance Mode from vCenter directly…however the specific workaround in vCD environments was to use this method in conjunction with the destination Host management agent restart.

Results have been mixed up to this point and we have faced VMs that simply won’t vMotion at all even after trying every combination stated above. Our own workaround has been to shuffle those VMs to other 5.1 hosts in the cluster and hope that they will vMotion to a 5.5 host as we roll through the rest of the upgrades. Interestingly enough we have also seen random behaviour where if, for example, 6 VMs are stuck at 14%…after an agent restart only 4 of them might be affected and in subsequent restarts it might only be 2…this just tells me that the bug is fairly hit and miss and needs some more explaining as to the specific circumstances and reasoning that trigger the condition.

We are going to continue to work with VMware Support to try and get a more scientific workaround until the bug fix is released and I will update this post with the outcomes of both of those action items when they become available…it’s more of a rather inconvenient bug…but still…VM uptime is maintained and that’s all important.

If anyone has had similar issues feel free to comment: