Monthly Archives: March 2016

Veeam Cloud Connect on World Availability Day

Today (30th of March) is Veeam’s World Availability Day. This is a new day that Veeam has declared to raise awareness of how availability plays a part in all organizations as an extension of backup and recovery. In itself, WAD is a marketing initiative from Veeam that backs onto World Backup Day…which is happening tomorrow (31st of March).

Availability plays a significant role for all organizations, regardless of industry or size. Enterprise organizations now operate 24/7/365, and availability has officially become a necessary survival tool. Modern users expect unfettered access to ALL applications and data, from streaming media to CRM tools and much more. When user expectations are not met, the ramifications are often costly and time-consuming, and they are likely to keep everyone, from the IT pro to the CEO, up all night.

Having worked in the Hosting and Cloud industry all my career I have had to deal with backup and recovery throughout that time…and for the most part it’s been a royal PITA! You only have to look back at the early days of this blog to see how frustrating and unreliable my experiences with backups have been. In recent times though the backup industry has improved, and with Veeam leading the way in the virtualisation space it’s become less of a pain and more about how to take the next step to achieve availability, as Veeam talks about in its marketing messages.

Veeam Cloud Connect and Cloud Connect Replication play a big part in extending the Veeam B&R platform into an availability suite, and allow for the wrapping of DRaaS around these service provider offerings. In February I took part in a joint Zettagrid/Veeam webinar where Nelson Simao and I went through Cloud Connect and how it benefits SPs and Veeam customers in getting key data offsite in the form of VM backups and replicas.

In the presentation I go through the Cloud Connect offerings and compare the differences between backup and replication as they pertain to availability, and cover the industry terms that sometimes get confused or misused along the way.

I’ve posted the video below, and the slide deck can be downloaded via the link beneath it.

So before World Backup Day tomorrow…have a happy World Availability Day today!


NSX Bytes: Controller Deployment Gone Bad?

With NSX becoming more widely available, more NSX home labs are being stood up, and with that the chances of the NSX Controllers failing due to “Home Lab” nested issues become more prevalent. The NSX Controllers are Ubuntu Linux VMs and, like any Linux VM, are fairly sensitive to storage latency and the other issues that appear in #NestedESXi or lab environments.

In one of my labs I came across an issue where I needed to redeploy all the NSX Controllers because the VMs had effectively broken after the storage was ripped out from under them…however when I went to redeploy, the latency of the underlying nested storage was still not that great and the deployment got stuck in a loop as shown below.

No matter what I tried…vCenter restart, NSX Manager reboot or host reboot…the status remained stuck in the spinning state. If I tried to deploy another controller I would get the following error.

Controller IP address allocation failed for reason : cluster already contains controller of IP x.x.x.x

In my case the VM existed with the IP address configured against it, however I could not access the CLI to check the NSX cluster status because the VM was in a pretty bad way.

Taking a look at the IP Pool allocations…even though the error said that the IP was in use, it wasn’t listed as such…meaning the deployment was trying to use the first IP in the pool regardless.

Before going into the fix, it should be noted that if this scenario were to happen and you were down to your last controller in production, you would be best served to call VMware Support and work through the restore options; without any controllers your VXLAN unicast traffic isn’t going to be updated via the VTEPs and things will eventually grind to a halt. It’s also worth reading the VMware docs on what to do if even one controller is lost in a cluster. If this is in a lab scenario…we can be a little harsher!

While the controller status is spinning in a Deploying state you can’t interact with it via the Web Client. You need to turn to the API to delete the NSX Controller and start again, or deploy a new cluster set. First you will need the controller ID, which can be easily seen via the Web Client. To remove the controller you call the API below using the DELETE method. If the stuck controller is the last one in the cluster you need to add the ?forceRemoval=true option at the end of the call.
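As a sketch, the call looks something like the following; the NSX Manager address, credentials and controller ID are placeholders for your own environment, and the command is echoed rather than executed so you can sanity-check it first:

```shell
# Placeholders -- substitute your own NSX Manager and the controller ID
# shown in the Web Client.
NSX_MANAGER="nsx-manager.lab.local"
CONTROLLER_ID="controller-2"

# DELETE removes the stuck controller; forceRemoval=true is only required
# when it is the last controller remaining in the cluster.
cmd="curl -k -u admin:password -X DELETE \
https://${NSX_MANAGER}/api/2.0/vdn/controller/${CONTROLLER_ID}?forceRemoval=true"

echo "$cmd"   # dry run -- remove the echo to actually make the call
```

Run against a live Manager you should get the 200 status and job data ID mentioned below; swap admin:password for your own credentials or an API token mechanism as appropriate.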

Once complete you should get a 200 status and a job data ID. If you check back in the Web Client you should see the Controller VM being deleted and removed from the list under Controller Nodes. We are now free of the deploying loop and can rebuild or extend the NSX Controller cluster as appropriate.


vSphere 6 Update 2 – What’s In It for Service Providers

It’s been just over a week since VMware released vSphere 6 Update 2 and I thought I would go through some of the key features and fixes that are included in the latest versions of vCenter and ESXi. As usual I generally keep an eye out for improvements that relate back to Service Providers who use vSphere as the foundation of their Managed or Infrastructure as a Service offerings.

New Features:

Without question the biggest new feature is the release of VSAN 6.2. I’ve covered this release in previous blog posts, and when you upgrade to ESXi 6.0 Update 2 the VSAN 6.2 bits are present within the kernel. Some VSAN services are actually in play regardless of whether you have it configured or not…which is interesting. With the new pricing for VSAN through the vCAN program, Service Providers can now seriously think about deploying VSAN for their main IaaS platforms.

The addition of support for high speed Ethernet links is significant. The 25G and 50G link speeds mean increased throughput for converged network cards, allowing more network traffic to flow through hosts and switches for Fault Tolerance, vMotion, Storage vMotion and storage traffic. It also allows SPs to think about building Edge Clusters for networking services such as NSX, with line speeds that can take advantage of even faster backends.

From a manageability point of view the HTML5 Host Client is a welcome addition and hopefully paves the way for more HTML5 management goodness from VMware, not only for hosts but also for vCenter itself. There is a fair bit of power already in the Host Client and I can bet that admins will use it more and more as it continues to evolve.

For vCenter, the addition of two-factor authentication using RSA or Smartcard technology is an important feature for SPs considering any sort of certification for their services; for example, many government certification schemes such as IRAP require it.

Resolved Issues:

There are a bunch of resolved issues in this build and I’ve gone through the rather extensive list to pull out the biggest fixes that relate to my experience in service provider operations.


  • Upgrading vCenter Server from 5.5 Update 3b to 6.0 Update 1b might fail if SSLv3 is disabled on port 7444 of vCenter Server 5.5 Update 3b. An upgrade from vCenter Server 5.5 Update 3b to 6.0 Update 2 works fine if SSLv3 is disabled by default on 7444 port of vCenter Server 5.5 Update 3b.
  • Deploying a vApp on vCloud Director through the vApp template fails with a Profile-Driven storage error. When you refresh the storage policy, an error message similar to the following is displayed: The entity vCenter Server is busy completing an operation.
  • Delta disk names of the source VM are retained in the disk names of the cloned VM. When you create a hot clone of a VM that has one or more snapshots, the delta disk names of the source VM are retained in the cloned VM.
  • vCenter Server service (vpxd) might fail during a virtual machine power on operation in a Distributed Resource Scheduler (DRS) cluster.


  • Hostd might stop responding when you execute esxcli commands using PowerCLI resulting in memory leaks and memory consumption exceeding the hard limit.
  • ESXi mClock I/O scheduler does not work as expected. The ESXi mClock I/O scheduler does not limit the I/Os with a lesser load even after you change the IOPS of the VM using the vSphere Client.
  • After you upgrade Virtual SAN environment to ESXi 6.0 Update 1b, the vCenter Server reports a false warning similar to the following in the Summary tab in the vSphere Web Client and the ESXi host shows a notification triangle
  • Attempts to perform vMotion might fail after you upgrade from ESXi 5.0 or 5.1 to 6.0 Update 1 or later releases. An error message similar to the following is written to the vmware.log file.
  • Virtual machine performance metrics are not displayed correctly as the performance counter cpu.system.summation for a VM is always displayed as 0
  • Attempts to perform vMotion with ESXi 6.0 virtual machines that have two 2 TB virtual disks created on ESXi 5.0 fail with an error messages similar to the following logged in the vpxd.log file:2015-09-28T10:00:28.721+09:00 info vpxd[xxxxx] [[email protected] sub=vpxLro opID=xxxxxxxx-xxxxxxxx-xx] [VpxLRO] — BEGIN task-919 — vm-281 — vim.VirtualMachine.relocate — xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx(xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

The mClock fix highlighted above is significant for those who were looking to use IOPS limiting. It’s basically been broken since 5.5 Update 2 and also impacts/changes the way in which IOPS are interpreted through the VM to storage stack. For service providers looking to introduce IOPS limits to control the impact of noisy neighbours, the fix is welcome.

As usual there are still a lot of known issues, and some have been added or updated in the release notes since release date. Overall the early noise coming out of the community is that this Update 2 release is relatively solid, and there have been improvements in network performance and general overall stability. Hopefully we don’t see a repeat of the 5.5 Update 2 issues or the bug problems that have plagued previous releases…and hopefully no more CBT issues!

vSphere 6.0 Update 2 has a lot of goodness for Service Providers and continues to offer the number one virtualisation platform on which to build managed and hosted services. Go grab it now and put it through its paces before pushing to production!


Quick Fix: VCSA MAC Address Conflict Migration Fail + Fix

The old changing-of-the-MAC-address causing NIC/IP issues has reared its ugly head again…this time during a planned migration of one of our VCSAs from one vCenter to another. I ran into an issue where the source VM, running on an ESXi 5.5 Update 3a host, had a MAC address that wasn’t compatible with the destination host running ESXi 6.0.

Somewhere along the line during the creation of this VM (and others in this particular cluster) the assigned MAC address conflicted with the reserved MAC ranges from VMware. There is a workaround, as mentioned in this post, but it was too late for the VCSA: upon reboot I could see that it couldn’t find eth0 and had created a new eth1 interface that didn’t have any IP config. The result was that all vCenter services struggled to start and eventually timed out, rendering the appliance pretty much dead in the water.

To fix the issue, firstly you need to note down the current MAC address assigned to the VM.

There is an additional eth interface picked up by the appliance that needs to be removed, and an adjustment made to the initial eth0 config. After boot, wait for the services to time out (this can take 20-30 minutes) and then ALT-F1 into the console. Login using the root account and enable the shell.

  • cd /etc/udev/rules.d/
  • Modify 70-persistent-net.rules and change the MAC address to the value recorded for eth0.
  • Comment out or remove the line corresponding to the eth1 interface.
  • Save and close the file and Reboot the Appliance.
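With the MAC address noted earlier in hand, the edit can be sketched as below. This runs against a scratch copy of the rules file so it is safe to try anywhere; the MAC addresses and rule contents are illustrative only, and on the appliance you would edit /etc/udev/rules.d/70-persistent-net.rules itself:

```shell
# Build a scratch copy of a typical 70-persistent-net.rules for the demo
RULES=$(mktemp)
cat > "$RULES" <<'EOF'
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:50:56:aa:bb:cc", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:50:56:dd:ee:ff", NAME="eth1"
EOF

NEW_MAC="00:50:56:11:22:33"   # the MAC currently assigned to the VM

# Point the eth0 rule at the VM's current MAC and comment out the stray eth1 rule
sed -e "s/ATTR{address}==\"[^\"]*\"\(.*NAME=\"eth0\"\)/ATTR{address}==\"${NEW_MAC}\"\1/" \
    -e '/NAME="eth1"/s/^/# /' "$RULES" > "${RULES}.fixed"

cat "${RULES}.fixed"
```

On the real appliance, replace the scratch file with the rules file path, review the result, then reboot as per the steps above.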

All things being equal you should have a working VCSA again.


CloudPhysics Exploration Mode – New Host View

Late last year CloudPhysics released their VM Exploration Mode feature, which allowed a detailed look into what was happening holistically to a VM, with the ability to view key metrics and VM related events over an extended period of time. Last weekend CloudPhysics extended this to also include Hosts. Extending Exploration Mode to Hosts further improves the proactive monitoring and analysis capabilities of the CloudPhysics platform as it looks to break away from its roots of Card Views.

With Exploration Mode now encompassing both VMs and hosts, administrators can focus in on a workload performance issue and “replay” the environment to correlate events, resource utilization patterns, and environment changes in the seconds, minutes or days leading up to a problem in application performance or availability.

To view a Host with Exploration Mode, you use the new Search Virtual Machines and Hosts bar at the top of the CloudPhysics Web Console.

Once the Host has been selected you are taken to a dashboard that gives you configuration details of the Host, any changes (power operations, snapshots, vMotions) that have occurred against that Host in the provided date range, and a performance graph that covers CPU, memory, network and storage. There is also an Issues section which alerts you to any possible configuration issues or mismatches.

There is also the introduction of a Tab View which allows you to have multiple Hosts and/or VMs open to compare against each other…what would be nice would be the ability to overlay both Hosts and VMs to try and pinpoint events or key metric points as they happened.

Below is a YouTube video from a recent webinar where CloudPhysics VP of Product Management Chris Schin walks through the way the platform uses Exploration Mode to identify root causes of VM performance issues.

If you are interested in giving CloudPhysics a try, they have a free edition which you can register for and download here: CloudPhysics Free Edition

Released: vCloud Director SP 8.0.1

Somewhat hidden among the major releases that came out of VMware HQ yesterday was a point release for vCloud Director SP 8.x. This takes the vCD SP build to 3635340. The build is mainly a maintenance release but does address a major issue with uploading and downloading OVFs and other media in Chrome. After the update, when clients hit the upload/download functionality in vCloud, the system will prompt to download a new version of the vCloud Director Client Integration Plug-In, which also installs a new service (vmware-csd.exe) that supports the OVF and media uploads and downloads.

There is also increased supportability for vSphere, NSX and vCNS. vCD 8.0.1 now supports vSphere versions 6.0U2, 6.0U1 and 5.5U3, NSX versions 6.2.2, 6.2.1, 6.1.6 and 6.1.5, as well as updated vCloud Networking and Security versions.

As usual I’ve gone through the Resolved Issues list and highlighted the ones I feel are most relevant…the ones in red are issues we have seen in our vCloud Zones and Zettagrid Labs.

  • vCloud Director Web client incorrectly displays a pending status for some objects
    When you select an object such as a virtual data center, network, or Edge gateway in the left pane of the vCloud Director Web client, vCloud Director sometimes incorrectly displays a status of Pending. This occurs when no actions have been performed on the object within the time limit set to clean up audit event data.
  • Attempts to add a second external network to an Edge gateway fail
    When you attempt to add a second external network to an Edge gateway, the operation fails with the following error message: Violation of UNIQUE KEY constraint 'uq_plac_subj_ite_su_id_i_u_i_t'. Cannot insert duplicate key in object 'dbo.placement_subject_item'.
  • vCloud Director does not migrate virtual machine after switching to a storage policy with SDRS disabled
    After you modify a virtual machine to use a storage policy with a datastore that has SDRS disabled, vCloud Director fails to migrate the virtual machine.
  • Attempts to change the storage policy of a virtual machine on a storage cluster sometimes fail
    When you attempt to change the storage policy of a virtual machine on a storage cluster, the operation sometimes fails with an error message similar to the following: Could not execute JDBC batch update; SQL [insert into placement_subject_item (existing_item, item_type, item_uri, subject_id, target_uri, id) values (?, ?, ?, ?, ?, ?)]; constraint [null]; nested exception is org.hibernate.exception.ConstraintViolationException: Could not execute JDBC batch update
    - Could not execute JDBC batch update
  • vCloud API incorrectly returns a virtual machine snapshot size of 0
    When you use the vCloud API call SnapshotSection, the vCloud API incorrectly returns a snapshot size of 0.

There are still a bunch of unresolved issues with workarounds, but again this shows the commitment to continuing to mature the platform…looking forward to the enhancements coming in the next major release. For those with the correct entitlements…download here.



VSAN 6.2 – Price Changes for Service Providers!

Who said big corporations don’t listen to their clients? VMware have come to the party in a huge way with the release of VSAN 6.2…and not only from a technical point of view. Ever since the release of VSAN, the pricing structure for vCloud Air Service Provider partners has been off the mark in terms of the commercial viability of deploying VSAN at scale. The existing model was hurting any potential uptake of the HCI platform beyond deployments for management clusters and the like.

I have been on VMware’s back since March 2014 when VSPP pricing was first revealed, and I wrote a detailed blog post back in October where I compared the different vCAN bundle options and showed some examples of how it did not scale.

For me, VMware needed to either tweak the vCAN cost model for VSAN to allow some form of tiering (ie 0-500GB at .08, 500-1000GB at .05, 1TB-5TB at .02 and so on) and/or change the metering from allocated GB to consumed GB, which allows Service Providers to take advantage of over provisioning and only pay for what’s actually being consumed in the VSAN cluster.
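To make the tiering idea concrete, here is a rough sketch of how a tiered per-GB charge would work; the breakpoints and point rates are the illustrative ones floated above, not published VMware pricing, and the example assumes a VSPP point value of US$1:

```shell
# Illustrative tiered charge: 0-500GB at .08, 500-1000GB at .05,
# everything above 1TB at .02 VSPP points per consumed GB.
consumed_gb=50000   # roughly 48TB of consumed VSAN storage

tiered_cost=$(awk -v gb="$consumed_gb" 'BEGIN {
    cost = 0
    t1 = (gb > 500)  ? 500 : gb;                          cost += t1 * 0.08
    t2 = (gb > 1000) ? 500 : (gb > 500 ? gb - 500 : 0);   cost += t2 * 0.05
    t3 = (gb > 1000) ? gb - 1000 : 0;                     cost += t3 * 0.02
    printf "%.2f", cost
}')

echo "Monthly VSPP point cost for ${consumed_gb}GB consumed: ${tiered_cost}"
```

At these hypothetical rates, 50TB consumed comes to just over a thousand points a month, versus four thousand at a flat .08, which is the scaling effect the tiers are meant to deliver.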

Since that post (obviously not only off the back of the noise I was making) the VSAN product and marketing teams have gone out to vCAN partners and spent time going over possible tweaks to the billing structure, surveying partners and trying to achieve the best balance going forward to help increase VSAN uptake.

With the release of VSAN 6.2 in ESXi 6.0 Update 2 this week, VMware have announced new pricing for vCAN Partners…the changes are significant and will represent a complete rethink of VSAN at scale for IaaS Providers. Furthermore the changes are also strategically important for VMware in an attempt to secure the storage market for existing vCAN partners.

The changes are indeed significant: not only is the billing metric now based on used or consumed storage per GB, but in somewhat of a surprise to me the VSPP points per month component has been slashed. Further, Enterprise Plus was rumored to be listed at .18 VSPP points per allocated GB, which would have priced out All Flash even more…now, with AF Enterprise costing in VSPP points per used GB as much as Standard used to, that whole conversation has changed.

Below is an example software-only cost for 10 hosts (64GB RAM) with 100TB of storage (60% used capacity on average), with an expected utilization of 80% and assuming 2 hosts are reserved for HA. Old numbers are in the brackets to the right and are based on VSAN Standard. It must be noted that these are rough numbers based on the new pricing; for the specifics of the new costings you will need to engage with your local vCAN Partner Account Manager.

  • VSAN 80TB Allocated (48TB Used): $1,966 ($6,400)
  • vRAM 410GB (205GB Reserved): $1,433
  • Total Per Month: $3,399 ($7,833)

If we scale that to 20 hosts with 128GB RAM and 200TB of storage (60% used capacity on average), with an expected utilization of 80% and assuming 4 hosts are reserved for HA:

  • VSAN 160TB Allocated (96TB Used): $3,932 ($12,800)
  • vRAM 1.6TB (820GB Reserved): $5,734
  • Total Per Month: $9,666 ($18,534)

In a real world example based on figures I’ve seen, taking into account just VSAN: if you have 500TB of storage provisioned, of which 200TB is consumed, with Advanced plus the Enterprise add-on the approximate cost of running VSAN comes down from ~$30K to ~$6K per month.

Service Providers can now take advantage of thin provisioning, and the change in metric to used or consumed storage makes VSAN a lot more attractive at scale…while there are still no break points in terms of total storage blocks, the conversation around VSAN being too expensive has now, for the most part, disappeared.

Well done to the VSAN and vCAN product and marketing teams!


These figures are based on my own calculations and assume a VSPP point value of US$1. This value will differ for vCAN partners depending on the bundle and points level they are on through the program. I have tried to be accurate with my figures, but errors and omissions may exist.

Veeam 9 Update 1 Released – Cloud Connect Enhancements

Last Friday, Veeam Cloud Service Provider members would have received an email informing them that Update 1 for Veeam Backup & Replication 9 had been RTM’ed and was available to download for Cloud Connect partners and selected Veeam customers with outstanding support cases. The update introduces a couple of significant feature enhancements for Cloud Connect Replication and a few outstanding bug fixes.

Unlike previous patch releases, service providers don’t have to upgrade right away, as the v9 server and client builds remain compatible with each other in either direction. I’d still recommend that VCSPs upgrade as soon as their change processes allow, to gain the extra functionality.

General Cloud Connect enhancements include enhanced logic for handling cloud gateway unavailability to prevent jobs from hanging for extended periods, improved job performance and reduced configuration database load thanks to optimized real-time job statistics queries, and the added ability to control Cloud Connect connection timeouts.

Cloud Connect Replication has had some features added…the biggest being support for replica seeding, support for replication from backups (except backup files stored in cloud repositories), fully automated upgrade of the network extension appliances during product updates, and improved failover issue reporting to tenants.

So we still wait for the ability to use Scale Out Repositories and single file VM backup chains…but hopefully that is next on the cards for the product teams.

Once the release goes GA in a week or so I’ll link to the Veeam KB for a detailed look at the fixes, but for the moment, if you have the ability to download the update, do so and have it applied to your instances.

NSX Bytes: Host Preparation Errors Out or Doesn’t Complete…Spins In Progress

I’ve had this situation happen a couple of times in production, where the NSX Host Preparation install or upgrade seems to hang forever without timing out, spinning the In Progress message. Most recently, in one of the Zettagrid Labs where we run NSX-v in #NestedESXi environments, we had the exact issue, and I decided (with the help of @santinidaniel) to get to the bottom of what could cause this process to spin without success.

Looking at the Event Log in vCenter against the host, you see the error below as the job eventually fails against the host…but it remains In Progress in the Web Client. Restarting the NSX Manager, the hosts or the Web Client services doesn’t stop the progress. To reset the NSX Web Client status and stop it thinking that the VIB deployments are still in progress, you need to kill the install job from the ESX Agent Manager (thanks @santinidaniel), which gives you the opportunity to Resolve/Uninstall the host modules again.

Once you get past the spinning you will see a failure and you will get an error message similar as highlighted below:

VIB module for agent is not installed on host <hostname> (_VCNS_xxx_Cluster_VMware Network Fabri)

This error is detailed in a couple of forum posts and in a VMware KB here. You can try to do a manual NSX module install, however as the KB suggests the problem more than likely lies with Update Manager. In my specific case Update Manager wasn’t happy with a recent vCenter SSL certificate upgrade and required a repair to accept the new SSL cert.
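If you do go down the manual module install route, the rough shape of the steps is sketched below as a dry run; the VIB bundle path on NSX Manager varies by NSX and ESXi version, so treat the URL here as an assumption and verify it against the KB for your build before running anything on a host:

```shell
# Placeholders -- your own NSX Manager, and a version-specific VIB path
NSX_MANAGER="nsx-manager.lab.local"
VIB_URL="https://${NSX_MANAGER}/bin/vdn/vibs/5.5/vxlan.zip"

# Printed rather than executed: these commands are meant to be run on the
# ESXi host shell itself, not wherever this sketch happens to run.
steps=$(cat <<EOF
wget --no-check-certificate ${VIB_URL} -O /tmp/vxlan.zip
esxcli software vib install -d /tmp/vxlan.zip
esxcli software vib list | grep -E 'esx-vsip|esx-vxlan'
EOF
)
echo "$steps"
```

The final grep is just a sanity check that the NSX kernel module VIBs landed on the host after the install.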

Once Update Manager was sorted, the Host Preparation tasks went through without issue and we were on our way. Case in point here: you should never forget about the supporting infrastructure…even in nested lab environments. Know your dependencies and requirements and ensure they are all in working order…it can save you some headaches along the way.


NSX Bytes: NSX-v 6.2.2 Released

Last week VMware released version 6.2.2 of NSX-v. The 6.2.2 release is mainly aimed at patching a security hole in the glibc libraries, as well as removing a constraint in the DHCP pool setup to allow .local domains in the config and improving the user experience when configuring the Distributed Firewall. After following the standard NSX upgrade process you will see the new NSX Manager build, and as shown below the NSX Controllers will be at build 6.2.46427.

In terms of the Distributed Firewall UI changes, I couldn’t spot any drastic changes except for some consistency tweaks to the Rule Name and Action options.

I’ve gone through the list of resolved issues and pulled out the fixes that impact my day to day with NSX-v the most. The ones in red are of extra significance.

  • Security patch to address the glibc vulnerability, CVE-2015-7547
    The 6.2.2 release delivers a security patch to address CVE-2015-7547.
  • Rules not pushed to host
    DFW rule/ip list updates failed to be scheduled due to task framework resource limitations in NSX Manager. Error message showed a failure to queue tasks for Change Notification threads.
  • Traffic interrupted for 50 seconds after HA failover on ESG
    This issue was caused when NSX failed to synchronize the static routes among the HA NSX Edge nodes.
  • NSX load balancer IP_HASH health check issue
    In IPVS, when using the source-ip hash algorithm, if the selected backend server’s weight equals 0, a “service unavailable” reply is sent even if there are healthy backend servers.
  • Packet sent to LIF without DHCP relay results in PSOD
    The ESXi host suffers a PSOD if a DHCP unicast packet is addressed to the IP of a LIF that is expected to have DHCP relay enabled but the actual receiving LIF does not have DHCP relay enabled.
  • DFW Publishing error
    Modifying and saving DFW rules in filtered mode may result in rules not being saved and published.

There is still a fairly long list of known issues, so make sure you are aware of what is still problematic in the product to ensure you are not impacted…even with that, it’s great to see so much work going into making NSX-v an even more reliable and stable platform.

