Category Archives: Storage

Cloud Tier Data Migration between AWS and Azure… or anywhere in between!

At the recent Cloud Field Day 5 (CFD#5) I presented a deep dive on the Veeam Cloud Tier, which was released as a feature extension of our Scale-Out Backup Repository (SOBR) in Update 4 of Veeam Backup & Replication. Since we went GA we have been able to track the success of the feature by looking at public cloud Object Storage consumption by the Veeam customers using it. As of last week, customers have offloaded petabytes of backup data into Azure Blob and Amazon S3…and that's not counting the data being offloaded to other Object Storage repositories.

During the Cloud Field Day 5 presentation, Michael Cade talked about the portability of Veeam's data format and how we do not lock our customers into any specific hardware or a format that requires a specific underlying file system. We offer complete flexibility in where your data is stored, and the same is true when choosing an Object Storage platform for offloading data with the Cloud Tier.

I recently needed to set up a Capacity Tier extent backed by an Object Storage Repository on Azure Blob. I wanted to use the same backup data that I had in an existing Amazon S3 backed Capacity Tier while still keeping things clean in my Backup & Replication console…luckily we have built in a way to migrate to a new Object Storage Repository, taking advantage of the innovative tech we have built into the Cloud Tier.

Cloud Tier Data Migration:

During the offload process, data is tiered from the Performance Tier to the Capacity Tier, effectively dehydrating the VBK files of all backup data and leaving only the metadata, with an index that points to where the data blocks have been offloaded in the Object Storage.

This process can also be reversed and the VBK file can be rehydrated. The ability to bring the data back from Capacity Tier to the Performance Tier means that if there was ever a requirement to evacuate or migrate away from a particular Object Storage Provider, the ability to do so is built into Backup & Replication.

In this small example, as you can see below, the SOBR was configured with a Capacity Tier backed by Amazon S3 and using about 15GB of Object Storage.

The first step is to download the data back from the Object Storage and rehydrate the VBK files on the Performance Tier extents.

There are two ways to achieve the rehydration or download operation.

  1. Via the Backup & Replication Console
  2. Via a PowerShell Cmdlet
Rehydration via the Console:

From the Home Menu under Backups, right click on the Job Name and select Backup Properties. From here there is a list of the files contained within the job and also the objects that they contain. Depending on where the data is stored (remembering that the data blocks are only ever in one location…the Performance Tier or the Capacity Tier) the icon against the file name will be slightly different, with offloaded files represented by a cloud.

Right clicking on any of these files will give you the option to copy the data back to the Performance Tier. You have the choice to copy back the backup file alone or the backup file and all its dependencies.

Once this is selected, a SOBR Download job is kicked off and the data is moved back to the Performance Tier. It’s important to note that our Intelligent Block Recovery will come into play here and look at the local data blocks to see if any match what is trying to be downloaded from the Object Storage… if so it will copy them from the Performance Tier, saving on egress charges and also speeding up the process.

In the image above you can see the Download Job working and only downloading 95.5MB from Object Storage, with 15.1GB copied from the Performance Tier…meaning that, for the most part, data blocks that were already local were able to be used for the rehydration.

The one caveat to this method is that you can't select files in bulk or multiple backup jobs, so the process to rehydrate everything from the Capacity Tier can be tedious.

Rehydration via PowerShell:

To solve that problem we can use PowerShell to call the Start-VBRDownloadBackupFile cmdlet to do the bulk of the work for us. Below are the steps I used to get the backup job details, feed them through to a variable that contains all the file names, and then kick off the Download Job.
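
Here is a minimal sketch of that flow, assuming the cmdlet names and parameters documented in the Update 4 PowerShell reference linked below (the job name is hypothetical, and the availability of Get-VBRBackupFile may vary slightly by build):

    # Get the backup job details (hypothetical job name)
    $backup = Get-VBRBackup -Name "Lab Backup Job"

    # Feed the backup files belonging to that backup into a variable
    $files = Get-VBRBackupFile -Backup $backup

    # Kick off the SOBR Download Job to rehydrate those files on the Performance Tier
    Start-VBRDownloadBackupFile -BackupFile $files -RunAsync

    # Last command: after the Capacity Tier has been re-pointed at the new Object Storage
    # Repository, start the offload to dehydrate the sealed chains again
    Start-VBROffloadBackupFile -Backup $backup -RunAsync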

The PowerShell window will then show the Download Job running

Completing the Migration:

No matter which way the Download Job is initiated, we can see the progress from the Backup & Replication Console under the Jobs section.

And looking at the Disk and Network sections of Windows Resource Monitor we can see connections to Amazon S3 pulling the required blocks of data down.

Once the Download job has been completed and all VBKs have been rehydrated, the next step is to change the configuration of the SOBR Capacity Tier to point at the Object Storage Repository backed by Azure Blob.

The final step is to initiate an offload to the new Capacity Tier via an Offload Job…this can be triggered via the console or via PowerShell (as shown in the last command of the PowerShell code above) and, because we already have a set of data that satisfies the conditions for offload (sealed chains and backups outside the operational restore window), data will be dehydrated once again…but this time up to Azure Blob.

The used space shown below in the Azure Blob Object Storage matches the used space initially consumed in Amazon S3. All recovery operations show Restore Points on the Performance Tier and on the Capacity Tier as dictated by the operational restore window policy.

Conclusion:

As mentioned in the intro, the ability for Veeam customers to have control of their data is an important principle revolving around data portability. With the Cloud Tier we have extended that by allowing you to choose the Object Storage Repository of your choice for cloud-based storage of Veeam backup data…but we have also given you the option to pull that data out and shift it when and where desired. Migrating data between AWS, Azure or any other platform is easily achieved and can be done without too much hassle.

References:

https://helpcenter.veeam.com/docs/backup/powershell/object_storage_data_transfer.html?ver=95u4

Update 4 for Service Providers – Extending Backup Repositories to Object Storage with Cloud Tier

When Veeam Backup & Replication 9.5 Update 4 went Generally Available in late January I posted a What's in it for Service Providers blog. In that post I briefly outlined all the new features and enhancements in Update 4 as they related to our Veeam Cloud and Service Providers. As mentioned, each new major feature deserves its own separate post. I've covered off the majority of the new features so far, and today I'm covering what I believe is Veeam's most innovative feature released of late…the Cloud Tier.

As a reminder here are the top new features and enhancements in Update 4 for VCSPs.

Cloud Tier:

When I was in charge of the architecture and design of Service Provider backup platforms, without question the hardest and most challenging aspect of designing the backend storage was how to facilitate storage consumption and growth. The thirst to back up workloads into the cloud continues to grow, and with it comes the growth of that data and the desire to store it for longer. Even yesterday I was talking to a large Veeam Cloud & Service Provider who was experiencing similar challenges managing their Cloud Connect and IaaS backup repositories.

Cloud Tier in Update 4 fundamentally changes the way in which the initial landing zone for backups is designed. With the ability to offload backup data to cheaper storage, the Cloud Tier, which is part of the Scale-Out Backup Repository, allows for a more streamlined and efficient Performance Tier of backup repository while leveraging scalable Object Storage for the Capacity Tier.

How it Works:

The innovative technology we have built into this feature allows data to be stripped out of Veeam backup files (which are part of a sealed chain) and offloaded as blocks of data to Object Storage, leaving a dehydrated Veeam backup file on the local extents with just the metadata remaining in place. This is done based on a policy set against the Scale-Out Backup Repository that dictates the operational restore window during which local storage is used as the primary landing zone for backup data, and it is processed as a Tiering Job every four hours. The result is a space-saving, smaller footprint on the local storage without sacrificing any of Veeam's industry-leading recovery operations. This is what truly sets this feature apart and means that even with data residing in the Capacity Tier, you can still perform:

  • Instant VM Recoveries
  • Entire computer and disk-level restores
  • File-level and item-level restores
  • Direct Restore to Amazon EC2, Azure and Azure Stack
What this Means for VCSPs:

Put simply, it means that for providers who want to offload backup data to cheaper storage while maintaining a high-performance landing zone where more recent backup data lives, the Cloud Tier is highly recommended. If there are existing space issues on the local SOBR repositories, implementing Cloud Tier will relieve pressure and in reality allow VCSPs to avoid further hardware purchases to expand the storage platforms backing those repositories.

When it comes to Cloud Connect Backup, the fact that Backup Copy Jobs are statistically the most used form of offsite backup sent to VCSPs means the potential for savings is significant. Self-contained GFS backup files are prime candidates for Cloud Tier offload and, given that they are generally kept for extended periods of time, they also represent a large percentage of the data stored on repositories.

Having a look below you can see an example of a Cloud Connect Backup Copy job from the VCSP side when browsing from Explorer.

You can see the GFS files are all about 22MB in size. This is because they are dehydrated VBKs with only metadata remaining locally. Those files were originally about 10GB before the offload job was run against them.

Wrap Up:

With the small example shown above, VCSPs should be starting to understand the potential impact Cloud Tier can have on the way they design and manage their backup repositories. The ability to leverage Amazon S3, Azure Blob and any S3 Compatible Object Storage platform means that VCSPs have the choice in regards to what storage they use for the Capacity Tier. If you are a VCSP and haven't looked at how Cloud Tier can work for your service offering…what are you waiting for?

Glossary:

Object Storage Repository -> Name given to a repository that is backed by Amazon S3, S3 Compatible storage, Azure Blob or IBM Cloud Object Storage

Capacity Tier -> Name given to extent on a SOBR using an Object Storage Repository

Cloud Tier -> Marketing name given to feature in Update 4

Resources:

Harness the power of cloud storage for long-term retention with Veeam Cloud Tier

Quick Look: Cloud Tier SOBR Offload Job

With the release of Update 4 for Veeam Backup & Replication 9.5 we introduced the Cloud Tier, which is an extension of the Scale-Out Backup Repository (SOBR). The Cloud Tier allows data to be stripped out of Veeam backup files and offloaded as blocks of data to Object Storage, leaving a dehydrated Veeam backup file on the local extents with just the metadata remaining in place. This is done based on a policy set against the SOBR that dictates the operational restore window during which local storage is used as the primary landing zone for backup data. The result is a space-saving, smaller footprint on the local storage.

Overview of Offload Job:

By default the offload job is run against the data located on the Performance Tier extents of the SOBR every 4 hours. This is a set value that cannot be changed. To offload the backup data to the Capacity Tier, the Offload Job does the following:

  • Verifies whether backup chains located on the Performance Tier extents satisfy validation criteria and can be offloaded to object storage.
  • Collects verified backup chains from each Performance Tier extent and sends them directly to object storage in the form of data blocks.
  • Saves each session results to the configuration database so that you can review them upon request.

The job and job details can be viewed from the History Menu under System or the Home Menu under Last 24 Hours.

The details of the job will show how much data was offloaded to the Capacity Tier per VM residing on the SOBR. It will show statistics on how much data was processed, read and transferred. Once this job has completed, the local backup files only contain job metadata with the data residing on the Object Storage.

Forcing The Offload Job:

As mentioned, the Offload Job by default is set to run every 4 hours from the initial configuration of the Capacity Tier extent on the SOBR. The default value of 4 hours cannot be modified, however if you want to force the job to run you have two options.

The first option is through the UI: under the Backup Infrastructure Menu and under Scale-Out Repositories, do a CONTROL+Click against the SOBR and select the Run Tiering Job Now option. This option is hidden by default and will only be shown with the CONTROL+Click.

The second option is to run the following PowerShell command:
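
Something along these lines, assuming the Start-VBROffloadBackupFile cmdlet and parameters as documented in the Update 4 PowerShell reference (the job name is hypothetical):

    # Hypothetical job name; grab the backup that lives on the SOBR
    $backup = Get-VBRBackup -Name "Lab Backup Job"

    # Kick off an offload session on demand for that backup
    Start-VBROffloadBackupFile -Backup $backup -RunAsync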

This triggers the Offload Job to run.

Note that once the Offload Job has been forced, the 4 hour counter is reset to when the job was run…i.e. the next job will run 4 hours from the time the job was forced.

It’s important to understand that running the job on demand doesn’t necessarily mean that you will offload data to the Capacity Tier any quicker. The conditions around the operational restore window and sealed backup chains still need to be in place for the job to do its thing. Having the job run six times a day (every 4 hours) is generally going to be more than enough for most instances.

If no data has been offloaded, you will see the following in the job details:

Wrap Up and More Cloud Tier:

To learn more about the Cloud Tier head to my veeam.com post here, and also check out Rhys Hammond’s post here. Also look out for a new Veeam White Paper being released in the next month or so which will deep dive into the Cloud Tier in more detail. I will publish a few more posts on the Cloud Tier over the next few weeks as well, looking at some more use cases and features.

References:

https://helpcenter.veeam.com/docs/backup/vsphere/capacity_tier.html?ver=95u4

How to Copy Amazon S3 Buckets with AWS CLI

I am doing some work on validated restore scenarios using the new Veeam Cloud Tier that is backed by an Object Storage Repository pointing at an Amazon S3 bucket. So that I am not messing with the live data, I wanted a way to copy and access the objects from another bucket or folder. There is no option at the moment to achieve this via the AWS Console, however it can be done via the AWS CLI.

First step was to ensure I had the AWS CLI installed on my MBP and it was at the latest version:
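
Something like the following, assuming a pip-based install as per the macOS install guide linked in the references below:

    # Check the currently installed AWS CLI version
    aws --version

    # Upgrade via pip if it is out of date (per the AWS macOS install guide)
    pip3 install awscli --upgrade --user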

For the first part of the copy process, I cheated and created a new Bucket from the AWS Console that was based on the one I wanted to copy.

The next step is to make sure that the AWS CLI is configured with the correct AWS Access and Secret keys. Once done, the command to copy/sync buckets is a simple one.
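
Something along these lines, with hypothetical bucket names standing in for the source and destination:

    # One-time setup: prompts for the Access Key, Secret Key, default region and output format
    aws configure

    # Copy/sync all objects from the source bucket to the new destination bucket
    aws s3 sync s3://source-bucket-name s3://destination-bucket-name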

Obviously the time to complete the operation will depend on the number of objects in the bucket and whether it’s cross-region or local. It took about 4 hours to copy across ~50GB of data from US-EAST-2 to US-WEST-2, going at about 4MB/s. By default the progress is shown on the screen.

Once the first pass was complete I ran the same command again, which this time looks for differences between the source and destination and only syncs the differences. You can run the command below to view the Total Objects and Total Size of both buckets for comparison.
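
Again with hypothetical bucket names, the summary at the end of this listing shows the Total Objects and Total Size for each bucket:

    aws s3 ls s3://source-bucket-name --recursive --human-readable --summarize
    aws s3 ls s3://destination-bucket-name --recursive --human-readable --summarize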

That is it! Pretty simple process. I’ll blog about the actual reason behind the Veeam Cloud Tier requirement and put this into action at a later date!

References:

https://docs.aws.amazon.com/cli/latest/userguide/install-macos.html

https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket

More Than Meets the Eye… Veeam Backup Performance

Recently I was sent a link to a video that showed an end user comparing Veeam to a competitor’s offering, covering backup performance, restore capabilities and UI. It mainly focused on the comparison of incremental backup jobs and their completion times. It showed that the Veeam job was taking a significantly longer time to complete for the same dataset. The comparison was chalk and cheese and didn’t paint Veeam in a very good light.

Now, without knowing 100% the backend configuration that the user was testing against, or the configuration of the Veeam components, storage platforms and backup jobs versus the competitor’s setup…the discrepancy between the job completion times was too great and something had to be amiss. This was not an apples to apples comparison.

TL;DR – Starting from the default configuration settings and server setup, I was able to cut the time to complete an incremental backup job from 24 minutes to under 4 minutes by scaling out Veeam infrastructure components and tweaking transport mode options to suit the dataset. The lesson is to not take inferred performance at face value; there are a lot of factors that go into backup speed.

Before I continue, it’s important for me to state that I have seen Veeam perform exceptionally well under a number of different scenarios, and I know from my own experience in previous roles at large service providers that it can handle 1000s of VMs and scale up to handle larger environments. That said, like any environment you need to understand how to properly scope and size backup components to suit…and that includes more than just the backup server and Veeam components…storage obviously plays a huge role in backup performance, as does the design of the virtualisation platform as well as networking.

I haven’t set out in this post to put together a guide on how to scale Veeam…rather I have focused on trying to debunk the differential in job completion time I saw in the video. I went into my lab and started to think about how scaling Veeam components and choosing different options for backups and proxies can hugely impact the time it takes for backup jobs to complete. For the testing I used a Veeam Backup & Replication server that I had deployed with the Update 4 release, with active jobs that had been in operation for more than a month.

The Veeam Backup & Replication server is a VMware virtual machine running with a modest 2 vCPUs and 8GB of RAM. Initially I had this running as an all-in-one Backup Server and Proxy setup. I have a SOBR repository consisting of two ReFS-formatted local VMDK extents (underlying storage is vSAN) and a Capacity Tier extent going to Amazon S3. The backup job consisted of nine VMs with a footprint of about 162GB. A small dataset, but one which was based on real-world workloads. The job was running Forward Incremental, keeping 14 restore points, running every 4 hours with a Synthetic Full every 24 hours (the initial purpose was to demo the Cloud Tier), and on average the incrementals were taking between 23 and 25 minutes to complete.

The time to complete the incremental job was not an issue for me in the lab, but it provided a good opportunity to test out what would happen if I looked to scale out the Veeam components and tweak the default configuration settings.

Adding Proxies

As a first step I deployed three virtual proxies (2 vCPUs and 4GB RAM each) into the environment and configured the job to use them in hot-add mode. Right away the job time decreased by ~50% to 12 minutes. Basically, more proxies means more disks can be processed in parallel when in hot-add mode, so it’s logical that the speed of the backup would increase.

Adding More Proxies

As a second step I deployed three more proxies into the environment and configured the job to use all six in hot-add mode. This didn’t result in a significantly faster time than with three proxies, but again, this will vary depending on the number of VMs in a job and the size of their disks. Again, Veeam offers the flexibility to scale and grow with the environment. This is not a one size fits all approach and you are not locked into a particular appliance size that may max out, requiring additional significant spend.

Change Transport Mode

Next I changed the job back to use three proxies, but this time I forced the proxies to use network mode. To read more about Transport modes, head here.

This resulted in a sub 4 minute job completion reading a similar incremental data set to the previous runs. A ~20 minute difference after just a few tweaks of the configuration!

Removing Surplus Proxies and Balancing Things Out

For the examples above I introduced more proxies, however the right balance of proxies and transport mode was the optimal configuration for this particular job in order to lower the job completion window. In fact, in my last test I was able to get the job to complete consistently around the 5 minute mark by using just one proxy with network mode.

Conclusion:

So with that, you can see that by tweaking some settings and scaling out Veeam components I was able to bring the job completion time down by more than 20 minutes. Veeam offers the flexibility to scale and grow with any environment. This is not a one size fits all approach, and you are not locked into a particular appliance size that requires additional and significant spend to scale out while also locking you in by way of restricted backup data portability. Again, this is just a quick example of what can be done with the flexibility of the Veeam platform, and what you see as a default out of the box experience (or a poorly configured/problematic environment) isn’t what should be expected for all use cases. Mileage will vary…but don’t let first or misleading impressions sway you…there is always more than meets the eye!

Sources:

https://bp.veeam.expert/

Hybrid World… Why IBM buying RedHat makes sense!

As Red October came to a close…at a time when US tech stocks were taking their biggest battering in a long time…the news came out over the weekend that IBM had acquired RedHat for 34 billion dollars! This seems to have taken the tech world by surprise…the all-cash deal represents a massive 63% premium on the previous close of RedHat’s stock price…all in all it seems ludicrous.

Most people that I’ve talked to about it, and most comments I’ve read on social media and blog sites, suggest that the deal is horrible for the industry…but I feel this is more a reaction to IBM than anything. IBM has a reputation for swallowing up companies whole and spitting them out the other side of the merger process a shell of what they once were. There has also been a lot of empathy for the employees of RedHat, especially from ex-IBM employees who have experience inside the Big Blue machine.

I’m no expert on M&A and I don’t pretend to understand the mechanics behind the deal and what is involved…but when I look at what RedHat has in its stable, I can see why IBM have made such an aggressive play for them. On the surface it seems like IBM are in trouble, with their stock price and market capitalization falling nearly 20% this year and more than 30% in the last five years…they had to make a big move!

IBM’s previous 2013 acquisition of SoftLayer (for a measly 2 billion USD) helped them remain competitive in the Infrastructure as a Service space and if you believe the stories, have done very well out of integrating the SoftLayer platform into what was BlueMix, and is now IBM Cloud. This 2013 Forbes article on the acquisition sheds some light as to why this RedHat acquisition makes sense and is true to form for IBM.

IBM sees the shift of big companies moving to the cloud as a 20-year trend…

That was five years ago…and since then a lot has happened in the Cloud world. Hybrid cloud is now the accepted route to market with a mix of on-premises, IaaS and PaaS hosted and hyper-scale public cloud services being the norm. There is no one cloud to rule them all! And even though AWS and Azure continue to dominate and be front of mind there is still a lot of choice out there when it comes to how companies want to consume their cloud services.

Looking at RedHat’s stable and taking away the obvious Linux distros that are both enterprise and open source, the real sweet spot of the deal lies in RedHat’s products that contribute to hybrid cloud.

I’ve heard a lot more noise of late about RedHat OpenStack becoming the platform of choice as companies look to transform away from more traditional VMware/Hyper-V based platforms. RedHat OpenShift is also being considered as an enterprise-ready platform for containerization of workloads. Some sectors of the industry (government and universities) have already decided on their move to platforms that are backed by RedHat…the one thing I would comment here is that there was an upside to that which might now be clouded by IBM being in the mix.

Rounding out the stable, RedHat have a Cloud Suite which encompasses most of the products listed above: CloudForms for Infrastructure as Code, Ansible for orchestration, together with RedHat Virtualization, OpenStack and OpenShift…it’s a decent proposition!

Put all that together with the current services of IBM Cloud and you start to have a compelling portfolio covering almost all desired aspects of hybrid and multi cloud service offerings. If the acquisition of SoftLayer was the start of a 20-year trend, then IBM are trying to keep themselves positioned ahead of the curve and very much in step with the next evolution of that trend. That isn’t to say that they are not playing catchup with the likes of VMware, Microsoft, Amazon, Google and the like, but I truly believe that if they don’t butcher this deal they will come out a lot stronger and, more importantly, offer valid competition in the market…that can only be a good thing!

As for what it means for RedHat itself, their employees and culture…that I don’t know.

References:

https://www.redhat.com/en/about/press-releases/ibm-acquire-red-hat-completely-changing-cloud-landscape-and-becoming-world%E2%80%99s-1-hybrid-cloud-provider

IBM sees the shift of big companies moving to the cloud as a 20-year trend

Quick Fix: vSAN Health Reports iSCSI Target Service Stopped

A few weeks ago I wrote about using iSCSI as a backup repository target. While still running this POC in my environment I came across an error in the vSAN Health Checker stating the vSAN iSCSI target service was in a Failed state. Drilling down into the vSAN Health check tree I could see a Service Runtime status of stopped as shown below against the host.

This host had recently been marked as unreachable in vCenter and required a Management Agent reset to bring it back online. There is a chance that that process stopped the iSCSI Target service but did not start it. In any case there is an easy way to see the status of the services and then get them back online.
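
As a minimal sketch, and assuming the vSAN iSCSI target service is exposed on the host as the vitd daemon (verify the service name on your build first), the status can be checked and the service started again from an SSH session to the host:

    # Check the runtime status of the vSAN iSCSI target daemon (assumed name: vitd)
    /etc/init.d/vitd status

    # Start the service if it is reported as stopped
    /etc/init.d/vitd start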

Once that’s been done, a re-run of the vSAN Health checker will show that the issue has been resolved and the iSCSI Target Service on the host is now running.

References:

https://kb.vmware.com/s/article/2147603

Released: vSAN 6.7 – HTML5 Goodness, Enhanced Health Checks and More!

VMware has announced the general availability of vSAN 6.7. As vSAN continues to grow, VMware are very buoyant about how it’s performing in the market. With some 10,000 customers and a run rate of over $600 million, they claim to lead the hyperconverged market with a 32% market share. From my point of view it’s great to see vSAN being deployed across 250 cloud providers and to have it as the cornerstone storage of the VMware Cloud on AWS solution. vSAN 6.7 focuses on an intuitive operational experience, a consistent application experience and a holistic support experience.

New Features and Enhancements:

  • HTML5 User Interface
  • Embedded vROPs plugin for HTML5 User Interface
  • Support for Windows Failover Cluster using iSCSI
  • Adaptive Resync Performance Improvements
  • Destaging Performance Improvements
  • More Efficient data placement during Host Decommissioning
  • Improved Space Efficiency
  • Faster Failover with Redundant vSAN Networks
  • Optimized Witness Traffic Separation
  • Stretched Cluster Improvements
  • Host Affinity for Next-Gen Applications
  • Health Check Enhancements
  • Enhanced Diagnostics
  • vSAN Support Insight
  • 4Kn Device Support
  • Improved FIPS 140-2 Validation Security

There are a lot of enhancements in this release and while not as groundbreaking as the 6.6 release last year, there is still a lot to like about how VMware is improving the platform. From the list above, I’ve taken the key ones from my point of view and expanded on them a little.

HTML5 User Interface:

As has been the trend with all VMware products of late, vSAN is getting the Clarity Framework overhaul and is being included in the HTML5 vSphere Web Client with new vSAN tasks and workflows developed from the ground up to simplify the experience. There is also new vSAN functionality that can only be accessed via the HTML5 client.

The legacy Flex client will still be available for use, and it’s also worth noting that this is not a direct port of the Flex interface but rather has been built from the ground up. This has resulted in a more efficient experience for the user with fewer clicks and less time to action items. Any new features or enhancements will only be seen in the new HTML5 UI.

Support for Windows Failover Cluster using iSCSI:

A few weeks back I posted about how you could use vSAN as a Veeam repository using the iSCSI feature. With vSAN 6.7 there is official support for Windows Failover Clustering using the vSAN iSCSI service. Lots of people still run MSCS and a lot still use traditional clustering. This supports physical and virtual guest iSCSI initiators and includes transparent failover of clusters with vSAN iSCSI volumes.

I’m not sure if this now means that iSCSI volumes are supported as Veeam Cloud Repositories…but I will confirm either way.

Adaptive Resync Performance Improvements:

vSAN 6.7 introduces a new Adaptive Resync feature that will make sure resources are available for both VM IO and resync IO. This ensures that under IO stress certain traffic types are not starved of resources, and allows more bandwidth to be used when there are periods of less contention. Under contention, resync IO will be guaranteed at least 20% of the bandwidth, and if no resync traffic exists, VM IO may consume 100%. This effectively regulates reads and writes to ensure an optimal balance between VM and resync IO.

Destaging Performance Improvements:

vSAN 6.7 looks to be more consistent when talking about data optimizations in the data path. With faster destaging, data drains more quickly from the write buffer to the capacity tier, which allows the buffer tier to be available for new IO sooner. This is done via improved in-memory handling of IO during destaging that delivers higher throughput and more consistency, which in turn improves the overall performance of VM and resync IO.

More Efficient data placement during Host Decommissioning:

When putting a host in maintenance mode or decommissioning a host, you need to select the evacuation type for the objects on that host. This can take time depending on the amount of data. vSAN 6.7 builds on improvements introduced in 6.6 that consolidate replicas living across multiple hosts while maintaining FTT compliance. It looks for the smallest component to move, which results in less data being rebuilt and less temporary space usage. vSAN will provide more intelligence behind the data movement to reduce the time and effort it takes to put a host into maintenance mode.

Improved Space Efficiency:

In previous vSAN versions the VM swap object was always thick provisioned, even if the VM itself was thin. In vSAN 6.7 the swap object will now be thin by default and will also inherit the policy from the VM, so that the FTT of the swap object is consistent with the VM, resulting in more efficient storage. Prior to this, large environments would suffer from a large number of swap files taking up a disproportionate amount of space.

Conclusion:

vSAN continues to be improved by VMware and they have addressed some core usability and efficiency features in this 6.7 release. The move to the HTML5 web client was expected, but is still good to see, while the enhancements in resync and destaging all contribute to platform stability. The enhanced health checks add a new dimension to vSAN troubleshooting, and the support insight allows users to get a better view of what’s happening on their instances.

References:

Pre release information and images sourced via VMware EABP

https://blogs.vmware.com/virtualblocks/2018/04/17/whats-new-vmware-vsan-6-7/

The One Problem with the VCSA

Over the past couple of months I noticed a trend in my top blog daily reporting…the Quick Fix post on fixing a 503 Service Unavailable error was constantly in the top 5 and getting significant views. The 503 error in various forms has been around since the early days of the VCSA and usually manifests itself with the following.

503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x0000559b1531ef80] _serverNamespace = / action = Allow _pipeName =/var/run/vmware/vpxd-webserver-pipe)

Looking at the traffic stats for that post it’s clear to see an upward trend in the page views since about the end of June.

This to me is both a good and bad thing. It tells me that more people are deploying or migrating to the VCSA which is what VMware want…but it also tells me that more people are running into this 503 error and looking for ways to fix it online.

The Very Good:

The vCenter Server Appliance is a brilliant initiative from VMware and there has been a huge effort in developing the platform over the past three to four years to get it to a point where it not only became equal to vCenter deployed on Windows (and relying on MSSQL) but surpassed it in a lot of features, especially in the vSphere 6.5 release. Most VMware shops are planning to or have migrated from Windows to the VCSA, and for VMware labs it’s a no brainer for both corporate and homelab instances.

Personally I’ve been running VCSAs in my various labs since the 5.5 release, have deployed key management clusters with the VCSA and more recently have proven that even the most mature Windows vCenter can be upgraded with the excellent migration tool. Being free of Windows and more importantly MSSQL is a huge factor in why the VCSA is an important consideration, and the fact you get extra goodies like HA and API UIs adds to its value.

The One Bad:

Everyone who has dealt with storage issues knows that they can lead to Guest OS file system errors. I’ve been involved with shared hosting storage platforms all my career, so I know how fickle filesystems can be when faced with storage latency or loss of connectivity. Reading through the many forums and blog posts around the 503 error there seems to be a common denominator of something going wrong with the underlying storage before a reboot triggers the 503 error. Clicking here will show the Google results for VCSA + 503 where you can read the various posts mentioned above.

As you may or may not know, the 6.5 VCSA has twelve VMDKs, up from 2 in the initial release and 11 in the 6.0 release. There are a couple of great posts from William Lam and Mohammed Raffic that go through what each disk partition does. The big advantage in having these separate partitions is that you can manage storage space a lot more granularly.

The problem as mentioned is that the underlying Linux file system is susceptible to storage issues. No matter what storage platform you are running, you are guaranteed to have issues at one point or another. In my experience Linux filesystems don’t deal well with those issues. Windows file systems seem to tolerate storage issues much better than their Linux counterparts, and without starting a religious war I do know about the various tweaks that can be done to help make Linux filesystems more resilient to underlying storage issues.

With that in mind, the VCSA is very much susceptible to those same storage issues, and I believe a lot of people are running into problems mainly triggered by storage related events. Most of the symptoms of the 503 relate back to key vCenter services being unable to start after a reboot. This usually requires some intervention to fix or a recovery of the VCSA from backup, but hopefully all that’s needed is to run an e2fsck against the filesystem(s) impacted.
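
As a rough sketch only (the device name below is hypothetical and will differ per appliance), that check would be run against the affected filesystem while it is unmounted, for example from a rescue or single-user boot:

    # Map mount points to devices first to identify the impacted filesystem
    df -h

    # Hypothetical device; only run e2fsck against an unmounted filesystem
    e2fsck -y /dev/sda3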

The Solution:

VMware are putting a lot of faith into the VCSA and have done a tremendous job developing it up to this point. It is the only option moving forward for VMware based platforms, however there needs to be a little more work done on the resiliency of the services to protect against external issues that can impact the guest OS. PhotonOS is now the OS of choice from 6.5 onwards, but that will not stop the legacy of susceptibility that comes with Linux based filesystems leading to issues such as the 503 error. If VMware can protect key services in the event of storage issues, that will go a long way to improving that resiliency.

I believe it will get better, and just this week VMware announced a monthly security patch program for the VCSA, which shows that they are serious (not to say they were not before) about ensuring the appliance is protected. But I’m sure many would agree that it needs to offer reliability as well…this is the one area where the Windows based vCenter still has an advantage.

With all that said, make sure you are doing everything possible to have the VCSA housed on storage that is as reliable as possible, and make sure that you are not only backing up the VCSA and its external dependencies correctly but also understand how to restore the appliance, including an understanding of the inbuilt mechanisms for backing up the config and the PostgreSQL database.

I love and would certainly recommend the VCSA…I just want to love it a little more without having to deal with the possibility of the 503 error lurking around every storage event.

References:

http://www.vmwarearena.com/understanding-vcsa-6-5-vmdk-partitions-mount-points/

http://www.virtuallyghetto.com/2016/11/updates-to-vmdk-partitions-disk-resizing-in-vcsa-6-5.html

https://www.veeam.com/wp-vmware-vcenter-server-appliance-backup-restore.html

https://kb.vmware.com/kb/2091961

https://kb.vmware.com/kb/2147154

vSphere 6.5 Update 1 – What’s in it for Service Providers

Late last week VMware released vSphere 6.5 Update 1, which included updated builds of both vCenter and ESXi, and as per usual I will go through some of the key features and fixes that are included in the latest versions of vCenter and ESXi. When looking through the release notes I generally keep an eye out for improvements that relate back to Service Providers who use vSphere as the foundation of their Managed or Infrastructure as a Service offerings. This update also contains an update to vSAN, which is now at 6.6.1, so I’ll spend some time looking at what’s been added there.

New Features and Enhancements:

Without question this is a significant patch release for vCenter and ESXi and the length of the release notes is testament to that point. In terms of new features there isn’t anything groundbreaking, but there are a few nice additions, like being able to run the VCSA GUI and CLI installers on Windows 2012, 2012 R2 and 2016 as well as macOS Sierra, while Ubuntu 17.04 is now supported for Guest OS Customization. vCenter now supports Microsoft SQL Server 2014 SP2 and SQL Server 2016 SP1, as well as some increased configuration maximums supporting Linked Mode with 15 vCenter instances, 5000 ESXi hosts and 50,000 powered-on virtual machines.

Ability to Upgrade or Migrate from vCenter 6.0 Update 3:

This release addresses the previous limitation in the upgrade and migration path for those running vSphere 6.0 Update 3 wanting to go to vSphere 6.5. I know this will make a lot of providers happy, as I know a lot that had to go to 6.0 Update 3 to address existing bugs in the platform but were not yet ready or able to go to 6.5 at the time.

HTML5 Client Update:

The HTML5 Web Client has gotten its own update that brings it up to speed with the 3.15 Fling version, however it’s still only partially functional, which remains somewhat frustrating…The online documentation for supported functionality has been updated to vSphere 6.5 U1 and is available here.

The list below is of the main updates in this release.

  • DRS/HA VM overrides
  • SDRS rules
  • Content Library – further actions
  • Roles and Global Permissions
  • Download multiple files as zip
  • Distributed Switch – further actions
  • Fault Tolerance
  • SPBM
  • VM Hardware – further items
  • Apply Customize Guest OS during Clone
  • VM Migration – further actions (compute+storage, Cross VC, batch)
vSAN Features:

For service providers, vSAN 6.6 was another major release that shored up vSAN’s status as a serious storage platform for service provider platforms.

vSAN 6.6.1 introduces three key new features:

  • VMware vSphere Update Manager (VUM) integration
  • Performance Diagnostics in vSAN Cloud Analytics
  • Storage Device Serviceability enhancement

The ability to upgrade with VUM is a nice touch and continues to improve on the usability and manageability of vSAN. For a full look at what’s new in this release for vSAN 6.6.1 head to this blog post.

Resolved Issues:

There are a bunch of resolved issues in this release and I’ve gone through the rather extensive list to pull out the biggest fixes that relate to my experience in service provider operations, and have also extended this to include fixes that relate to backup operations. The majority of what I picked out relates to storage, networking, hosts and VM operations…the core of any platform, but even more important in the service provider world. The ones in red are specific fixes that relate to issues that I’ve come across…good to see them addressed!

vCenter:
  • First-boot failure occurs when upgrading from vSphere 5.5 or 6.0 to vSphere 6.5 on Windows: If an older version of the OpenSSL DLLs is installed, upgrading to vSphere 6.5 fails because the older DLL versions are loaded.
  • Affinity rules configured on vCenter Server 5.5 can cause crashes after upgrading to vCenter Server 6.5: Migrating a VM with affinity rules configured while on vCenter Server 5.5 to a cluster that has affinity rules configured on vCenter Server 6.0 or 6.5 can cause vCenter Server to crash.
  • VM Snapshot Size (GB) alarm is not triggered after the VM is powered on: The VM Snapshot Size (GB) alarm is reset if the virtual machine is shut down, and the alarm fails to trigger after the VM is powered on. This issue occurs in alarms based on VM Snapshot (GB) and VM Total Size on Disk because their status is altered when the power state of the VM is changed, even though disk usage of a VM is the same regardless of the VM power state.
  • When you add ports to a vSphere Distributed Switch you get an error: Because of a race condition, when you add ports to a vSphere Distributed Switch you get the error message: Cannot create a new port because number of ports exceeds 2147483647, maximum number of ports allowed on vDS.
  • A runtime exception “Unable to retrieve data about the distributed switch” might occur while upgrading a vSphere Distributed Switch (vDS) from 5.0 to 6.5: When you try to upgrade an existing distributed switch after the vCenter upgrade is completed, the runtime exception Unable to retrieve data about the distributed switch might occur in the wizard and the distributed switch cannot be upgraded. The exception is a result of an unexpected NULL value for a LACP property of the distributed switch, instead of TRUE or FALSE, as LACP is not supported for the current version of vSphere Distributed Switch.
  • Host configuration might not be available after vCenter Server restarts: After a vCenter Server restart, the host configuration might not be available if vCenter Server cannot communicate with the host. After connectivity is restored, the configuration becomes available.
  • OVF tool fails to upload OVF or OVA files larger than 10 GB: If you use the OVF tool to upload OVF or OVA files larger than 10 GB, the upload might fail.

ESXi:

  • Virtual machine crashes on ESXi 6.5 when multiple users log on to a Windows Terminal Server VM: A Windows 2012 terminal server running VMware Tools 10.1.0 on ESXi 6.5 stops responding when many users are logged in. vmware.log will show messages similar to:
    2017-03-02T02:03:24.921Z| vmx| I125: GuestRpc: Too many RPCI vsocket channels opened.
    2017-03-02T02:03:24.921Z| vmx| E105: PANIC: ASSERT bora/lib/asyncsocket/asyncsocket.c:5217
    2017-03-02T02:03:28.920Z| vmx| W115: A core file is available in "/vmfs/volumes/515c94fa-d9ff4c34-ecd3-001b210c52a3/h8-
    ubuntu12.04x64/vmx-debug-zdump.001"
    2017-03-02T02:03:28.921Z| mks| W115: Panic in progress... ungrabbing 
  • An ESXi host might fail with a purple diagnostic screen when collecting performance snapshots: This can occur when collecting performance snapshots with vm-support, due to calls for memory access after the data structure has already been freed.
  • Full duplex configured on a physical switch may cause a duplex mismatch issue with the igb native Linux driver supporting only auto-negotiate mode for NIC speed/duplex settings: If you are using the igb native driver on an ESXi host, it always works in auto-negotiate speed and duplex mode. No matter what configuration you set up on this end of the connection, it is not applied on the ESXi side. The auto-negotiate support causes a duplex mismatch issue if a physical switch is set manually to full-duplex mode.
  • An ESXi host might fail with a purple screen and a Spin count exceeded (refCount) – possible deadlock with PCPU error: This can happen when you reboot the ESXi host under the following conditions:
    • You use the vSphere Network Appliance (DVFilter) in an NSX environment
    • You migrate a virtual machine with vMotion under DVFilter control
  • A Virtual Machine (VM) with an e1000/e1000e vNIC might have network connectivity issues: For a VM with an e1000/e1000e vNIC, when the e1000/e1000e driver tells the e1000/e1000e vmkernel emulation to skip a descriptor (the transmit descriptor address and length are 0), a loss of network connectivity might occur.
  • An ESXi host might stop responding when you migrate a virtual machine with Storage vMotion between ESXi 6.0 and ESXi 6.5 hosts: The vmxnet3 device tries to access the memory of the guest OS while the guest memory preallocation is in progress during the migration, resulting in an invalid memory access and the ESXi 6.5 host failure.
  • Modification of the IOPS limit of virtual disks with Changed Block Tracking (CBT) enabled fails with errors in the log files: To define the storage I/O scheduling policy for a virtual machine, you can configure the I/O throughput for each virtual machine disk by modifying the IOPS limit. When you edit the IOPS limit and CBT is enabled for the virtual machine, the operation fails with the error The scheduling parameter change failed. Due to this problem, the scheduling policies of the virtual machine cannot be altered. The error message appears in the vSphere Recent Tasks pane, and you can see the following errors in the /var/log/vmkernel.log file:
    2016-11-30T21:01:56.788Z cpu0:136101)VSCSI: 273: handle 8194(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000
    2016-11-30T21:01:56.788Z cpu0:136101)ScsiSched: 2760: Invalid Bandwidth Cap Configuration
    2016-11-30T21:01:56.788Z cpu0:136101)WARNING: VSCSI: 337: handle 8194(vscsi0:0):Failed to invert policy
  • When you hot-add an existing or new virtual disk to a CBT (Changed Block Tracking) enabled virtual machine (VM) residing on a VVOL datastore, the guest operating system might stop responding: The guest OS might stop responding until the hot-add process completes. The VM unresponsiveness depends on the size of the virtual disk being added, and the VM automatically recovers once the hot-add completes.
  • When you use vSphere Storage vMotion, the UUID of a virtual disk might change: When you use vSphere Storage vMotion on vSphere Virtual Volumes storage, the UUID of a virtual disk might change. The UUID identifies the virtual disk, and a changed UUID makes the virtual disk appear as a new and different disk. The UUID is also visible to the guest OS and might cause drives to be misidentified.
  • An ESXi host might become unresponsive if the VMFS-6 volume has no space for the journal: When opening a VMFS-6 volume, it allocates a journal block. Upon successful allocation, a background thread is started. If there is no space on the volume for the journal, it is opened in read-only mode and no background thread is initiated. Any attempt to close the volume results in attempts to wake up a nonexistent thread, which results in the ESXi host failure.
  • SSD congestion might cause multiple virtual machines to become unresponsive: Depending on the workload and the number of virtual machines, diskgroups on the host might go into permanent device loss (PDL) state. This causes the diskgroups to not admit further IOs, rendering them unusable until manual intervention is performed.
  • Unable to collect a vm-support bundle from an ESXi 6.5 host: When generating logs in ESXi 6.5 by using the vSphere Web Client, the select specific logs to export text box is blank. The options for network, storage, fault tolerance, hardware etc. are blank as well. This issue occurs because the rhttpproxy port for /cgi-bin has a value different from 8303.
  • vSphere Storage vMotion might fail with an error message if it takes more than 5 minutes: The destination virtual machine of the vSphere Storage vMotion is incorrectly stopped by a periodic configuration validation for the virtual machine. A vSphere Storage vMotion that takes more than 5 minutes fails with the The source detected that the destination failed to resume message.
    The VMkernel log from the ESXi host contains the message D: Migration cleanup initiated, the VMX has exited unexpectedly. Check the VMX log for more details.

vSAN:

  • Hosts in a vSAN cluster have high congestion which leads to host disconnects: When vSAN components with invalid metadata are encountered while an ESXi host is booting, a leak of reference counts to SSD blocks can occur. If these components are removed by policy change, disk decommission, or other method, the leaked reference counts cause the next I/O to the SSD block to get stuck. The log files can build up, which causes high congestion and host disconnects.
  • vSAN cluster becomes partitioned after the member hosts and vCenter Server reboot: If the hosts in a unicast vSAN cluster and the vCenter Server are rebooted at the same time, the cluster might become partitioned. The vCenter Server does not properly handle unstable vpxd property updates during a simultaneous reboot of hosts and vCenter Server.
  • Large File System overhead reported by the vSAN capacity monitor: When deduplication and compression are enabled on a vSAN cluster, the Used Capacity Breakdown (Monitor > vSAN > Capacity) incorrectly displays the percentage of storage capacity used for file system overhead. This number does not reflect the actual capacity being used for file system activities. The display needs to correctly reflect the File System overhead for a vSAN cluster with deduplication and compression enabled.

It’s also worth reading through the Known Issues section as there is a fair bit to be aware of in Update 1, along with issues that remain from the GA release.

Happy upgrading!

References:

https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-esxi-651-release-notes.html

https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-vcenter-server-651-release-notes.html

https://blogs.vmware.com/vsphere/2017/07/second-vsphere-client-html5-update-in-vsphere-6-5u1.html

https://blogs.vmware.com/virtualblocks/2017/07/27/introducing-hci-powered-by-vsan-6-6-1/
