Tag Archives: Storage

The One Problem with the VCSA

Over the past couple of months I noticed a trend in my blog's daily reporting…the quick fix post on fixing a 503 Service Unavailable error was constantly in the top five and getting significant views. The 503 error in various forms has been around since the early days of the VCSA and usually manifests itself with the following.

503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x0000559b1531ef80] _serverNamespace = / action = Allow _pipeName =/var/run/vmware/vpxd-webserver-pipe)

Looking at the traffic stats for that post it’s clear to see an upward trend in the page views since about the end of June.

This to me is both a good and a bad thing. It tells me that more people are deploying or migrating to the VCSA, which is what VMware want…but it also tells me that more people are running into this 503 error and looking for ways to fix it online.

The Very Good:

The vCenter Server Appliance is a brilliant initiative from VMware and there has been a huge effort in developing the platform over the past three to four years to get it to a point where it not only became equal to vCenters deployed on Windows (and relying on MSSQL) but surpassed them in a lot of areas, especially in the vSphere 6.5 release. Most VMware shops are planning to or have migrated from Windows to the VCSA, and for VMware labs it's a no brainer for both corporate and homelab instances.

Personally I've been running VCSAs in my various labs since the 5.5 release, have deployed key management clusters with the VCSA and more recently have proven that even the most mature Windows vCenter can be upgraded with the excellent migration tool. Being free of Windows and more importantly MSSQL is a huge factor in why the VCSA is an important consideration, and the fact you get extra goodies like HA and API UIs adds to its value.

The One Bad:

Everyone who has dealt with storage issues knows that they can lead to Guest OS file system errors. I've been involved with shared hosting storage platforms all my career so I know how fickle filesystems can be when faced with storage latency or loss of connectivity. Reading through the many forums and blog posts around the 503 error there seems to be a common denominator of something going wrong with the underlying storage before a reboot triggers the 503 error. A search for VCSA + 503 will surface the various posts mentioned above.

As you may or may not know, the 6.5 VCSA has twelve VMDKs, up from two in the initial release and eleven in the 6.0 release. There are a couple of great posts from William Lam and Mohammed Raffic that go through what each disk partition does. The big advantage in having these separate partitions is that you can manage storage space a lot more granularly.

The problem as mentioned is that the underlying Linux file system is susceptible to storage issues. No matter what storage platform you are running, you are guaranteed to have issues at one point or another, and in my experience Linux filesystems don't deal well with those issues. Windows file systems seem to tolerate storage issues much better than their Linux counterparts, and without starting a religious war I do know about the various tweaks that can be done to help make Linux filesystems more resilient to underlying storage issues.

With that in mind, the VCSA is very much susceptible to those same storage issues and I believe a lot of people are running into problems mainly triggered by storage related events. Most of the symptoms of the 503 relate back to key vCenter services being unable to start after a reboot. This usually requires some intervention to fix or a recovery of the VCSA from backup, but hopefully all that's needed is to run an e2fsck against the filesystem(s) impacted.
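
If you do find yourself there, a minimal sketch of that check from the VCSA shell looks something like this (the device name is illustrative; use the filesystem reported in the boot errors):

    # From the VCSA console or emergency shell; /dev/sda3 is illustrative,
    # use the device reported in the boot errors (or check with lsblk)
    e2fsck -y /dev/sda3
    reboot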

The Solution:

VMware are putting a lot of faith into the VCSA and have done a tremendous job developing it to this point. It is the only option moving forward for VMware based platforms, however there needs to be a little more work done on the resiliency of the services to protect against external issues that can impact the guest OS. PhotonOS is the OS of choice from 6.5 onwards, but that alone will not stop the legacy of susceptibility that comes with Linux based filesystems leading to issues such as the 503 error. If VMware can protect key services in the event of storage issues that will go a long way to improving that resiliency.

I believe it will get better, and just this week VMware announced a monthly security patch program for the VCSA which shows that they are serious (not to say they were not before) about ensuring the appliance is protected…but I'm sure many would agree that it needs to offer reliability as well. This is the one area where the Windows based vCenter still has an advantage.

With all that said, make sure you are doing everything possible to have the VCSA housed on the most reliable storage possible, and make sure that you are not only backing up the VCSA and its external dependencies correctly but also understand how to restore the appliance, including the inbuilt mechanisms for backing up the config and the vPostgres database.
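
On the vPostgres side, KB 2091961 (referenced below) supplies backup and restore scripts; taking a database backup from the appliance shell looks roughly like this, assuming the KB's backup_lin.py has been copied to /tmp:

    # Back up the embedded vPostgres database using the script from KB 2091961
    python /tmp/backup_lin.py -f /tmp/backup_VCDB.bak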

I love and would certainly recommend the VCSA…I just want to love it a little more without having to deal with the possibility of the 503 error lurking around every storage event.

References:

http://www.vmwarearena.com/understanding-vcsa-6-5-vmdk-partitions-mount-points/

http://www.virtuallyghetto.com/2016/11/updates-to-vmdk-partitions-disk-resizing-in-vcsa-6-5.html

https://www.veeam.com/wp-vmware-vcenter-server-appliance-backup-restore.html

https://kb.vmware.com/kb/2091961

https://kb.vmware.com/kb/2147154

ESXI 6.5 Storage Performance Issues Resolved in Update 1

I originally came across the issue of slow storage performance with the native vmw_ahci driver that comes bundled with ESXi 6.5 just as I was first playing with my SuperMicro SYS-5028D-TN4T in my homelab. After I published a couple of posts about the workaround, the issue became quite prevalent in the community and the post continues to get decent traffic, meaning the issue impacted quite a few people out there.

The good news is that with the release of vSphere 6.5 Update 1 there is a fix for the problem in the form of updated drivers for the AHCI module. William Lam has been quick to blog about the fix and if you had previously disabled the driver you will need to re-enable it.
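
Re-enabling it is a single esxcli call followed by a host reboot:

    # Re-enable the native AHCI driver now that Update 1 includes the fixed version
    esxcli system module set --enabled=true --module=vmw_ahci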

This VMwareKB covers the specific patch as listed in the release notes:

There's no confirmation as of yet that it actually does the trick, but the release notes look promising, and the assumption is that it will resolve the issues so that homelabbers and those using the driver in production systems can rest easy.

References:

https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-esxi-651-release-notes.html

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2149910

http://www.virtuallyghetto.com/2017/07/ahci-vmw_ahci-performance-issue-resolved-in-esxi-6-5-update-1.html

vSAN 6.6 – What’s In It For Service Providers

Last February when VMware released VSAN 6.2 I stated that "Things had gotten Interesting" with regards to the 6.2 release of vSAN finally marking its arrival as a serious player in the Hyper-converged Infrastructure (HCI) market; vSAN was ready to be taken very seriously by VMware's competitors. Fast forward fourteen months and, apart from the fact we have confirmed the v in vSAN is lower case with the product name officially changing from Virtual SAN to vSAN…version 6.6 was announced last week and is set to GA today, and with it comes the biggest list of new features and enhancements in vSAN's history.

VMware has decided to break with the normal vSphere release cycle for vSAN and move to patch releases of vSphere that are actually major updates of vSAN. This is why this release is labeled vSAN 6.6 and will be included in the vSphere 6.5EP2 build. The move allows the vSAN team to continue to enhance the platform outside of the core vSphere platform and I believe it will deliver at least two update releases per year.

Looking at the new features and enhancements of the vSAN 6.6 release it's clear to see that the platform has matured, and given the 7000+ strong customer base it's also clear that it's being accepted more and more for critical workloads. From a service provider point of view I know of a lot more vCloud Air Network partners that have implemented vSAN as not only their management HCI platform, but also now their customer HCI compute and storage platforms.

A lot for Service Providers to like:

As shown in the feature timeline above there are more than 20 new features and enhancements, but for me the following are most relevant to vCAN Service Providers who are using, or looking to use, vSAN in their offerings. I will expand on the ones I see as being the most significant of the new features and enhancements for service providers.

  • Native encryption for data-at-rest
  • Compliance certifications
  • vSAN Proactive Drive HA for failing drives
  • Resilient management independent of vCenter
  • Rapid recovery with smart, efficient rebuilds
  • Certified file service & data protection solutions
  • Enhanced vSAN SDK and PowerCLI
  • Simple networking with Unicast
  • vSAN Cloud Analytics for performance
  • vSAN Cloud Analytics with real-time support notification and recommendations*
  • vSAN Config Assist with 1-click hardware lifecycle management
  • Extended Health Services
  • Up to 50% greater IOPS for all-flash with optimized checksum and dedupe
  • Optimized for latest flash technologies
  • Expanded caching tier choice
  • New Docker Volume Driver

Simple networking with Unicast:

As John Nicholson wrote on the Virtual Blocks blog…it's time to say goodbye to the multicast requirements around vSAN networking traffic. For a history as to why multicast was used, click here. It's also worth reading John's post where he goes through the upgrade process: if you are upgrading from a previous version, multicast will still be used unless you make the change, as specified here.

I can attest first hand to the added complexity when it comes to setting up vSAN with multicast, having gone through a couple of painful deployments where the multicast configuration was an issue during initial setup and also caused issues with switching infrastructure that needed to be upgraded before vSAN could work reliably. In my mind unicast offers a simpler solution with minimal overhead and makes vSAN more transportable across networks.
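
Once a cluster is fully upgraded to 6.6 you should be able to confirm unicast is in play from any host:

    # Each cluster member should be listed as a unicast agent
    esxcli vsan cluster unicastagent list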

Performance Improvements:

Service Providers are always trying to squeeze the most out of their hardware purchases, and with VMware claiming 50% greater IOPS for all-flash through optimized data services that can in theory enable 150K IOPS per host, it appears they will be served well. This comes in addition to optimized checksum and dedupe along with support for the latest flash technologies. The increased performance helps accelerate tenant workloads and provides higher consolidation ratios for those workloads.

Service providers can also adopt the latest flash technologies, including the new breed of NVMe SSDs, with solutions that can deliver up to 250% greater performance for write-intensive applications. vSAN 6.6 now offers larger caching drive options, including 1.6TB flash drives, so that service providers can take advantage of larger capacity flash drives.

Disk Performance Enhancements:

For those that have gone through a vSAN rebuild operation you would know that it can be a long exercise depending on the amount of data and the configuration of the vSAN datastore. vSAN 6.6 introduces a new smart rebuild and rebalancing feature along with partial repairs of degraded or absent components. There is also resync throttling and improved visibility into rebuild status through the Health Status. Cormac Hogan goes through the improvements in detail here.

From a Service Provider point of view, having these enhanced rebuild features is critical to continued quality of service for IaaS customers who live on shared vSAN storage. Shorter and more efficient rebuild times mean less impact to customers.

vSAN Encryption:

VMware has introduced native VM encryption at the vSAN datastore level. This can be enabled per vSAN cluster and works with deduplication and compression across hybrid and all-flash cluster configurations. vSAN 6.6 data encryption is hardware agnostic; there is no requirement to use specialized and more expensive Self-Encrypting Drives (SEDs), which is also a bonus. Jase McCarty has another Virtual Blocks article here that goes through this feature in great detail.

From a Service Provider point of view you can now potentially offer two classes of vSAN backed storage for IaaS customers: one that lives on an encryption-enabled cluster and is charged at a premium over non-encrypted clusters. In talking with service providers across the globe, data at rest encryption has become something that potential customers are asking for, and most leading storage companies have an encryption story…now so does vSAN, and it appears to be market leading.

vSAN 6.6 Licensing:

In terms of the licensing matrix, nothing too drastic has changed except for the addition of Data at Rest Encryption in the Enterprise bundle. However, in a significant move for vCAN Service Providers, QoS IOPS Limiting has been extended across all license types and can now be taken advantage of across the board. This is good for Service Providers who look to offer different tiers of storage performance based on IOPS limits…previously it was only available under Enterprise licensing.
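
As a rough PowerCLI sketch of what an IOPS-limited tier might look like (the VSAN.iopsLimit capability id, policy name and value are my assumptions for illustration):

    # Build a storage policy that caps vSAN objects at 2500 IOPS (names/values illustrative)
    $iops = Get-SpbmCapability -Name "VSAN.iopsLimit"
    $rule = New-SpbmRule -Capability $iops -Value 2500
    $ruleSet = New-SpbmRuleSet -AllOfRules $rule
    New-SpbmStoragePolicy -Name "vSAN-Silver-2500IOPS" -AnyOfRuleSets $ruleSet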

Bootstrapping UI:

A bonus feature that I think will assist vCAN Service Providers is the new native bootstrap installer in vSAN 6.6. William Lam has written about the feature here, but for those looking to install their first vSAN node without vSphere available, the ability to bootstrap is invaluable. The old manual process is still worth looking at as it's always beneficial to know what's going on in the background, but it's all GUI based now via the VCSA installer.

Conclusion:

vSAN 6.6 appears to be a great step forward for VMware and Service Providers will no doubt be keen to upgrade as soon as possible to take advantage of the features and enhancements that have been delivered in this 6.6 release.

References:

http://cormachogan.com/2017/04/11/whats-new-vsan-6-6/ 

https://storagehub.vmware.com/#!/vmware-vsan/vmware-vsan-6-5-technical-overview

http://vsphere-land.com/news/an-overview-of-whats-new-in-vmware-vsan-6-6.html

https://storagehub.vmware.com/#!/vmware-vsan/vsan-multicast-removal/multicast-removal-steps-and-requirements/1

vSAN 6.6 Encryption Configuration

vSAN 6.6 – Native Data-at-Rest Encryption

Goodbye Multicast

Native VCSA bootstrap installer in vSAN 6.6

ESXi 6.5 Storage Performance Issues and Fix

[NOTE] : I decided to republish this post with a new heading and skip right to the meat of the issue as I’ve had a lot of people reach out saying that the post helped them with their performance issues on ESXi 6.5. Hopefully people can find the content easier and have a fix in place sooner.

The issue that I came across was to do with storage performance and the native driver that comes bundled with ESXi 6.5. With the release of vSphere 6.5 yesterday, the timing was perfect to install ESXi 6.5 and start to build my management VMs. I first noticed some issues when uploading the Windows 2016 ISO to the datastore, with the ISO taking about 30 minutes to upload. From there I created a new VM and installed Windows…this took about two hours to complete, which was not what I expected…especially with the datastore being a decent class SSD.

I created a new VM and kicked off a new install, but this time I opened ESXTOP to see what was going on, and as you can see from the screenshots below, the kernel and disk write latencies were off the charts, topping 2000ms and 700-1000ms respectively…In throughput terms I was getting about 10-20MB/s when I should have been getting 400-500MB/s.

ESXTOP was showing the VM with even worse write latency.

I wondered if I had bought a lemon of a storage controller and checked the Queue Depth of the card. It's listed with a QD of 31, which isn't horrible for a homelab, so my attention turned to the driver. Referencing the VMware Compatibility Guide, the device driver for the controller is listed as ahci version 3.0.22vmw.

I searched the installed device driver modules and found that the one listed above was present, however there was also a native VMware device driver as well.

I confirmed that the storage controller was using the native VMware driver and went about disabling it as per this VMwareKB (thanks to @fbuechsel who pointed me in the right direction in the vExpert Slack Homelab Channel) as shown below.
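
The gist of that workaround, for reference (check the KB for the current guidance):

    # Confirm which driver each storage adapter is using
    esxcli storage core adapter list
    # Disable the native driver so the legacy ahci driver claims the controller on reboot
    esxcli system module set --enabled=false --module=vmw_ahci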

After the host rebooted I checked to see if the storage controller was using the device driver listed in the compatibility guide. As you can see below not only was it using that driver, but it was now showing the six HBA ports as opposed to just the one seen in the first snippet above.

I once again created a new VM and installed Windows, and this time the install completed in a little under five minutes! Quite a difference! Upon running CrystalDiskMark I was now getting the expected speeds from the SSDs and things are moving along quite nicely.

Hopefully this post saves anyone else who might buy this, or other SuperMicro SuperServers, some time and not get caught out by poor storage performance caused by the native VMware driver packaged with ESXi 6.5.


References:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2044993

Quick Look – vSphere 6.5 Storage Space Reclamation

One of the cool newly enabled features of vSphere 6.5 is the comeback of VMFS storage space reclamation. On VMFS5 datastores this was a manual operation that could be triggered after freeing storage space inside a datastore by deleting or migrating a VM…or consolidating a snapshot. At the Guest OS level, storage space is freed when you delete files on a thinly provisioned VMDK and then exists as dead or stranded space. ESXi 6.5 supports automatic space reclamation (SCSI unmap) originating from a VMFS datastore or a Guest OS…the mechanism reclaims unused space from VM disks that are thin provisioned.

When storage space is deleted without this automated feature the delete operation leaves blocks of unused space on the datastore. VMFS uses the SCSI unmap command to indicate to the array that the storage blocks contain deleted data, so that the array can unallocate these blocks.
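
For comparison, the old manual reclaim on a VMFS5 datastore was a single esxcli call (datastore name illustrative):

    # Manually reclaim dead space on a VMFS5 datastore
    esxcli storage vmfs unmap --volume-label=VMFS5-DS01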

On VMFS6 datastores, ESXi supports automatic asynchronous reclamation of free space. VMFS6 generally supports automatic space reclamation requests that generate from the guest operating systems, and passes these requests to the array. Many guest operating systems can send the unmap command and do not require any additional configuration. The guest operating systems that do not support automatic unmaps might require user intervention.
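
As a guest-side illustration, a modern Windows guest on a thin VMDK can trigger unmap on demand with a retrim (drive letter illustrative):

    # Inside the Windows guest: send TRIM/unmap for free blocks on the volume
    Optimize-Volume -DriveLetter D -ReTrim -Verbose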

I was interested in seeing if this worked as advertised, so I went about formatting a new VMFS6 datastore with the default options via the Web Client as shown below:

Heading over to the host's command line, I checked the reclamation config using the new esxcli namespace:
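
Something along these lines, with the datastore label being illustrative:

    # Check current space reclamation settings for the new VMFS6 datastore
    esxcli storage vmfs reclaim config get --volume-label=VMFS6-DS01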

Through the Web Client you can only set the Reclamation Priority to None or Low, however through esxcli you can also set that value to medium or high. But, as I've literally just found out, these esxcli-only settings don't actually do anything in this release.
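
For reference, setting the priority from esxcli looks like this (again, an illustrative label; as noted, values beyond low don't appear to do anything in this release):

    # Set reclaim priority on the datastore; the Web Client only exposes None/Low
    esxcli storage vmfs reclaim config set --volume-label=VMFS6-DS01 --reclaim-priority=high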

For the low reclaim priority setting, the expectation is that any blocks that are no longer used will be reclaimed within 12 hours of the process kicking off on the datastore. I kept track of a couple of VMs and the datastore sizes in general and saw that after a day or so there was a difference in the available storage.

You can see that I clawed back about 22GB and 14GB on the two datastores in the first 24 hours. So my initial testing shows that this new feature is a valued and welcome addition to the vSphere 6.5 release. I know that Service Providers who thin provision but charge based on allocated storage will benefit greatly from this feature, as it automates a mechanism that was complex at best in previous releases.

There is also a great section around UNMAP in the vSphere 6.5 Core Storage White Paper, which has literally just been released as well and is linked in the references below.

References:

http://pubs.vmware.com/vsphere-65/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-65-storage-guide.pdf

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513

vSphere 6.5 Core Storage White Paper Now Available

HomeLab – SuperMicro 5028D-TNT4 Storage Driver Performance Issues and Fix

OK, I'll admit it…I've had serious lab withdrawals since having to give up the awesome Zettagrid Labs. Having a lab to tinker with goes hand in hand with being able to generate tech related content…case in point: my new homelab got delivered on Monday and I have been working to get things set up so that I can deploy my new NestedESXi lab environment.

The issue that I came across was to do with storage performance and the native driver that comes bundled with ESXi 6.5. With the release of vSphere 6.5 yesterday, the timing was perfect to install ESXi 6.5 and start to build my management VMs. I first noticed some issues when uploading the Windows 2016 ISO to the datastore, with the ISO taking about 30 minutes to upload. From there I created a new VM and installed Windows…this took about two hours to complete, which was not what I expected…especially with the datastore being a decent class SSD.

By way of a quick intro (longer first impression post to follow) I purchased a SuperMicro SYS-5028D-TN4T that I based off this TinkerTry Bundle, which has become a very popular system for vExpert homelabbers. It's got an Intel Xeon D-1541 CPU and I loaded it up with 128GB of RAM. The system comes with an embedded Lynx Point AHCI Controller that allows up to six SATA devices and is listed on the VMware Compatibility Guide for ESXi 6.5.

I created a new VM and kicked off a new install, but this time I opened ESXTOP to see what was going on, and as you can see from the screenshots below, the kernel and disk write latencies were off the charts, topping 2000ms and 700-1000ms respectively…In throughput terms I was getting about 10-20MB/s when I should have been getting 400-500MB/s.

ESXTOP was showing the VM with even worse write latency.

I wondered if I had bought a lemon of a storage controller and checked the Queue Depth of the card. It's listed with a QD of 31, which isn't horrible for a homelab, so my attention turned to the driver. Referencing the VMware Compatibility Guide, the device driver for the controller is listed as ahci version 3.0.22vmw.

I searched the installed device driver modules and found that the one listed above was present, however there was also a native VMware device driver as well.

I confirmed that the storage controller was using the native VMware driver and went about disabling it as per this VMwareKB (thanks to @fbuechsel who pointed me in the right direction in the vExpert Slack Homelab Channel) as shown below.

After the host rebooted I checked to see if the storage controller was using the device driver listed in the compatibility guide. As you can see below, not only was it using that driver, but it was now showing the six HBA ports as opposed to just the one seen in the first snippet above.

I once again created a new VM and installed Windows, and this time the install completed in a little under five minutes! Quite a difference! Upon running CrystalDiskMark I was now getting the expected speeds from the SSDs and things are moving along quite nicely.

Hopefully this post saves anyone else who might buy this, or other SuperMicro SuperServers, some time and not get caught out by poor storage performance caused by the native VMware driver packaged with ESXi 6.5.


References:

http://www.supermicro.com/products/system/midtower/5028/SYS-5028D-TN4T.cfm

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2044993

Free Guide: Building NetApp ONTAP 9 Lab

In computing, there is one thing you shouldn't compromise on…and that thing is storage. This carries over to lab or NestedESXi environments, as poor lab performance can be just as frustrating as production performance issues. I've used a number of nested storage platforms for my lab environments and I'm always on the lookout for alternative solutions.

When Neil Anderson asked me to write a short introductory post on his new how-to guide Build Your Own NetApp ONTAP 9 Lab, I decided to flick through the guide to check it out and see if it could add any value to my future plans for a homelab. The e-book is professionally laid out and has excellent diagrams, notes and step by step instructions…it's extremely comprehensive.

NetApp Simulator 9 Free eBook – Build Your Own NetApp ONTAP 9 Lab!

While I've never been a NetApp guy, there does seem to be a level of complexity in the NetApp VSA setup, but with the step by step instructions in the e-book any ambiguity is removed. If you are looking for lab storage this is a great end to end example of how to install and configure one based on the ONTAP 9 NetApp Simulator.

Give it a look over here.

VSAN Upgrading from 6.1 to 6.2 Hybrid to All Flash – Part 3

When VSAN 6.2 was released earlier this year it came with new and enhanced features, and with the price of SSDs continuing to fall and an expanding HCL it seems like All Flash instances are becoming more the norm. For those that have already deployed VSAN in a Hybrid configuration the temptation to upgrade to All Flash is certainly there. Duncan Epping has previously blogged the overview of migrating from Hybrid to All Flash, so I wanted to expand on that post and go through the process in a little more detail. This is the final part of a three part blog series with the process overview outlined below.

Use the links below to page jump.

In part one I covered upgrading existing hosts, expanding an existing VSAN cluster and upgrading the license and disk format. In part two I covered the actual Hybrid to All Flash migration steps, and in this last part I will finish off by going through the process of creating a new VSAN policy, migrating existing VMs to the new policy and then enabling deduplication and compression.

Before continuing it's worth pointing out that after the Hybrid to All Flash migration you are going to be left with an unbalanced VSAN cluster, as the full data evacuation off the last Hybrid host will leave that host without objects. Any new objects created will work to re-balance the cluster, however if you want to initiate a proactive re-balance you can hit the re-balance button from the Health status window. For more on this process check out this post from Cormac Hogan.

Create new Policy and Migrate VMs:

To take advantage of the erasure coding now available in the VSAN 6.2 All Flash cluster we will need to create a new storage policy and apply that policy to any existing VMs. In my case all VMs were on the Default VSAN Policy with FTT=1. The example below shows the creation of a new storage policy that uses RAID5 erasure coding with FTT=1. If you remember from previous posts, the reason for expanding the cluster to four hosts was to cater for this specific policy.

To create the new storage policy head to VM Storage Policies from the Home page of the Web Client and click on Create New VM Storage Policy. Give the policy a name, click Next and construct Rule-Set 1, which is based on VSAN. Select the Failure tolerance method and choose RAID-5/6 (Erasure Coding) – Capacity.

In this case, with FTT=1 chosen, RAID5 will be used. Clicking on Next should show that the existing VSAN datastore is compatible with the policy. With that done we can migrate existing VMs off the Default VSAN Policy onto the newly created one.

To get a list of which VMs are going to be migrated, have a look at the PowerCLI commands below to get the VMs on the VSAN datastore and then their storage policy. The last command gets a list of existing policies.
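
Something along these lines does the job (the datastore name is from my lab and purely illustrative):

    # VMs that live on the vSAN datastore
    Get-VM -Datastore (Get-Datastore "vsanDatastore")

    # Current storage policy assignment for each of those VMs
    Get-VM -Datastore (Get-Datastore "vsanDatastore") | Get-SpbmEntityConfiguration

    # List all existing storage policies
    Get-SpbmStoragePolicy | Select-Object Name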

To apply the new erasure coding storage policy it's handy to get the full name of the policy.

To migrate the VMs to the new policy you can either do it one by one via the Web Client or do it en masse via the following PowerCLI script.
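
A sketch of such a script, assuming a policy named RAID5-FTT1 (the name is illustrative; use the full policy name retrieved above):

    # Apply the new erasure coding policy to every VM (and its disks) on the vSAN datastore
    $policy = Get-SpbmStoragePolicy -Name "RAID5-FTT1"
    $vms = Get-VM -Datastore (Get-Datastore "vsanDatastore")
    foreach ($vm in $vms) {
        # VM home object first, then each virtual disk
        $vm | Set-SpbmEntityConfiguration -StoragePolicy $policy
        $vm | Get-HardDisk | Set-SpbmEntityConfiguration -StoragePolicy $policy
    }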

Once run, the VMs will have the new policy applied and VSAN will work in the background to get those VM objects compliant. You can see the status of Virtual Disk Placement in the Virtual SAN tab of the cluster's Monitor tab.

Enable DeDupe and Compression:

Before I go into the details…for a brilliant overview and explanation of deduplication and compression with VSAN 6.2 head to this post from Cormac Hogan. To enable this feature we need to double check that the licensing is correct as detailed in the first post and also ensure that all previous steps relating to the Hybrid to All Flash migration have taken place. To turn on this feature head to the General window under the Virtual SAN Settings menu on the cluster Manage tab and click on the Edit button next to Virtual SAN is Turned ON.

Choose Enabled in the drop down and take note of the checkbox that talks about Allow Reduced Redundancy, making sure you understand what that means by reading the info box as shown above. Once you click OK, the process to enable deduplication and compression will begin…this process will go through and reconfigure all disk groups, similar to the process of upgrading between Hybrid and All Flash. Again this will take some time depending on the number of hosts, number of disk groups and type of disks in the cluster.

Below I have shown the before and after of the Capacity window under the Virtual SAN tab in the Monitor section of the cluster view. You can see that before enabling, there is a message saying that Deduplication and Compression is disabled.

And after enabling Deduplication and Compression you start to get statistics relating to savings and ratios in that window. Even in my small lab environment I started to see some benefits.

With that complete, we have finished this series, having gone through all the steps needed to get to an All Flash VSAN cluster with the newest features enabled.

References:

VSAN 6.2 Part 1 – Deduplication and Compression

VSAN 6.2 Part 2 – RAID-5 and RAID-6 configurations

 

VSAN Upgrading From 6.1 To 6.2 Hybrid To All Flash – Part 2

When VSAN 6.2 was released earlier this year it came with new and enhanced features, and with the price of SSDs continuing to fall and an expanding HCL it seems like All Flash instances are becoming more the norm. For those that have already deployed VSAN in a Hybrid configuration the temptation to upgrade to All Flash is certainly there. Duncan Epping has previously blogged the overview of migrating from Hybrid to All Flash, so I wanted to expand on that post and go through the process in a little more detail. This is part two of what is now a three part blog series with the process overview outlined below.

Use the links below to page jump.

In part one I covered upgrading existing hosts, expanding an existing VSAN cluster and upgrading the license and disk format. In this part I am going to go through the simple task of extending the cluster by adding a new All Flash disk group on the host I added in part one, and then go through the actual Hybrid to All Flash migration steps.

The configuration of the VSAN Cluster after the upgrade will be:

  • Four Host Cluster
  • vCenter 6.0.0 Update 2
  • ESXi 6.0.0 Update 2
  • One Disk Group Per Host
  • 1x 480GB SSD Cache and 2x 1000GB SSD Capacity
  • VSAN Erasure Coding Raid 5 FTT=1
  • DeDuplication and Compression On

As mentioned in part one, I added a new host to the cluster in order to give me some breathing room while doing the Hybrid to All Flash upgrade, as we need to perform rolling maintenance on each host in the cluster in order to get to the All Flash configuration. Each host will be entered into maintenance mode and all data evacuated. Before the process is started on the initial three hosts, let's go ahead and create a new All Flash disk group on the new host.

To create the new disk group head to Disk Management under the Virtual SAN section of the Manage tab whilst the cluster is selected, and click on the Create New Disk Group button. As you can see below I have the option of selecting any of the flash devices claimed as being OK for VSAN.

After the disk selection is made and the disk group created, you can see below that there is now a mixed mode scenario happening where the All Flash host is participating in the VSAN Cluster and contributing to the capacity.

Upgrade Disk Group from Hybrid to All Flash:

Ok, now that there is some extra headroom, the process to migrate the existing Hybrid hosts over to All Flash can begin. Essentially the process involves placing the hosts in maintenance mode with a full data migration, deleting any existing Hybrid disk groups, removing the spinning disks, replacing them with flash and then finally creating new All Flash disk groups.

If you are not already aware of how maintenance mode works with VSAN then it's worth reading over this VMware Blog Post to ensure you understand why using the VI Client is a big no-no. In this case I wanted to do a full data migration, which moves all VSAN components onto the remaining hosts active in the cluster.
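
For reference, the same full evacuation can be kicked off from PowerCLI rather than the Web Client (hostname illustrative):

    # Enter maintenance mode with a full vSAN data evacuation
    Set-VMHost -VMHost "esxi-01.lab.local" -State Maintenance -VsanDataMigrationMode Full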

You can track this process by looking at the Resyncing Components section of the Virtual SAN Monitor Tab to see which objects are being copied to other hosts.

As you can see the new host is actively participating in the Hybrid mixed mode cluster now and taking objects.

Once the evacuation has completed we can delete the existing disk groups on the host by highlighting the disk group and clicking on the Remove Disk Group button. A warning appears telling us that data will be deleted and also lets us know how much data is currently on the disks. The previous step has ensured that there should be no data on the disk group and it should be safe to (still) select Full data migration and remove the disk group.

Do this for all existing Hybrid disk groups and once all disk groups have been deleted from the host you are ready to remove the existing spinning disks and replace them with flash disks. The only thing to ensure before attempting to claim the new SSDs is that they don’t have any previous partitions on them…if so you can use the ESXi Embedded Host Client to remove any existing partitions.
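
If you'd rather clear them from the host shell, partedUtil can do it too, something like the below (device name illustrative; be very sure you have the right disk):

    # List the partition table, then delete partition 1 from the disk
    partedUtil getptbl /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx
    partedUtil delete /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx 1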

Warning: Again it's worth mentioning that any full data migration is going to take a fair amount of time depending on the consumed storage of your disk groups and the types of disks being used.

Repeat this process on all remaining hosts in the cluster with Hybrid disk groups until you have a full All Flash cluster as shown above. From here we are able to take advantage of erasure coding, deduplication and compression…I will finish that off in part three of this series.

 

VSAN Upgrading from 6.1 to 6.2 Hybrid to All Flash – Part 1

When VSAN 6.2 was released earlier this year it came with new and enhanced features, and depending on what version you were running you might not have been able to take advantage of them all right away. Across all versions Software Checksum was added, with Advanced and Enterprise versions getting VSAN's implementation of Erasure Coding (RAID 5/6), Deduplication and Compression available for the All Flash version, and QoS IOPS Limiting available in Enterprise only.

With the price of SSDs continuing to fall and an expanding HCL it seems like All Flash instances are becoming more the norm, and for those that have already deployed VSAN in a Hybrid configuration the temptation to upgrade to All Flash is certainly there. Duncan Epping has previously blogged the overview of migrating from Hybrid to All Flash, so I wanted to expand on that post and go through the process in a little more detail. This is a two part blog post with a lot of screenshots to complement the process, which is outlined below.

Use the links below to page jump.

Warning: Before I begin it's worth mentioning that this is not a short process, so make sure you plan it out relative to the existing size of your VSAN cluster. In talking with other people who have gone through the disk format upgrade, the average rate seems to be about 10TB of consumed data per day depending on the type of disks being used. I'll reference some posts at the end that relate to the disk upgrade process as it has been troublesome for some, however it's also worth pointing out that the upgrade process is non-disruptive for running workloads.

Existing Configuration:

  • Three Host Cluster
  • vCenter 6.0.0 Update 2
  • ESXi 6.0.0 Update 1
  • Two Disk Groups Per Host
  • 1x 200GB SSD and 2x 600GB HDD
  • VSAN Default Policy FTT=1

Upgrade Existing Hosts to 6.0 Update 2:

At the time of writing, ESXi 6.0.0 Update 2 is the latest release and contains the VSAN 6.2 codebase. From the official VMware upgrade matrix it seems you can't upgrade from VSAN versions older than 6.1, so if you are on a 5.x or 6.0 release you will need to take note of this VMwareKB to get to ESXi 6.0.0 Update 2. A great resource for the latest builds, as well as links to upgrade from, can be found here:

https://esxi-patches.v-front.de/ESXi-6.0.0.html

For a quick upgrade directly from the VMware online host update repository you can do the following on each host in the cluster after putting them into VSAN Maintenance Mode. Note that there are also some advanced settings that are recommended as part of the VSAN Health Checks in 6.2.
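
As a rough sketch of that process (the image profile name was the 6.0 Update 2 profile at the time of writing; check the link above for the current one):

    # Allow the host to reach the online depot, apply the image profile, then re-lock the firewall
    esxcli network firewall ruleset set -e true -r httpClient
    esxcli software profile update -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml -p ESXi-6.0.0-20160302001-standard
    esxcli network firewall ruleset set -e false -r httpClient
    reboot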

After rolling through each host in the cluster make sure that you have an updated copy of the VSAN HCL and run a health check to see where you stand. You should see a warning about the disks needing an upgrade and if any hosts didn’t have the above advanced settings applied you will have a warning about that as well.

Expanding VSAN Cluster:

As part of this upgrade I am also adding an additional host to the existing three to expand to a four host cluster. I am doing this for a couple of reasons: notwithstanding the accepted design position of four hosts being better than three from a data availability point of view, you also need a minimum of four hosts if you want to enable RAID5 erasure coding (six is required as a minimum for RAID6). The addition of the fourth host also allowed me to roll through the Hybrid to All Flash upgrade with a lot more headroom.

Before adding the new host to the existing cluster you need to ensure that the build is consistent with the existing hosts in terms of versioning and, more importantly, networking. Ensure that you have configured a VMkernel interface for VSAN traffic and marked it as such through the Web Client. If you don't do this prior to putting the host into the existing cluster, I found that the management VMkernel interface gets enabled by default for VSAN.

If you notice below this cluster is also NSX enabled, hence the events relating to Virtual NICs being added. Most importantly the host can see other hosts in the cluster and is enabled for HA.

Once in the cluster the host can be used for VM placement with data served from the existing hosts with configured disk groups over the VSAN network.

Upgrade License:

At this point I upgraded the licenses to enable the new features in VSAN 6.2. As a refresher on VSAN licensing there are three editions with the biggest change from previous versions being that to get the Deduplication and Compression, Erasure Coding and QoS features you need to be running All Flash and have an Enterprise license key.

To upgrade the license you need to head to Licensing under the Configuration section of the Manage Tab whilst the Cluster is selected. Apply the new license and you should see the following.

Upgrade Disk Format:

If you have read up on upgrading VSAN you will know that there is a disk format upgrade required to get the benefits of the newer versions. Once you have upgraded both vCenter and hosts to 6.0.0 Update 2, if you check the VSAN Health under the Monitor tab of the cluster you should see a failure talking about v2 disks not working with v3 disks, as shown below.

You can click on the Upgrade On-Disk Format button here to kick off the process. This can also be triggered from the Disk Management section under the Virtual SAN menu in the Manage cluster section of the Web Client. Once triggered you will see some events fire and an update in progress message near the version number.

Borrowing from one of Cormac Hogan's posts on VSAN 6.2, the following explains what is happening during the disk format upgrade. Also described in the blog post is a way of using the Ruby vSphere Console (RVC) to monitor the progress in more detail.

There are a few sub-steps involved in the on-disk format upgrade. First, there is the realignment of all objects to a 1MB address space. Next, all vsanSparse objects (typically used by snapshots) are aligned to a 4KB boundary. This will bring all objects to version 2.5 (an interim version) and readies them for the on-disk format upgrade to V3. Finally, there is the evacuation of components from a disk group, then the deletion of said disk group and finally the recreation of the disk group as a V3. This process is then repeated for each disk group in the cluster, until finally all disks are at V3.

As explained above, the upgrade can take a significant amount of time depending on the number of disk groups, the data consumed on your VSAN datastore and the type of disks being used (SAS based vs SATA/NL-SAS). Once complete you should have a green tick and the On-Disk Format version reporting 3.0.

With that done we can move ahead to the Hybrid to All Flash conversion. Look out for Part 2 of this series coming soon.

References:

Hybrid vs All-flash VSAN, are we really getting close?

VSAN 6.2 Part 2 – RAID-5 and RAID-6 configurations

VSAN 6.2 Part 12 – VSAN 6.1 to 6.2 Upgrade Steps
