Tag Archives: vSphere

Quick Fix: Deploying Multiple Ubuntu 18.04 VMs From Template with DHCP Results in Same IP Allocation

In the continuing work I've been doing with Terraform, I've come across a number of gotchas when working with VM templates and deploying them en masse. The nature of the work is that I'm creating and destroying VMs often. Generally speaking I like using static IP addresses, but for the project I'm working on I needed an option to deploy and configure the networking with DHCP. Windows and CentOS gave me no issues, however when I went to deploy the Ubuntu 18.04 template I started getting errors on the plan execution.

When I looked at the output of the Terraform plan where I export the VM IP addresses, the JSON output showed that all the cloned VMs had been assigned the same IP address.

At first I assumed it was due to the same MAC address being assigned by ESXi to the cloned VMs, resulting in the machines being allocated the same IP. However, when I checked the MAC addresses they were all different.

What is Machine-ID:

After some digging online I came across a change in behaviour where Ubuntu uses the machine-id to request DHCP addresses. Ubuntu Server's default networking goes through cloud-init, which by default sends /etc/machine-id in the DHCP request. This leads to the duplicate IP situation.

The /etc/machine-id file contains the unique machine ID of the local system that is set during installation or boot. The machine ID is a single newline-terminated, hexadecimal, 32-character, lowercase ID. When decoded from hexadecimal, this corresponds to a 16-byte/128-bit value. This ID may not be all zeros.

The machine ID is usually generated from a random source during system installation or first boot and stays constant for all subsequent boots. Optionally, for stateless systems, it is generated during runtime during early boot if necessary.

Quick Fix:

From a template perspective there is a quick fix that can be applied where the machine-id file is blanked out, meaning a new ID is generated upon first boot. You can't simply delete the machine-id file, as it needs to exist; if it doesn't, the deployment will fail because it expects the file to be there in some form.

The simplest way I achieved this was by zeroing out the file:
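The original snippet isn't reproduced here, but a minimal sketch of the approach on a stock Ubuntu 18.04 template looks like this (the dbus step is an assumption and only needed if /var/lib/dbus/machine-id is a separate copy rather than a symlink):

```bash
# Truncate /etc/machine-id so it still exists but is empty;
# systemd generates a fresh ID on the next boot of the clone.
sudo truncate -s 0 /etc/machine-id

# Optional: if /var/lib/dbus/machine-id is its own file, point it
# at /etc/machine-id so the two stay in sync after regeneration.
sudo rm /var/lib/dbus/machine-id
sudo ln -s /etc/machine-id /var/lib/dbus/machine-id
```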

Once done, the VM can be saved again as a template and the cloning operation will result in unique IPs being handed out by the DHCP server.

References:

http://manpages.ubuntu.com/manpages/bionic/man5/machine-id.5.html

https://www.freedesktop.org/software/systemd/man/machine-id.html

 

Using Variable Maps to Dynamically Deploy vSphere VMs with Terraform

I’ve been working on a project over the last couple of weeks that has enabled me to sharpen my Terraform skills. There is nothing better than learning by doing and there is also nothing better than continuously improving code through more advanced constructs and methods. As this project evolved it became apparent that I would need to be able to optimize the Terraform/PowerShell to more easily deploy VMs based on specific vSphere templates.

Rather than have one set of Terraform declarations per template (resulting in a lot of code duplication), or have the declaration variables tied to specific operating systems and changing (resulting in more manual work) depending on what was being deployed, I looked for a way to make it even more "singularly declarative".

Where this became very handy was when I was looking to deploy VMs based on Linux distro. 98% of the Terraform code is the same no matter whether Ubuntu or CentOS is being used. The only differences were the vSphere Template being used to clone the new VM from, the template password and, in this case, a remote-exec call that needed to be made to open a firewall port.

To get this working I used Terraform variable maps. As you can see below, the idea behind using maps is to allow groupings of like variables in one block declaration. These map values are then fed through to the rest of the Terraform code. Below is an example of a maps.tf file that I keep separate from the variables.tf file, which was an easier way to logically separate what was being configured using maps.
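A minimal sketch of such a maps.tf (the variable names, template names and commands are illustrative assumptions, not the project's actual values):

```hcl
# maps.tf - distro selection plus per-distro lookups
variable "vsphere_linux_distro" {
  description = "Linux distro to deploy (ubuntu or centos)"
  default     = "ubuntu"
}

variable "linux_template" {
  type = "map"
  default = {
    ubuntu = "TPL-UBUNTU-1804"
    centos = "TPL-CENTOS-7"
  }
}

variable "linux_template_password" {
  type = "map"
  default = {
    ubuntu = "ubuntu-template-password"
    centos = "centos-template-password"
  }
}

variable "linux_firewall_command" {
  type = "map"
  default = {
    ubuntu = "sudo ufw allow 6443/tcp"
    centos = "sudo firewall-cmd --permanent --add-port=6443/tcp && sudo firewall-cmd --reload"
  }
}
```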

At the top I have a standard variable that is the only variable that changes and needs setting. If ubuntu is set as the vsphere_linux_distro, then all the map values keyed ubuntu will be used; the same applies if that variable is set to centos.

This is set in the terraform.tfvars file and links back to the mappings. From here Terraform will look up the linux_template variable and map it to the various mapped values.
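A sketch of how that lookup flows through to the template data source (the datacenter variable and resource labels are illustrative):

```hcl
data "vsphere_datacenter" "dc" {
  name = "${var.vsphere_datacenter}"
}

# The template name is built by looking up the map key that matches
# the distro selected in terraform.tfvars.
data "vsphere_virtual_machine" "template" {
  name          = "${lookup(var.linux_template, var.vsphere_linux_distro)}"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}
```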

The data source above, which dictates what template is used, builds the name from the lookup function of the base variable and the map value.
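The same pattern feeds the provisioner and its connection details. A cut-down, illustrative fragment (this sits inside the vsphere_virtual_machine resource, with everything else omitted for brevity):

```hcl
provisioner "remote-exec" {
  # Run the distro-specific firewall command from the map.
  inline = [
    "${lookup(var.linux_firewall_command, var.vsphere_linux_distro)}",
  ]

  connection {
    type     = "ssh"
    user     = "root"
    # Use the template password that matches the selected distro.
    password = "${lookup(var.linux_template_password, var.vsphere_linux_distro)}"
  }
}
```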

Above, the values we set in the maps are being used to execute the right command depending on whether it is Ubuntu or CentOS, and also to use the correct password depending on the linux_distro set. As mentioned, the declared variable can either be set in the terraform.tfvars file or passed through at the time the plan is executed.

The result is a more controlled and easily managed way to use Terraform to deploy VMs from different pre-existing VM templates. The variable mappings can be built up over time and used as a complete library of different operating systems with different options. Another awesome feature of Terraform!

References:

https://www.terraform.io/docs/configuration-0-11/variables.html#maps

Assigning vSphere Tags with Terraform for Policy Based Backups

vSphere Tags are used to add attributes to VMs so that they can be used to help categorise VMs for further filtering or discovery. vSphere Tags have a number of use cases, and Melissa has a great blog post here on the power of vSphere Tags, their configuration and their application. Veeam fully supports the use of vSphere Tags when configuring Backup or Replication Jobs. The use of tags essentially transforms static jobs into dynamic, policy-based management for backup and replication.

Once a job is set to build its VM inventory from Tags, there is almost no need to go back and amend the job settings to cater for VMs that are added to or removed from vCenter. In my lab I have a Tag Category configured with two tags that are used to set whether a VM is included in or excluded from the backup job. Every time the job runs it will source the VM list based on these policy elements, resulting in less management overhead and capturing changes to the VM inventory automatically.

vSphere Tags with Terraform:

I’ve been deploying a lot of lab VMs using Terraform of late. The nature of these deployments means that VMs are being created and destroyed often. I was finding that VMs that should be backed up were not being backed up, while VMs that shouldn’t be backed up were being backed up. This also leads to issues with the backup job itself…an example was this week, when I was working on my Kubernetes vSphere Terraform project.

The VMs were present at the start of the backup, but during the window the VMs had been destroyed, leaving the job in an error state. These VMs, being transient in nature, should never have been part of the job. With the help of the tags I created above I was able to use Terraform to assign those tags to VMs created as part of the plan.

With Terraform you can create Tag Categories and Tags as part of the code. You can also leverage existing Tag Categories and Tags and feed that into the declarations as variables. For backup purposes, every VM that I create now has one of the two tags assigned to it. Outside of Terraform, I would apply this from the Web Client or via PowerShell, but the idea is to ensure a repeatable, declarative VM state where any VM created with Terraform has a tag applied.

Terraform vSphere Tag Configuration Example:

The first step is to declare two data sources somewhere in the TF code. I typically place these into a main.tf file.

We have the option to hard code the names of the Tag and Tag Category in the data source, but a better way is to use variables for maximum portability.
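A sketch of what that looks like (data source labels are illustrative):

```hcl
# main.tf - look up an existing Tag Category and Tag by name,
# with the names coming from variables rather than being hard coded.
data "vsphere_tag_category" "backup_category" {
  name = "${var.backup_tag_category}"
}

data "vsphere_tag" "backup_tag" {
  name        = "${var.backup_tag}"
  category_id = "${data.vsphere_tag_category.backup_category.id}"
}
```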

The terraform.tfvars file is where we can set the variables.

We also need to create a corresponding entry in the variables.tf file.
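Something along these lines (the category name is an assumption; the tag name matches the one used later in this post):

```hcl
# terraform.tfvars
backup_tag_category = "TPM03-Backup"
backup_tag          = "TPM03-NO-BACKUP"

# variables.tf
variable "backup_tag_category" {}
variable "backup_tag" {}
```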

Finally, we can set the tag information in the VM .tf file that references the data sources, which in turn reference the variables that have been configured.
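In the VM resource itself the tags argument takes a list of tag IDs, so the relevant snippet is as simple as this (the rest of the resource body is trimmed):

```hcl
resource "vsphere_virtual_machine" "vm" {
  # ... clone, CPU, memory, disk and network settings omitted ...

  # Assign the backup policy tag to every VM this plan creates.
  tags = ["${data.vsphere_tag.backup_tag.id}"]
}
```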

The Result:

Once the Terraform plan has been applied and the VMs created, the Terraform state file will contain references to the tags, and the output from the plan run will show them assigned to the VMs.

The Tag will be assigned to the VM and visible as an attribute in vCenter.

Any Veeam Backup Job that is configured to use Tags will now dynamically add or exclude VMs created by the Terraform plan. In the case above, the VM has the TPM03-NO-BACKUP tag assigned, which means it will be part of the exclusion list for the backup job.

Conclusion:

vSphere Tags are an excellent way to configure policy-based backup and replication jobs through Veeam. Terraform is great for deploying infrastructure in a repeatable, declarative way. Having Terraform assign Tags to VMs as they are deployed allows us to control whether a VM is included in or excluded from a backup policy. If deploying VMs from Terraform, take advantage of vSphere Tags and have them as part of your deployments.

References:

https://www.terraform.io/docs/providers/vsphere/r/tag.html

Deploying a Kubernetes Sandbox on VMware with Terraform

Terraform from HashiCorp has been a revelation for me since I started using it in anger last year to deploy VeeamPN into AWS. From there it has allowed me to automate lab Veeam deployments, configure a VMware Cloud on AWS SDDC networking and configure NSX vCloud Director Edges. The time saved by utilising the power of Terraform for repeatable deployment of infrastructure is huge.

When it came time for me to play around with Kubernetes to get myself up to speed with what was happening under the covers, I found a lot of online resources on how to install and configure a Kubernetes cluster on vSphere with a Master/Node deployment. I found that while I was tinkering, I would break deployments, which meant I had to start from scratch and reinstall. This is where Terraform came into play. I set about creating a repeatable Terraform plan to deploy the required infrastructure onto vSphere and then have Terraform remotely execute the installation of Kubernetes once the VMs had been deployed.

I’m not the first to do a Kubernetes deployment on vSphere with Terraform, but I wanted something simple and repeatable to allow quick initial deployment. Other examples I found use Kubespray along with Ansible and other dependencies. What I have ended up with is a self-contained Terraform plan that can deploy a Kubernetes sandbox with a Master plus a dynamic number of Nodes onto vSphere using CentOS as the base OS.

What I haven’t automated is the final step of joining the nodes to the cluster. That step takes a couple of seconds once everything else is deployed. I also haven’t integrated this with VMware Cloud Volumes or prepped for persistent volumes. Again, the idea here is to have a sandbox deployed within minutes to start tinkering with. For those that are new to Kubernetes it will help you get to the meat and gravy a lot quicker.

The Plan:

The GitHub Project is located here. Feel free to clone/fork it.

In a nutshell, I am utilising the Terraform vSphere Provider to deploy a VM from a preconfigured CentOS template which will end up being the Kubernetes Master. All the variables are defined in the terraform.tfvars file and no other configuration needs to happen outside of this file. Key variables are fed into the other tf declarations to deploy the Master and the Nodes as well as how to configure the Kubernetes cluster IP networking.

[Update] – It seems as though Kubernetes 1.16.0 was released over the past couple of days. This resulted in the scripts not installing the Master Node correctly due to an API issue when configuring the POD networking. Because of that I’ve updated the code to use a variable that specifies the Kubernetes version being installed. This can be found on Line 30 of the terraform.tfvars. The default is 1.15.3.

The main items to consider when entering your own variables for the vSphere environment are Line 18 and then Lines 28-31. Line 18 defines the Kubernetes POD network which is used during the configuration, while Lines 28-31 set the number of nodes, the starting name for the VMs, and then use two separate variables to build out the IP addresses of the nodes. Pay attention to the format of the network on Line 30 and then choose the starting IP for the Nodes on Line 31. This is used as the starting IP for the Node IPs and is enumerated in the code using the Terraform count construct.
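As an illustration of that count-based enumeration (the variable names here are placeholders rather than the project's actual names; required VM settings are omitted):

```hcl
resource "vsphere_virtual_machine" "k8s_node" {
  count = "${var.node_count}"
  name  = "${var.node_name_prefix}-${count.index + 1}"

  # ... resource pool, datastore, CPU, memory, disk and NIC settings omitted ...

  clone {
    template_uuid = "${data.vsphere_virtual_machine.template.id}"

    customize {
      network_interface {
        # Starting IP plus the count index gives each node a unique address.
        ipv4_address = "${var.node_network}.${var.node_ip_start + count.index}"
        ipv4_netmask = 24
      }
      ipv4_gateway = "${var.gateway}"

      linux_options {
        host_name = "${var.node_name_prefix}-${count.index + 1}"
        domain    = "local"
      }
    }
  }
}
```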

By using Terraform’s remote-exec provisioner, I am then using a combination of uploaded scripts and direct command line executions to configure and prep the Guest OS for the installation of Docker and Kubernetes.

You can see towards the end I have split up the command line scripts to ensure that the dynamic nature of the deployment is attained. The remote-exec on Line 82 pulls in the POD network variable and executes it inline. The same is done for Lines 116-121, which configure the Guest OS hosts file to ensure name resolution. They are used together with two other scripts that are uploaded and executed.
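Conceptually, the POD network piece is just an inline command with the variable interpolated. The exact commands live in the project's .tf files, so treat this as a sketch with an assumed variable name:

```hcl
provisioner "remote-exec" {
  inline = [
    # The POD network CIDR variable is interpolated straight into kubeadm init.
    "sudo kubeadm init --pod-network-cidr=${var.k8s_pod_network}",
  ]
}
```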

The scripts have been built up from a number of online sources that go through how to install and configure Kubernetes manually. For the networking, I went with Weave Net after having a few issues with Flannel. There are lots of other networking options for Kubernetes… this is worth a read.

For better DNS resolution on the Guest OS VMs, the hosts file entries are constructed from the IP address settings set in the terraform.tfvars file.

Plan Execution:

The Nodes can be deployed dynamically using a Terraform var option when applying the plan. This allows for zero to as many nodes as you want for the sandbox… though three seems to be a nice round number.
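For example (the variable name here is an assumption; check the project's variables.tf for the real one):

```bash
terraform apply -var 'worker_count=3'
```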

The number of nodes can also be set in the terraform.tfvars file on Line 28. The variable set during the apply will take precedence over the one declared in the tfvars file. One of the great things about Terraform is we can alter the variable either way which will end up with nodes being added or removed automatically.

Once applied, the plan will work through the declaration files and produce output similar to what is shown below. You can see that in just over 5 minutes we have deployed one Master and three Nodes ready for further config.

The next step is to use the kubeadm join command on the nodes. For those paying attention, the complete join command was output via the Terraform apply. Once applied on all nodes you should have a ready-to-go Kubernetes cluster running on CentOS on top of vSphere.
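It will look something like the following, with the real endpoint, token and CA cert hash coming from the apply output (the values here are placeholders):

```bash
sudo kubeadm join 192.168.1.200:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>
```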

Conclusion:

While I do believe that the future of Kubernetes is such that a lot of the initial installation and configuration will be taken out of our hands and delivered to us via services based in public clouds or through platforms such as VMware’s Project Pacific, having a way to deploy a Kubernetes cluster locally on vSphere is a great way to get to know what goes into making a containerisation platform tick.

Build it, break it, destroy it and then repeat… that is the beauty of Terraform!

References:

https://github.com/anthonyspiteri/terraform/tree/master/deploy_kubernetes_CentOS

 

Quick Fix – ESXi loses all Network Configuration… but still runs?

I had a really strange situation pop up in one of my lab environments over the weekend. vSAN Health was reporting that one of the hosts had lost networking connectivity to the rest of the cluster. This is something I’ve seen intermittently at times, so I waited for the condition to clear up. When it didn’t, I went to put the host into maintenance mode, but found that I wasn’t getting the expected vSAN options.

I have seen situations recently where the enable vSAN option on the VMkernel interface had been cleared and vCenter thought there were networking issues, so I thought maybe it was this again. Not that that situation in itself was normal, but what I found when I went to view the state of the VMkernel adapters from the vSphere Web Client was even stranger.

No adapters listed!

The host wasn’t reported as being disconnected and there was still connectivity to it via the Web UI and SSH. To make sure this wasn’t a visual error from the Web Client I SSH’ed into the host and ran esxcli to get a list of the VMkernel interfaces.
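For reference, the commands I use for that are:

```bash
# List the VMkernel interfaces the host thinks it has
esxcli network ip interface list

# And their IPv4 configuration
esxcli network ip interface ipv4 get
```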

Unable to find vmknic for dvsID: xxxxx

So from the cli, I couldn’t get a list of interfaces either. I tried restarting the core services without luck and still had a host that was up with VMs running on it without issue, yet it was reporting networking issues and had no network interfaces configured in its running state.
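For anyone in the same situation, restarting the management agents from the ESXi shell typically looks like this:

```bash
# Restart the host and vCenter agents individually...
/etc/init.d/hostd restart
/etc/init.d/vpxa restart

# ...or restart all management services in one go
services.sh restart
```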

Going to the console… the situation was not much better.

Nothing… no network or host information at all 🙂

Not being able to reset the management network, my only option from here was to reboot the server. Upon reboot the host did come back online, however the networking was reporting as 0.0.0.0/0 from the console and now the host was completely offline.

I decided to reboot using the last known good configuration as shown below:

Upon reboot using the last known good configuration, all previous network settings were restored and I had a list of VMkernel interfaces again, both from the Web Client and from the cli.

Because of the “dirty” vSAN reboot, as is usual with anything that disrupts vSAN, the cluster needed some time to get itself back into working order. While some VMs were in an orphaned or unavailable state after the reboot, once the vSAN re-sync had completed all VMs were back up and operational.

Cause and Official Resolution:

The workaround to bring back the host networking seemed to do the trick, however I don’t know what the root cause was for the host losing all of its network config. I have an active case open with VMware Support at the moment with the logs being analysed. I’ll update this post with the results when they come through.

ESXi Version: 6.7.0.13006603
vSphere Version: 6.7.0.30000
NSX-v: 6.4.4.11197766

AWS Outposts and VMware…Hybridity Defined!

Now that AWS re:Invent 2018 has well and truly passed…the biggest industry shift to come out of the event from my point of view was the fact that AWS are going full guns blazing into the on-premises world. With the announcement of AWS Outposts the long held belief that the public cloud is the panacea of all things became blurred. No one company has pushed such a hard cloud only message as AWS…no one company had the power to change the definition of what it is to run cloud services…AWS did that last week at re:Invent.

Yes, Microsoft has had the Azure Stack concept for a number of years now, however they have not yet executed on the promise of it. Azure Stack is seen by many as a white elephant even though it’s now in the wild and (depending on who you talk to) doing relatively well in certain verticals. The point though is that even Microsoft did not have the power to make people truly believe that a combination of a public cloud and on-premises platform was the path to hybridity.

AWS is a juggernaut and it’s my belief that they have now reached an inflection point in mindshare and can dictate trends in our industry. They had enough power for VMware to partner with them so VMware could keep vSphere relevant in the cloud world. This resulted in VMware Cloud on AWS. It seems like AWS have realised that with this partnership in place, they can muscle their way into the on-premises/enterprise world that VMware still dominates…at this stage.

Outposts as a Product Name is no Accident

Like many, I like the product name Outposts. It’s catchy and straight away you can make sense of what it is…however, I decided to look up the official meaning of the word…and it makes for some interesting reading:

  • An isolated or remote branch
  • A remote part of a country or empire
  • A small military camp or position at some distance from the main army, used especially as a guard against surprise attack

The first definition as per the Oxford Dictionary fits the overall idea of AWS Outposts: putting a compute platform in an isolated or remote branch office that is separate to AWS regions, while also offering the ability to consume that platform as if it were an AWS region. This represents a legitimate use case for Outposts and can be seen as AWS filling a gap in the market that shifting IT sentiment has been craving.

The second definition is an interesting one when taken in the context of AWS and Amazon as a whole. They are big enough to be their own country and have certainly built up an empire over the last decade. All empires eventually crumble, however AWS is not going anywhere fast. This move does however indicate a shift in tactics and means that AWS can penetrate the on-premises market quicker to extend their empire.

The third definition is also pertinent in context to what AWS are looking to achieve with Outposts. They are setting up camp and positioning themselves a long way from their traditional stronghold. However my feeling is that they are not guarding against an attack…they are the attack!

Where does VMware fit in all this?

Given my thoughts above…where does VMware fit into all this? At first, when the announcement was made on stage, I was confused. With Pat Gelsinger on stage next to Andy Jassy my first impression was that VMware had given in. Here was AWS announcing a directly competitive platform to on-premises vSphere installations. Not only that, but VMware had announced Project Dimension at VMworld a few months earlier, which looked to be their own on-premises managed service offering…though the wording around that was for edge rather than on-premises.

With the initial dust settled and after reading this blog post from William Lam, I came to understand the VMware play here.

VMware and Amazon are expanding their partnership to deliver a new, as-a-service, on-premises offering that will include the full VMware SDDC stack (vSphere, NSX, vSAN) running on AWS Outposts, a fully managed and configurable server and network installation built with AWS-designed hardware. VMware Cloud on AWS Outposts is VMware’s new as-a-service offering in partnership with AWS to run on AWS Outposts – it will leverage the innovations we’ve developed with Project Dimension and apply them on top of AWS Outposts. VMware Cloud on AWS Outposts will be a subscription-based service and will support existing VMware payment options.

The reality is that on-premises environments are not going away any time soon but customers like the operating model of the cloud. More and more they don’t care about where infrastructure lives as long as a services outcome is achieved. Customers are after simplicity and cost efficiency. Outposts delivers all this by enabling convenience and choice…the choice to run VMware for traditional workloads using the familiar VMware SDDC stack all while having access to native AWS services.

A Managed Service Offering means a Mind shift

The big shift here from VMware that began with VMware Cloud on AWS is a shift towards managed services. A fundamental change in the mindset of the customer in the way in which they consume their infrastructure. Without needing to worry about the underlying platform, IT can focus on the applications and the availability of those applications. For VMware this means from the VM up…for AWS, this means from the platform up.

VMware Cloud on AWS is a great example of this new managed services world, with VMware managing most of the traditional stack. VMware can now extend VMware Cloud on AWS to Outposts to boomerang the management of on-premises as well. Overall, Outposts is a win-win for both AWS and VMware…however the proof will be in the execution and uptake. We won’t know how it all pans out until the product becomes available…apparently in the latter half of 2019.

IT admins have some contemplating to do as well…what does a shift to managed platforms mean for them? This is going to be an interesting ride as it pans out over the next twelve months!

References:

VMware Cloud on AWS Outposts: Cloud Managed SDDC for your Data Center

vSphere 6.7 Update 1 – Top New Features and Platform Supportability

Last week VMware released vSphere 6.7 Update 1. While the buzz around this release was less than the previous release it still contains a ton of enhancements for vCenter, ESXi and vSAN. Like 6.7 before it, this is a lot more than a point release and represents a significant upgrade from vSphere 6.7.

Looking through the release notes, there appears to be less for service providers in this release, though I still feel it’s important to highlight the base hypervisor (ESXi) as well as the management platform (vCenter). vSAN has had another significant update that will warrant a post of its own. I’ll also talk about current interoperability with vCloud Director and NSX, as well as Veeam’s current supportability for vSphere 6.7 Update 1.

  • New (almost 100%) Fully functional HTML5 client
  • Upgrade path from vSphere 6.5 U2 to vSphere 6.7 Update 1
  • Enhanced support for NVIDIA Quadro vDWS VMs and support for Intel FPGA
  • New vCenter Convergence Tool
  • Updated vSAN
  • Enhanced vSphere Content Library

Fully Functional HTML5 Client

Most functions have now been ported across to the HTML5 vSphere Client. This results in administrators not having to switch back and forth between the FLEX Web Client and the HTML5 client. Update 1 features:

  • vCenter High Availability (VCHA)
  • Auto Deploy
  • Host Profiles
  • vSphere Update Manager
  • Network Topology Diagrams
  • Performance Charts
  • Improved Searching
  • Dark Theme

Emad Younis has a detailed post here that goes through the new features.

Upgrade Path from vSphere 6.5 Update 2 to vSphere 6.7 Update 1

One of the issues with vSphere 6.7 was that the vSphere 6.5 Update 2 release could not be upgraded to vSphere 6.7. With the release of vSphere 6.7 Update 1, the upgrade from vSphere 6.5 Update 2 to vSphere 6.7 Update 1 is now fully supported.

Enhanced Content Library

New improvements to the Content Library in vSphere 6.7 Update 1 enable the importing of OVA templates from an HTTPS endpoint and also from local storage. Importing now verifies the certificate of the OVA bundle and also natively supports VM templates (VMTX) and associated operations, such as deploying a VM directly from the Content Library.

vCenter Specific Enhancements

With vCenter Server 6.7 Update 1, you can move a vCenter Server with an Embedded Platform Services Controller from one vSphere domain to another vSphere domain. Services such as tagging and licensing are retained and migrated to the new domain.

There is a new Burst Filter to manage event bursts and prevent the database of vCenter Server from flooding with identical events over a short period of time.

vCenter Server 6.7 Update 1 supports VMware vSphere vMotion between on-premises vCenter Servers and VMware Cloud on AWS. You can use either the vSphere Client, the vSphere Web Client, or the API. Both sides need to be at 6.7 Update 1.

You can also import Open Virtual Appliance (OVA) files into a Content Library. The OVA files are unzipped during the import, providing manifest and certificate validations, and an OVF library item is created that enables deployment of virtual machines from the Content Library.

With vCenter Server 6.7 Update 1, you can use the Appliance Management User Interface to configure and edit the firewall settings of the vCenter Server Appliance.

ESXi Specific Enhancements

There are a few vendor/hardware related features and enhancements in Update 1 for ESXi 6.7. The release notes cover them in detail here. But as mentioned above, probably the biggest addition is the ability to upgrade from ESXi 6.5 Update 2, which I know a few service providers were stuck on. In terms of known issues, the release notes also contain a good list. There are some that impact Service Providers, so it’s worth reading through them.

vCD and NSX Supportability:

Shifting from new features and enhancements to an important subject when talking about service provider platforms…VMware product compatibility. For those VCPP Service Providers running a Hybrid Cloud, you should be running a combination of vCloud Director SP and/or NSX-v, of which the NSX-v 6.4.3 and 6.4.2 versions are supported at release. Most providers should be on these releases, so that’s good news.

Looking at vCloud Director, it looks like 9.5 is the only supported version at the moment.

Veeam Backup & Replication Supportability: 

Veeam commits to supporting major version releases within 90 days or sooner of GA. There have been many discussions going around as to whether an Update counts as a major release these days…and the general consensus now is that VMware is releasing these updates with enough changes to potentially impact backup supportability.

So with that, those Service Providers that are also VCSPs using Veeam to back up their infrastructure should not upgrade to vSphere 6.7 Update 1 until Backup & Replication Update 4 is released. For those that are bleeding edge and have already updated, your only option is to go with the workaround that is detailed here. It works…but again, it’s a workaround.

Wrapping Up:

Rounding off this post, in the Known Issues section there is a fair bit to be aware of for 6.7 Update 1. It’s worth reading through all the known issues just in case there are any specific issues that might impact you.

Happy upgrading!

References:

https://docs.vmware.com/en/VMware-vSphere/6.7/rn/vsphere-vcenter-server-671-release-notes.html

https://docs.vmware.com/en/VMware-vSphere/6.7/rn/vsphere-esxi-671-release-notes.html

vSphere 6.7 – What’s in it for Service Providers Part 1

A few weeks ago after much anticipation VMware released vSphere 6.7. Like 6.5 before it, this is a lot more than a point release and represents a major upgrade from vSphere 6.5. There is so much packed into this new release that there is an official page with separate blog posts talking about the features and enhancements. As usual, I will go through some of the key features and enhancements that are included in the latest versions of vCenter and ESXi and as they relate back to the Service Providers that use vSphere as the foundation of their Infrastructure as a Service offerings.

There is a lot to get through, and like the vSphere 6.5 release, the “what’s new” will not fit into one post, so I’ll split the highlights between a couple of posts and cover ESXi specifically in a follow-up. I still feel it’s important to highlight the base hypervisor as well as the management platform. I’ll also talk about current interoperability with vCloud Director and NSX as well as Veeam supportability for vSphere 6.7.

The major features and enhancements as listed in the What’s New PDF are:

  • Scalability Enhancements
  • VMware vCenter Server Appliance Linked Mode
  • VMware vCenter Server Appliance Back Up Scheduler
  • Single Reboot
  • Quick Boot
  • Support for 4K Native Storage
  • Improved HTML 5 based vSphere Client
  • Security-at-Scale
  • Support for Trusted Platform Module (TPM) 2.0 and virtual TPM
  • Cross-vCenter Encrypted vMotion
  • Support for Microsoft’s Virtualization Based Security (VBS)
  • NVIDIA GRID vGPU Enhancements
  • vSphere Persistent Memory
  • Hybrid Linked Mode
  • Per-VM Enhanced vMotion Compatibility (EVC)
  • Cross-vCenter Mixed Version Provisioning – Simplify provisioning across hybrid cloud environments that have different vCenter versions

Below, a few of these are fleshed out in the context of Service Providers.

Enhanced vCenter Server Appliance:

The VCSA has been enhanced significantly in this release. Having used the VCSA exclusively for the past year in all my environments, I have a love/hate relationship with it. I still feel it’s nowhere near as stable as vCenter running on top of Windows and is prone to more issues than a Windows-based vCenter…however this 6.7 release will be the last one supporting or offering a Windows-based vCenter. With that, VMware have had to work hard on making the VCSA more resilient.

Compared to the 6.5 VCSA, 6.7 offers twice the performance in vCenter operations per second with a three times reduction in memory usage and three times faster DRS operations meaning that power on and other VM operations are performed quicker. This is great on a service provider platform with potentially lots of those operations happening during the course of a day. Hopefully this improves the responsiveness overall of the VCSA which I have felt at times to be poor under load or after an extended period of appliance uptime.

There has also been a number of updates to the APIs offered in vSphere, the VCSA and ESXi. William Lam has a great post on what’s new for APIs here, but all Service Providers should have teams looking at the API Explorer as it’s a great way to explore and learn what’s available.

Single Reboot and Quick Reboot:

For Service Providers who need to upgrade their platforms to maintain optimal compatibility, upgrading hosts can be time consuming at scale. vSphere 6.7 reduces ESXi host upgrade times by eliminating one of the two reboots normally required for major version upgrades: this is the single reboot feature. There is also vSphere Quick Boot, which restarts the ESXi hypervisor without rebooting the physical host, skipping time-consuming server hardware initialization and post-boot operation wait times. Both of these significantly reduce maintenance times.

This blog post covers both features in more detail.

Improved HTML 5 based vSphere Client:

While minor in terms of actual under the hood improvements, the efficiencies that are gained when it comes to a decent user interface are significant. When managing Service Provider platforms at scale, having a reliable client is important, and with the decommissioning of the VI client and the often frustrating performance of the Flex client, a near complete and workable HTML5 vSphere Client is a big plus for those who work day to day in vCenter.

The vSphere 6.7 vSphere Client has support for vSAN as well as having Update Manager fully built in. As per the NSX 6.4 update there is also limited management of NSX. There is also a new vROps plugin…this plugin is available out of the box once vROps has been linked with vCenter and offers dashboards directly in the vSphere Client, including overview and cluster views and alerts for both vCenter and vSAN. This is extremely handy for Service Providers who use vROps dashboards, as they no longer need to go to two different locations to get the info.

vCD and NSX Supportability:

Shifting from new features and enhancements to an important subject when talking about service provider platforms…VMware product compatibility. For those VCPP Service Providers running a Hybrid Cloud, you should be running a combination of vCloud Director SP and/or NSX-v, and at the moment there is no support for either on vSphere 6.7.

Looking at vCloud Director, it looks like 9.1 is supported, however given that you need to be running NSX-v with vCD these days and NSX is not yet supported, it doesn’t make much sense to suggest that there is total compatibility.

I suspect we will see NSX-v come out with a supported build shortly…though I’m only expecting vCloud Director SP to support 6.7 from version 9.1, which will mean upgrades.

Veeam Backup & Replication Supportability: 

Veeam commits to supporting major version releases within 90 days or sooner of GA. So with that, those Service Providers that are also VCSPs using Veeam to back up their infrastructure should not upgrade to vSphere 6.7 until Backup & Replication Update 3a is released. For those that are bleeding edge and have already updated, your only option at that point is our Agents for Windows and Linux until Update 3a is released.

Wrapping up Part 1:

Rounding off this post, in the Known Issues section there is a fair bit to be aware of for 6.7. It’s worth reading through all the known issues just in case there are any specific issues that might impact you. In upcoming posts in the vSphere 6.7 for Service Providers series I will cover more vCenter features as well as ESXi enhancements and what’s new in Core Storage.

Happy upgrading!

References:

https://docs.vmware.com/en/VMware-vSphere/6.7/rn/vsphere-esxi-vcenter-server-67-release-notes.html

https://blogs.vmware.com/vsphere/2018/04/introducing-faster-lifecycle-management-operations-in-vmware-vsphere-6-7.html

vSphere 6.5 Update 1 – What’s in it for Service Providers

Late last week VMware released vSphere 6.5 Update 1 which included updated builds of both vCenter and ESXi and as per usual I will go through some of the key features and fixes that are included in the latest versions of vCenter and ESXi. When looking through the release notes I generally keep an eye out for improvements that relate back to Service Providers who use vSphere as the foundation of their Managed or Infrastructure as a Service offerings. This update also contains an update to vSAN which is now at 6.6.1 so I’ll spend some time looking at what’s been added there.

 

New Features and Enhancements:

Without question this is a significant patch release for vCenter and ESXi, and the length of the release notes is testament to that point. In terms of new features there isn’t anything groundbreaking, but there are a few nice additions, like being able to run the VCSA GUI and CLI installers on Windows 2012, 2012 R2 and 2016 as well as macOS Sierra, while Ubuntu 17.04 is now supported for Guest OS Customization. vCenter now supports Microsoft SQL Server 2014 SP2 and SQL Server 2016 SP1, as well as some increased configuration maximums: Linked Mode now supports 15 vCenter instances, 5,000 ESXi hosts and 50,000 powered-on virtual machines.

Ability to Upgrade or Migrate from vCenter 6.0 Update 3:

This release addresses the previous limitation in the upgrade and migration path for those running vSphere 6.0 U3 and going to vSphere 6.5. I know this will make a lot of providers happy, as I know a lot that had to go to 6.0 Update 3 to address existing bugs in the platform but were not yet ready or able to go to 6.5 at the time.

HTML5 Client Update:

The HTML5 Web Client has gotten its own update that brings it up to speed with the 3.15 Fling version, however it’s still only partially functional, which remains somewhat frustrating…The online documentation for supported functionality has been updated to vSphere 6.5 U1 and is available here.

The list below is of the main updates in this release.

  • DRS/HA VM overrides
  • SDRS rules
  • Content Library – further actions
  • Roles and Global Permissions
  • Download multiple files as zip
  • Distributed Switch – further actions
  • Fault Tolerance
  • SPBM
  • VM Hardware – further items
  • Apply Customize Guest OS during Clone
  • VM Migration – further actions (compute+storage, Cross VC, batch)

vSAN Features:

For service providers, vSAN 6.6 was another major release that shored up vSAN’s status as a serious storage platform for service provider environments.

vSAN 6.6.1 introduces three key new features:

  • VMware vSphere Update Manager (VUM) integration
  • Performance Diagnostics in vSAN Cloud Analytics
  • Storage Device Serviceability enhancement

The ability to upgrade with VUM is a nice touch and continues to improve on the usability and manageability of vSAN. For a full look at what’s new in this release for vSAN 6.6.1 head to this blog post.

Resolved Issues:

There are a bunch of resolved issues in this release and I’ve gone through the rather extensive list to pull out the biggest fixes that relate to my experience in service provider operations, extending this to include fixes that relate to backup operations. The majority of what I picked out relates to storage, networking, hosts and VM operations…the core of any platform, but even more important in the service provider world. A few of them are specific fixes for issues that I’ve come across myself…good to see them addressed!

vCenter:
  • First-boot failure occurs when upgrading from vSphere 5.5 or 6.0 to vSphere 6.5 on Windows: If an older version of the OpenSSL DLLs is installed, upgrading to vSphere 6.5 fails to run because the older DLL versions are loaded.
  • Affinity rules configured on vCenter Server 5.5 can cause crashes after upgrading to vCenter Server 6.5: Migrating a VM with affinity rules configured while on vCenter Server 5.5 to a cluster that has affinity rules configured on vCenter Server 6.0 or 6.5 can cause vCenter Server to crash.
  • VM Snapshot Size (GB) alarm is not triggered after the VM is powered on: The VM Snapshot Size (GB) alarm is reset if the virtual machine is shut down and fails to trigger after the VM is powered on. This issue occurs in alarms based on VM Snapshot (GB) and VM Total Size on Disk because their status is altered when the power state of the VM is changed, while the disk usage of a VM is the same regardless of the VM power state.
  • When you add ports to a vSphere Distributed Switch you get an error: Because of a race condition, when you add ports to a vSphere Distributed Switch you get the error message: Cannot create a new port because number of ports exceeds 2147483647, maximum number of ports allowed on vDS.
  • A runtime exception “Unable to retrieve data about the distributed switch” might occur while upgrading a vSphere Distributed Switch (vDS) from version 5.0 to 6.5: When you try to upgrade an existing distributed switch after the vCenter upgrade is completed, the runtime exception Unable to retrieve data about the distributed switch might occur in the wizard and the distributed switch cannot be upgraded. The exception is a result of an unexpected NULL value for a LACP property of the distributed switch, instead of TRUE or FALSE, as LACP is not supported for the current version of vSphere Distributed Switch.
  • Host configuration might not be available after vCenter Server restarts: After a vCenter Server restart, the host configuration might not be available if vCenter Server cannot communicate with the host. After connectivity is restored, the configuration becomes available.
  • OVF tool fails to upload OVF or OVA files larger than 10 GB: If you use OVF tool to upload OVF or OVA files larger than 10 GB, the upload might fail.

ESXi:

  • Virtual machine crashes on ESXi 6.5 when multiple users log on to a Windows Terminal Server VM: A Windows 2012 terminal server running VMware Tools 10.1.0 on ESXi 6.5 stops responding when many users are logged in. vmware.log will show messages similar to: 2017-03-02T02:03:24.921Z| vmx| I125: GuestRpc: Too many RPCI vsocket channels opened.
    2017-03-02T02:03:24.921Z| vmx| E105: PANIC: ASSERT bora/lib/asyncsocket/asyncsocket.c:5217
    2017-03-02T02:03:28.920Z| vmx| W115: A core file is available in "/vmfs/volumes/515c94fa-d9ff4c34-ecd3-001b210c52a3/h8-
    ubuntu12.04x64/vmx-debug-zdump.001"
    2017-03-02T02:03:28.921Z| mks| W115: Panic in progress... ungrabbing 
  • An ESXi host might fail with a purple diagnostic screen when collecting performance snapshots: An ESXi host might fail with a purple diagnostic screen when collecting performance snapshots with vm-support, due to calls for memory access after the data structure has already been freed.
  • Full duplex configured on physical switch may cause duplex mismatch issue with igb native Linux driver supporting only auto-negotiate mode for nic speed/duplex setting
    If you are using the igb native driver on an ESXi host, it always works in auto-negotiate speed and duplex mode. No matter what configuration you set up on this end of the connection, it is not applied on the ESXi side. The auto-negotiate support causes a duplex mismatch issue if a physical switch is set manually to a full-duplex mode.
  • An ESXi host might fail with a purple screen and a Spin count exceeded (refCount) – possible deadlock with PCPU error: This might happen when you reboot the ESXi host under the following conditions:
    • You use the vSphere Network Appliance (DVFilter) in an NSX environment
    • You migrate a virtual machine with vMotion under DVFilter control
  • A Virtual Machine (VM) with an e1000/e1000e vNIC might have network connectivity issues: When the e1000/e1000e driver tells the e1000/e1000e vmkernel emulation to skip a descriptor (the transmit descriptor address and length are 0), a loss of network connectivity might occur.
  • An ESXi host might stop responding when you migrate a virtual machine with Storage vMotion between ESXi 6.0 and ESXi 6.5 hosts: The vmxnet3 device tries to access the memory of the guest OS while the guest memory preallocation is in progress during the Storage vMotion migration. This results in an invalid memory access and the ESXi 6.5 host failure.
  • Modification of the IOPS limit of virtual disks with Changed Block Tracking (CBT) enabled fails with errors in the log files: To define the storage I/O scheduling policy for a virtual machine, you can configure the I/O throughput for each virtual machine disk by modifying the IOPS limit. When you edit the IOPS limit and CBT is enabled for the virtual machine, the operation fails with the error The scheduling parameter change failed. Due to this problem, the scheduling policies of the virtual machine cannot be altered. The error message appears in the vSphere Recent Tasks pane. You can see the following errors in the /var/log/vmkernel.log file: 2016-11-30T21:01:56.788Z cpu0:136101)VSCSI: 273: handle 8194(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000
    2016-11-30T21:01:56.788Z cpu0:136101)ScsiSched: 2760: Invalid Bandwidth Cap Configuration
    2016-11-30T21:01:56.788Z cpu0:136101)WARNING: VSCSI: 337: handle 8194(vscsi0:0):Failed to invert policy
  • When you hot-add an existing or new virtual disk to a CBT (Changed Block Tracking) enabled virtual machine (VM) residing on a VVOL datastore, the guest operating system might stop responding: The guest OS stops responding until the hot-add process completes. The duration of the unresponsiveness depends on the size of the virtual disk being added. The VM automatically recovers once the hot-add completes.
  • When you use vSphere Storage vMotion, the UUID of a virtual disk might change: When you use vSphere Storage vMotion on vSphere Virtual Volumes storage, the UUID of a virtual disk might change. The UUID identifies the virtual disk, and a changed UUID makes the virtual disk appear as a new and different disk. The UUID is also visible to the guest OS and might cause drives to be misidentified.
  • An ESXi host might become unresponsive if the VMFS-6 volume has no space for the journal: When opening a VMFS-6 volume, it allocates a journal block. Upon successful allocation, a background thread is started. If there is no space on the volume for the journal, it is opened in read-only mode and no background thread is initiated. Any attempt to close the volume results in attempts to wake up a nonexistent thread. This results in the ESXi host failure.
  • SSD congestion might cause multiple virtual machines to become unresponsive: Depending on the workload and the number of virtual machines, diskgroups on the host might go into permanent device loss (PDL) state. This causes the diskgroups to not admit further IOs, rendering them unusable until manual intervention is performed.
  • Unable to collect a vm-support bundle from an ESXi 6.5 host: When generating logs in ESXi 6.5 by using the vSphere Web Client, the select specific logs to export text box is blank. The options (network, storage, fault tolerance, hardware etc.) are blank as well. This issue occurs because the rhttpproxy port for /cgi-bin has a value different from 8303.
  • vSphere Storage vMotion might fail with an error message if it takes more than 5 minutes: The destination virtual machine of the vSphere Storage vMotion is incorrectly stopped by a periodic configuration validation for the virtual machine. A vSphere Storage vMotion that takes more than 5 minutes fails with the message The source detected that the destination failed to resume.
    The VMkernel log from the ESXi host contains the message D: Migration cleanup initiated, the VMX has exited unexpectedly. Check the VMX log for more details.

vSAN:

  • Hosts in a vSAN cluster have high congestion which leads to host disconnects: When vSAN components with invalid metadata are encountered while an ESXi host is booting, a leak of reference counts to SSD blocks can occur. If these components are removed by policy change, disk decommission, or other method, the leaked reference counts cause the next I/O to the SSD block to get stuck. The log files can build up, which causes high congestion and host disconnects.
  • vSAN cluster becomes partitioned after the member hosts and vCenter Server reboot: If the hosts in a unicast vSAN cluster and the vCenter Server are rebooted at the same time, the cluster might become partitioned. The vCenter Server does not properly handle unstable vpxd property updates during a simultaneous reboot of hosts and vCenter Server.
  • Large File System overhead reported by the vSAN capacity monitor: When deduplication and compression are enabled on a vSAN cluster, the Used Capacity Breakdown (Monitor > vSAN > Capacity) incorrectly displays the percentage of storage capacity used for file system overhead. This number does not reflect the actual capacity being used for file system activities. The display needs to correctly reflect the File System overhead for a vSAN cluster with deduplication and compression enabled.

It’s also worth reading through the Known Issues section, as there is a fair bit to be aware of in Update 1, including issues that remain from the GA release.

Happy upgrading!

References:

https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-esxi-651-release-notes.html

https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-vcenter-server-651-release-notes.html

https://blogs.vmware.com/vsphere/2017/07/second-vsphere-client-html5-update-in-vsphere-6-5u1.html

https://blogs.vmware.com/virtualblocks/2017/07/27/introducing-hci-powered-by-vsan-6-6-1/

ESXI 6.5 Storage Performance Issues Resolved in Update 1

I originally came across the issue of slow storage performance with the native vmw_ahci driver that comes bundled with ESXi 6.5 just as I was first playing with my SuperMicro SYS-5028D-TN4T in my homelab. After publishing a couple of posts about the workaround shortly afterwards, the issue became quite prevalent in the community and the post continues to get decent traffic, meaning the issue impacted quite a few people out there.

The good news is that with the release of vSphere 6.5 Update 1 there is a fix for the problem in the form of updated drivers for the AHCI module. William Lam has been quick to blog about the fix and if you had previously disabled the driver you will need to re-enable it.
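If you used the esxcli module disable workaround, re-enabling the driver is a one-liner followed by a host reboot (run via SSH or the ESXi shell); this is a sketch of the step rather than the exact commands from the linked post:

```bash
esxcli system module set --enabled=true --module=vmw_ahci
```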

This VMware KB covers the specific patch as listed in the release notes.

No confirmation yet as to whether it actually does the trick, but the release notes look promising, and the assumption is that it will resolve the issues so that homelabbers and people using the driver in production systems can rest easy.

References:

https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-esxi-651-release-notes.html

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2149910

http://www.virtuallyghetto.com/2017/07/ahci-vmw_ahci-performance-issue-resolved-in-esxi-6-5-update-1.html
