Tag Archives: Deployment

Quick Fix: Deploying Multiple Ubuntu 18.04 VMs From Template with DHCP Results in Same IP Allocation

In the continuing work I’ve been doing with Terraform, I’ve come across a number of gotchas when working with VM templates and deploying them en masse. The nature of the work means I’m creating and destroying VMs often. Generally speaking I like using static IP addresses, but for the project I’m working on I needed an option to deploy and configure the networking with DHCP. Windows and CentOS gave me no issues, however when I went to deploy the Ubuntu 18.04 template I started getting errors on the plan execution.

When I looked at the output of the Terraform plan where I export the VM IP addresses, the JSON output showed that all the cloned VMs had been assigned the same IP address.

At first I assumed it was due to ESXi assigning the same MAC address to the cloned VMs, which would result in the machines being allocated the same IP, however when I checked, the MAC addresses were all different.

What is Machine-ID:

After some digging online I came across a change in behaviour where Ubuntu uses the machine-id when requesting DHCP addresses. Ubuntu Server’s default networking goes through cloud-init, which by default sends /etc/machine-id in the DHCP request as the client identifier. Because every VM cloned from the template carries the same machine-id, the DHCP server treats them all as the same client and hands out the same lease, which leads to the duplicate IP situation.

The /etc/machine-id file contains the unique machine ID of the local system that is set during installation or boot. The machine ID is a single newline-terminated, hexadecimal, 32-character, lowercase ID. When decoded from hexadecimal, this corresponds to a 16-byte/128-bit value. This ID may not be all zeros.

The machine ID is usually generated from a random source during system installation or first boot and stays constant for all subsequent boots. Optionally, for stateless systems, it is generated during runtime during early boot if necessary.
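A quick way to confirm this is the cause is to compare the ID across a couple of the clones:

# Run on each cloned VM; identical output across the clones means they are all
# presenting the same identity when requesting a DHCP lease.
cat /etc/machine-id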

Quick Fix:

From a template perspective there is a quick fix: blank out the machine-id file so that a new ID is generated on first boot. You can’t just delete the machine-id file, as it needs to exist; if it doesn’t, the deployment will fail because the system expects it to be there in some form.

The simplest way I achieved this was by zeroing out the file.
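Something along these lines, run inside the template VM before converting it back to a template, does the job (the /var/lib/dbus step is only needed on images where that file is a separate copy rather than a symlink to /etc/machine-id):

# Blank the machine-id so a fresh one is generated on first boot of each clone.
sudo truncate -s 0 /etc/machine-id

# Only if /var/lib/dbus/machine-id is a separate copy rather than a symlink:
# point it back at /etc/machine-id so the two stay in sync.
sudo rm -f /var/lib/dbus/machine-id
sudo ln -s /etc/machine-id /var/lib/dbus/machine-id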

Once done, the VM can be saved again as a template and the cloning operation will result in unique IPs being handed out by the DHCP server.

References:

http://manpages.ubuntu.com/manpages/bionic/man5/machine-id.5.html

https://www.freedesktop.org/software/systemd/man/machine-id.html

 

Deploying a Kubernetes Sandbox on VMware with Terraform

Terraform from HashiCorp has been a revelation for me since I started using it in anger last year to deploy VeeamPN into AWS. From there it has allowed me to automate lab Veeam deployments, configure VMware Cloud on AWS SDDC networking and configure NSX vCloud Director Edges. The time saved by utilising the power of Terraform for repeatable deployment of infrastructure is huge.

When it came time for me to play around with Kubernetes to get up to speed with what was happening under the covers, I found a lot of online resources on how to install and configure a Kubernetes cluster on vSphere with a Master/Node deployment. I also found that while I was tinkering I would break deployments, which meant I had to start from scratch and reinstall. This is where Terraform came into play: I set about creating a repeatable Terraform plan to deploy the required infrastructure onto vSphere and then have Terraform remotely execute the installation of Kubernetes once the VMs had been deployed.

I’m not the first to do a Kubernetes deployment on vSphere with Terraform, but I wanted something simple and repeatable to allow quick initial deployment. The example I had found uses KubeSpray along with Ansible and other dependencies. What I have ended up with is a self-contained Terraform plan that can deploy a Kubernetes sandbox with a Master plus a dynamic number of Nodes onto vSphere using CentOS as the base OS.

What I haven’t automated is the final step of joining the Nodes to the cluster; that step takes a couple of seconds once everything else is deployed. I also haven’t integrated this with VMware Cloud Volumes or prepped for persistent volumes. Again, the idea here is to have a sandbox deployed within minutes to start tinkering with. For those who are new to Kubernetes, it will help you get to the meat and gravy a lot quicker.

The Plan:

The GitHub project is located here (the full URL is in the References below). Feel free to clone/fork it.

In a nutshell, I am utilising the Terraform vSphere Provider to deploy a VM from a preconfigured CentOS template, which ends up being the Kubernetes Master. All the variables are defined in the terraform.tfvars file and no other configuration needs to happen outside of this file. Key variables are fed into the other .tf declarations to deploy the Master and the Nodes, as well as to configure the Kubernetes cluster IP networking.
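For anyone following along, the end-to-end workflow is nothing more than cloning the repo, editing terraform.tfvars and running the usual Terraform commands; a minimal sketch (the repo path comes from the project link in the References):

# Clone the project and move into the Kubernetes deployment folder.
git clone https://github.com/anthonyspiteri/terraform.git
cd terraform/deploy_kubernetes_CentOS

# Edit terraform.tfvars with your vCenter, template, network and node details,
# then initialise the vSphere provider and deploy.
terraform init
terraform plan
terraform apply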

[Update] – It seems as though Kubernetes 1.16.0 was released over the past couple of days. This resulted in the scripts not installing the Master correctly due to an API issue when configuring the POD networking. Because of that I’ve updated the code to use a variable that specifies the Kubernetes version being installed. This can be found on Line 30 of terraform.tfvars. The default is 1.15.3.

The main items to consider when entering your own variables for the vSphere environment are Line 18 and Lines 28-31. Line 18 defines the Kubernetes POD network used during the configuration, while Lines 28-31 set the number of Nodes, the starting name for the VMs, and two separate variables used to build out the IP addresses of the Nodes. Pay attention to the format of the network on Line 30 and then choose the starting IP for the Nodes on Line 31. This is used as the starting IP for the Node addresses and is enumerated in the code using the Terraform count construct.

By using Terraform’s remote-exec provisioner, I am then using a combination of uploaded scripts and direct command line executions to configure and prep the Guest OS for the installation of Docker and Kubernetes.

You can see towards the end that I have split up the command line scripts to ensure the dynamic nature of the deployment is achieved. The remote-exec on Line 82 pulls in the POD network variable and executes it inline. The same is done for Lines 116-121, which configure the Guest OS hosts file to ensure name resolution. These are used together with two other scripts that are uploaded and executed.

The scripts have been built up from a number of online sources that go through how to install and configure Kubernetes manually. For the networking, I went with Weave Net after having a few issues with Flannel. There are lots of other networking options for Kubernetes… this is worth a read.
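To give a sense of what those scripts end up doing on the Master, the core steps boil down to something like the following. This is a sketch based on the standard kubeadm and Weave Net install procedure rather than a line-for-line copy of the repo’s scripts, and the POD network CIDR is a placeholder for the value fed in from terraform.tfvars:

# Initialise the Kubernetes Master with the POD network from terraform.tfvars.
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Make kubectl usable for the logged-in user.
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Apply Weave Net as the POD network provider.
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"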

For better DNS resolution on the Guest OS VMs, the hosts file entries are constructed from the IP address settings set in the terraform.tfvars file.
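On each VM the generated entries end up looking something like this (hostnames and addresses are illustrative only; the real values come out of terraform.tfvars):

# Appended to /etc/hosts on the Master and each Node so they can resolve each other.
echo "10.0.0.140 k8s-master" | sudo tee -a /etc/hosts
echo "10.0.0.141 k8s-node-1" | sudo tee -a /etc/hosts
echo "10.0.0.142 k8s-node-2" | sudo tee -a /etc/hosts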

Plan Execution:

The Nodes can be deployed dynamically using a Terraform var option when applying the plan. This allows for zero to as many nodes as you want for the sandbox… though three seems to be a nice round number.

The number of Nodes can also be set in the terraform.tfvars file on Line 28. The variable set during the apply takes precedence over the one declared in the tfvars file. One of the great things about Terraform is that we can alter the variable either way, which will end up with Nodes being added or removed automatically.
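At the command line that looks something like the example below. The variable name is assumed here for illustration; check variables.tf in the repo for the actual name.

# Deploy the sandbox with three worker Nodes, overriding the value in terraform.tfvars.
terraform apply -var 'node_count=3'

# Re-running the apply with a different value adds or removes Nodes accordingly.
terraform apply -var 'node_count=5'

# Tear the whole sandbox down when you are done tinkering.
terraform destroy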

Once applied, the plan will work through the declaration files and the output will be similar to what is shown below. You can see that in just over 5 minutes we have deployed one Master and three Nodes ready for further config.

The next step is to use the kubeadm join command on the Nodes. For those paying attention, the complete join command was output via the Terraform apply. Once applied on all Nodes you should have a ready-to-go Kubernetes cluster running on CentOS on top of vSphere.
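For reference, it is the standard kubeadm join run on each Node; the Master address, token and hash below are placeholders for the values printed in the Terraform output:

# Run on each Node to join it to the cluster (values are placeholders).
sudo kubeadm join 10.0.0.140:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>

# Back on the Master, confirm all Nodes have registered and gone Ready.
kubectl get nodes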

Conclusion:

While I do believe that the future of Kubernetes is such that a lot of the initial installation and configuration will be taken out of our hands and delivered to us via services based in Public Clouds, or through platforms such as VMware’s Project Pacific, having a way to deploy a Kubernetes cluster locally on vSphere is a great way to get to know what goes into making a containerisation platform tick.

Build it, break it, destroy it and then repeat… that is the beauty of Terraform!

References:

https://github.com/anthonyspiteri/terraform/tree/master/deploy_kubernetes_CentOS

 

NSX Bytes: Controller Deployment Gone Bad?

With NSX becoming more and more widely available, there are more NSX home labs being stood up, and with that the chances of the NSX Controllers failing due to “Home Lab” nested issues increase. The NSX Controllers are Ubuntu Linux VMs and, like any Linux VM, are fairly sensitive to storage latency and other issues that appear in #NestedESXi or lab environments.

In one of my labs I came across an issue where I needed to redeploy all the NSX Controllers because the VMs had effectively broken after the storage was ripped out from under them. However, when I went to redeploy, the latency of the underlying nested storage was still not that great and the deployment got stuck in a loop as shown below.

No matter what I tried… vCenter restart, NSX Manager reboot or host reboot… the end result was the status remaining in the spinning state. If I tried to deploy another controller I would get the following error:

Controller IP address allocation failed for reason : cluster already contains controller of IP x.x.x.x

In my case the VM existed with the IP address configured against it, however I could not access the CLI to check the NSX cluster status because the VM was in a pretty bad way.

Taking a look at the IP Pool allocations… even though the error said that the IP was in use, it wasn’t listed as such, meaning the deployment was trying to use the first IP in the pool regardless.

Before going into the fix, it should be noted that if this scenario were to happen and you were down to your last controller in production, you would be best served calling VMware Support and working through the restore options. Without any controllers your VXLAN unicast traffic isn’t going to be updated via the VTEPs and things will eventually grind to a halt. It’s also worth reading the VMware Docs on what to do if even one controller is lost in a cluster. If this is a lab scenario… we can be a little harsher!

While the controller status is spinning in a Deploying state you can’t interact with it via the Web Client, so you need to turn to the API to delete the NSX Controller and start again or deploy a new cluster set. First you will need the CONTROLLER-ID, which can easily be seen via the Web Client. To remove the controller you call the API below using the DELETE method. If the stuck controller is the last one in the cluster you need to add the ?forceRemoval=True option at the end of the call.
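As a sketch of what that call looks like with curl (hostname, credentials and controller ID are placeholders, and it is worth double checking the endpoint against the API guide for your NSX version):

# Remove the stuck controller; controller-2 is a placeholder for your CONTROLLER-ID.
curl -k -u 'admin:password' -X DELETE \
    'https://nsx-manager/api/2.0/vdn/controller/controller-2'

# If it is the last controller in the cluster, force the removal.
curl -k -u 'admin:password' -X DELETE \
    'https://nsx-manager/api/2.0/vdn/controller/controller-2?forceRemoval=true'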

Once complete you should get a 200 status and a job data ID. If you check back in the Web Client you should see the controller VM being deleted and removed from the list under Controller Nodes. We are now free of the deploying loop and can rebuild or extend the NSX Controller cluster as appropriate.

References:

https://pubs.vmware.com/NSX-62/index.jsp?topic=%2Fcom.vmware.nsx.admin.doc%2FGUID-3A84E9D1-CAC0-41B1-B45C-E032B230DB49.html