Category Archives: Quick Fix

vCloud Director 9.0: Manual Quick Fix for VXLAN Network Pool Error

vCloud Director 9.0, released last week, has a bunch of new enhancements, and a lot of those are focused around its integration with NSX. Tom Fojta has a what’s new page on the go with a lot of the new features explained. One of his first posts just after the GA was around the new ability to manually create VXLAN backed Network Pools.

The VXLAN Network Pool is the recommended one to use as it scales the best. Until version 9, vCloud Director would automatically create a new VXLAN Network Pool for each Provider VDC, backed by an NSX Transport Zone (again, created automatically) scoped to the clusters that belong to that particular Provider VDC. This would create multiple VXLAN network pools and potential confusion about which one to use for a particular Org VDC.

In vCloud Director 9.0 we now have the option of creating a VXLAN backed network pool manually instead of one being created at the time of setting up a Provider vDC. In many of my environments, for one reason or another, the automatic creation of the VXLAN network pool together with NSX would fail. In fact my current NestedESXi SliemaLabs vCD instance shows the following error:

There is a similar but less serious error that can be fixed by changing the replication mode from within the NSX Web Client, as detailed here by Luca; however, like my lab, I’ve known a few people to run into the more serious error shown above. You can’t delete the pool, and a repair operation will continue to error out. Now in vCD 9.0 we can create a new VXLAN Network Pool from the Transport Zones created in NSX.

Once that’s been done you will have a newly created VXLAN Network Pool that’s truly global, tied to best practice for NSX Transport Zones, and usable with the desired replication mode. The old one will remain, but you can now configure Org vDCs to consume the VXLAN backed network pool over the traditional VLAN backed pool.

References:

vCloud Director 9: What’s New

vCloud Director 9: Create VXLAN Network Pool

Quick Fix: OVF package with compressed disks is currently not supported

A couple of weeks ago I ran into an issue stopping me from importing an OVA, and today I came across another issue relating to the Web Client not being able to import OVF packages with compressed disks.

There seem to be a lot of issues to do with OVF/A operations in vSphere 6.5 Update 1…in fact there are 187 mentions of OVF and 95 mentions of OVA in the release notes. Searching through the release notes I found a specific entry relating to the issue I came across and its workaround.

Deploying an OVF template containing compressed file references might fail
When you deploy an OVF template containing compressed file references (typically compressed using gzip), the operation fails.

The following is an example of an OVF element in the OVF descriptor:
<References>
<File ovf:size="458" ovf:href="valid_disk.vmdk.gz" ovf:compression="gzip" ovf:id="file1"></File>
</References>

The workaround is to download OVF Tool and run a simple command to convert the OVF or OVA template to one without the compressed files…which in effect is just a copy of the original.
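
As a rough sketch, assuming hypothetical file names, the conversion is a single OVF Tool invocation:

ovftool C:\Temp\compressed-template.ova C:\Temp\uncompressed-template.ovf

OVF Tool doesn’t compress output disks unless told to, so the resulting template is effectively a straight copy of the original minus the gzip compression.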

Seems like a strange fix but it works!

References:

https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-esxi-vcenter-server-65-release-notes.html

Quick Fix – Unable to Upgrade Distributed Switch After vCenter Upgrade

This week I upgraded (and migrated) my SliemaLabs NestedESXi vCenter from a Windows 6.0 server to a 6.5 VCSA…everything went well, but I ran into an issue when I went to upgrade my distributed switch to 6.5.0. Even though everything appeared to be working with regards to the host and VM networking associated with the switch, when I went to upgrade it I got the following error:

A quick Google for Unable to retrieve data about the distributed switch came up with nothing, and clicking on Next didn’t do anything actionable. A restart of the Web Client and a reboot of the VCSA didn’t resolve the issue either. The distributed switch in question was still on version 5.5, as I forgot to upgrade it to 6.0 during the upgrade to vCenter 6.0. Whether that condition somehow caused the error I am not sure…regardless, the quick fix, or better said, workaround, is pretty simple: use PowerCLI.

Interestingly the Vendor shown is different…though I’m not sure this caused the issue. In any case the workaround is to upgrade the distributed switch using the Set-VDSwitch command.
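
As a minimal sketch of what that looks like (the vCenter and switch names here are from my lab and hypothetical, so adjust to suit):

Connect-VIServer vcsa.sliemalabs.local
Get-VDSwitch -Name "dvSwitch-Nested" | Set-VDSwitch -Version "6.5.0"

The -Version parameter accepts the usual distributed switch version strings such as "6.0.0" and "6.5.0".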

And success!

I’m not sure what caused the error to appear in the Web Client, but the workaround meant that it became a moot point. Suffice to say, if you come across this error in your Web Client when trying to upgrade a distributed switch…head over to PowerCLI.


Quick Fix: VCSA MAC Address Conflict Migration Fail + Fix

The old changing of the MAC address causing NIC/IP issues has reared its ugly head again…this time during a planned migration of one of our VCSAs from one vCenter to another. I ran into an issue where the source VM, running on an ESXi 5.5 Update 3a host, had a MAC address that wasn’t compatible with the destination host running ESXi 6.0.

Somewhere along the line during the creation of this VM (and others in this particular cluster) the assigned MAC address conflicted with the reserved MAC ranges from VMware. There is a workaround to this as mentioned in this post, but it was too late for the VCSA, and upon reboot I could see that it couldn’t find eth0 and had created a new eth1 interface that didn’t have any IP config. The result was that all vCenter services struggled to start and eventually timed out, rendering the appliance pretty much dead in the water.

To fix the issue, firstly you need to note down the current MAC address assigned to the VM.

There is an additional eth interface picked up by the appliance that needs to be removed, and an adjustment made to the initial eth0 config…After boot, wait for the services to time out (this can take 20-30 minutes) and then ALT-F1 into the console. Log in using the root account and enable the shell.

  • cd /etc/udev/rules.d/
  • Modify 70-persistent-net.rules and change the MAC address to the value recorded for eth0 (see the example below).
  • Comment out or remove the line corresponding to the eth1 interface.
  • Save and close the file and reboot the appliance.
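
For illustration, here’s roughly what the edited file might look like (the MAC addresses below are hypothetical placeholders for the values in your environment):

# /etc/udev/rules.d/70-persistent-net.rules
# eth0 entry, updated with the MAC address recorded from the VM settings
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:50:56:aa:bb:cc", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
# stale eth1 entry commented out so it no longer claims an interface
# SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:50:56:dd:ee:ff", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"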

All things being equal you should have a working VCSA again.

References:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2012451

Quick Post: Web Client vs VI Client Permissions with VCSA

I’ve been using the VCSA for a couple of years now, since the release of vSphere 5.5, and have been happily using the upgraded 6.0 version in a couple of my environments. As with most people, I found the adjustment going from the VI Client to the new Web Client to be a little rough, and I still find myself going between the two while performing different tasks and configuration actions.

I caught this tweet from Luis Ayuso overnight, in which he was asking if I had found the answer to a tweet I had put out almost a year ago, meaning my old tweet had come up as the top Google hit.

After Luis’s issues I decided to put together a very quick post outlining, in a basic way, what needs to be configured for like-for-like access in both the Web Client and the VI Client. In this scenario I have a single-VM deployment of the 6.0 VCSA with a simple install of the Platform Services Controller, an SSO domain configured, and the VCSA connected and configured to a local Active Directory.

Let’s start by logging in with a user that’s got no permissions set but is a member of the AD domain. As you can see the Web Client will allow the user to log in but show an empty inventory…the VI Client gives you a “You Shall Not Pass!” response.

I then added the user to the AD Group that had been granted Administrator permissions in the VI Client at the top level.

These match what you see from the Web Client.

Logging back into the VI Client, the user now has full admin rights.

However, if you log into the Web Client you still get the empty inventory message. To get the user the same access in the Web Client as in the VI Client, you need to log into the Web Client using the SSO Admin account, head to Administration -> Users and Groups -> Groups, and select the Administrators group in the main window. Under Group Members, search the AD domain for the user account or group and add it to the membership.

Now when you log into the Web Client with the user account you should see the full inventory and have admin access to perform tasks on vCenter Objects.

This may not be the 100% best-practice way to achieve the goal, but it works, and you should consider permission structures for vCenter relative to your requirements.

Quick Fix: ESX 4.1 Host Stops Responding When iSCSI LUN is “pulled”

REMOVING DEAD PATHS IN ESX 4.1 (version 5 guidance here)

A very quick post in relation to a slightly sticky situation I found myself in this afternoon. I was decommissioning a service that was linked to a VM with a number of VMDKs, one of which was located on a dedicated VMFS datastore…the guest OS also had a directly connected iSCSI LUN.

I chose to delete the LUNs first and then move up the stack, removing the VMFS and eventually the VM. So I simply went to the SAN and deleted the disk and disk group resource straight up! (hence the “pulled” reference in the title) Little did I know that ESX would have a small fit when I attempted to do any sort of reconfiguration or management on the VM. The first sign of trouble was when I attempted to restart the VM and noticed that the task in vCenter wasn’t progressing. At that point my Nagios/OpsView service checks against the ESX host began to time out and I lost connectivity to the host in the vCenter console.

Restarting the ESX management agents wasn’t helping, and as this was very much a production host with production VMs on it, my first thought (an older way of thinking) of rebooting it wasn’t acceptable during core business/SLA hours. As knowledge and confidence build with experience in and around ESX, I’ve come to use ESX(i) shell access more and more…so I jumped into SSH and had a look at what the vmkernel logs were saying.

So from the logs it was obvious the system was having major issues (re)connecting to the device I had just pulled out from under it. On the other hosts in the cluster the datastore was greyed out and I was unable to delete it from the storage config. A re-scan of the HBAs removed the dead datastore from the storage list, so if I still had vCenter access to this host a simple re-scan should have sorted things out. Moving to the command line of the host in question, I ran the esxcfg-rescan command:
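
For reference, the commands look something like this (the adapter name varies per host; vmhba33 here is just a placeholder for the software iSCSI adapter on mine):

# rescan the iSCSI adapter for device and path changes
esxcfg-rescan vmhba33
# in a second SSH session, follow the VMkernel log
tail -f /var/log/vmkernel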

And at the same time, while tailing the vmkernel logs, I saw the following entries:

From tailing through those logs, the rescan basically detected that the path in question was in use (bound to a datastore where a VMDK was attached to a VM), reporting the “Device is in use by Worlds” error. The errors also highlight dead paths due to me removing the LUN while it was in use.

The point at which the host went into a spin (as seen by the Could not select Path for device entries in the vmkernel log) was when I attempted to switch on the VM and the host (still thinking it had access to the VMDK) tried to access all its disks.

So, lesson learnt. When decommissioning VMFS datastores, don’t pull the LUN out from under ESX…remove it gracefully first from vSphere, and then you are free to delete it on the SAN.
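
For what a graceful removal looks like, on ESXi 5.x and later the sequence is roughly the following (the datastore label and device identifier below are hypothetical); on 4.1 the equivalent is to remove the VMDKs, delete the datastore in vSphere, and rescan before touching the SAN:

# unmount the VMFS datastore (must have no running VMs or attached VMDKs)
esxcli storage filesystem unmount -l Datastore01
# detach the backing device so the host stops probing its paths
esxcli storage core device set --state=off -d naa.60012345000000000000000000000001
# once the LUN is unpresented on the SAN, rescan all adapters
esxcli storage core adapter rescan --all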


Load Balancer Internal IPs Appearing in IIS/Apache Logs: Quick Fix

If you are NAT’ing public to private addresses with a load balancer in between your web server and your gateway/firewall device, you might come across a situation where the IIS/Apache logs report the IP of the load balancer when what you really want is the client IP.

The biggest issue with this is that any log parsing/analytics you do against the site will all be relative to the IP of the load balancer. All useful client and geographical information is lost.

Most load balancers get around this by inserting a header into the request that carries the client IP. In most cases that I have seen, on both Junipers and NetScalers, the header is set to rlnclientipaddr.

What needs to be done at the web server configuration level is to pick up on the header info so the correct client IP can be translated into the log files. There are obviously different ways to achieve this in Apache compared to IIS, and Apache has a much simpler solution.

Apache:

In your apache.conf, go to the LogFormat section and modify the default format so that the %h token (which logs the connecting host, i.e. the load balancer) is replaced with the header inserted by the load balancer, then restart the Apache service.
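
Assuming the stock combined format, the before and after look like this:

# before: %h logs the connecting host, which is the load balancer
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
# after: log the client IP carried in the header inserted by the load balancer
LogFormat "%{rlnclientipaddr}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined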


IIS:

The IIS 5/6/7/8 solution is a little more involved, but still just as efficient and not overly complicated at the end of the day…in fact, for me the hardest part was actually chasing up the DLLs linked below. It must be noted that while this has worked perfectly for me against both a Juniper DX and a NetScaler VPX load balancer, I would suggest testing the solution before putting it into production, the reason being that the ISAPI filters are specifically sourced for the Juniper DX series, though in my testing I found that they worked for the NetScalers as well. Sourcing the x64 DLLs was a mission, so I am saving you a great deal of time by providing the files below.

rllog-ISAPI

Download and extract those files into your Windows root. Go to Features View -> ISAPI Filters and click Add. Enter the name and executable location and click OK. Note that it’s handy to add both the 32 and 64 bit versions to a 64-bit IIS web server, just in case you are dealing with legacy applications that are required to run in 32-bit mode. Add the ISAPI filter at the root config of the web server so it propagates down to all existing sites and any newly created sites.
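
If you would rather script the registration (IIS 7 and later), appcmd can do the same thing; the filter name and DLL path below are hypothetical and should match wherever you placed the files:

%windir%\system32\inetsrv\appcmd.exe set config -section:system.webServer/isapiFilters /+"[name='rllog-x64',path='C:\Windows\rllog64.dll']" /commit:apphost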


SharePoint 2010 Web UI Timeout Creating Web Application: Quick Fix

Had a really interesting issue with a large SharePoint farm instance we host…over the last couple of days, when we tried to create a new Web Application, the task was failing on the SharePoint farm members. While initially being thrown off by a couple of permission-related event log entries for SharePoint admin database access, there was no clear indication of the problem or why it started happening after weeks of no issues.

The symptoms were that from the Central Admin web site -> Application Management -> Manage Web Applications page, creating a new Web Application would eventually return what looked like an HTTP timeout error. Looking at the Central Admin page on both servers, it showed the Web Application as being present and created, and the WSS file system was in place on both servers…however, the IIS application pool and website were only created on the server that ran the initial New Web Application task. What’s better is that there were no event log or SharePoint log entries recording the issue or its cause.


In an attempt to see a little more verbose logging during the New Web Application process, I ran the New-SPWebApplication PowerShell command below:

New-SPWebApplication -Name "www.site.com.au443" -Port 443 -HostHeader "www.site.com.au" -URL "https://www.site.com.au" -ApplicationPool "www.site.com.au443" -ApplicationPoolAccount (Get-SPManagedAccount "DOMAIN\spAppPoolAcc") -DatabaseServer MSSQL-01 -DatabaseName WSS_Content_Site -SecureSocketsLayer:$true -Verbose

While the output wasn’t as verbose as I had expected, to my surprise the Web Application was created and functional on both servers in the farm. After a little time together with Microsoft Support (focusing on permissions as the root cause for most of the time), we modified the Shutdown Time Limit setting under the Advanced Settings of the SharePoint Central Admin application pool:


The original value is set to 90 seconds by default. We raised this to 300 and tested the New Web Application function from the Web UI, which this time was able to complete successfully. While it does make logical sense that an HTTP timeout was happening, the SharePoint farm wasn’t overly busy or under high resource load at the time, yet still wasn’t able to complete the request in 90 seconds.
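
If you prefer to script the change, appcmd can set the same value; the application pool name below is the SharePoint 2010 default, so adjust it if yours differs:

%windir%\system32\inetsrv\appcmd.exe set apppool "SharePoint Central Administration v4" /processModel.shutdownTimeLimit:"00:05:00"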

One to modify for all future/existing deployments.