vCOPS 5.8: Critical Data Collection Bug
UPDATE: VMware Global Support supplied me with vCOPs 5.8.0 Hot Fix 01 Build 1537842 which is available via a support request. This is a complete .pak update so you will need to go through the upgrade process as per usual.
The issue of the missing data has been resolved, however I did need to go through and remove a heap of duplicate entity types in the Custom Dashboard. I have a fully functional vCOPs platform now. Hopefully no more bugs in this build!
Like most, I jumped to upgrade vCOPs from 5.7.x to 5.8 when it went GA in mid-December. Initially the upgrade completed without issue and the four registered vCenters appeared to continue collecting data as expected.
# To cut to the guts of the error, scroll to the bottom of the post.
Shortly after the upgrade I upgraded vCenter from 5.0 to 5.1 in one of our sites. After completing the upgrade and having a look at the Custom Dashboards we have set up, I noticed that 90% of the hosts in the recently upgraded vCenter were showing as white boxes (no data).
Looking through the Analytic VM collector logs I found these entries that seemed to point to a connectivity/communication issue between vCOPs and the Hosts.
I then went through a painful couple of support calls with VMware, where they were insistent that the issue was with the vCenter (have you tried turning it off and on?) and/or a disk space issue on the Analytic VM. They then suggested it was a well-known upgrade bug that can be resolved as per below.
The CPU load heat map displays the wrong color
In the custom interface, the CPU load heat map under Heatmaps dashboard > CPU Load widget > Hosts by CPU Usage, displays the color for memory usage instead of CPU usage. This applies to upgrades from vCenter Operations Manager 5.7.1 or 5.7.2 to 5.8. It does not apply to a new installation of vCenter Operations Manager 5.8.
Workaround: To configure the correct color display for the CPU Load Widget:
Select the CPU Load Widget and click the edit widget icon.
From the Edit screen, choose the CPU Usage from the "color by" option and choose Usage (%).
Click the update selected configuration icon (image of a disk).
Click OK to save the configuration.
Off the bat I knew this wasn’t my issue, because immediately after the upgrade I didn’t have the problem. While I waited for support to try and diagnose the issue via vCOPs support log bundles, I assumed that the issue was in the data…possibly a bad row in the database relating to a host/vCenter, so I ran a purge:
admin@localhost:~> vcops-admin purge
This operation will delete all the data not associated with the currently registered adapters.
Are you sure you want to continue ? (yes/no)
The command ran for about 20-30 minutes and returned as successful. When I went back into vCOPs the affected vCenter’s hosts had returned and were actively collecting and reporting data…however I now saw a number of other hosts across multiple vCenters showing the same problem!
I contacted VMware support again and had the case escalated which resulted in the admission that there was a newly discovered (probably through my persistence in regards to the case) bug in 5.8 and that I was experiencing all the symptoms.
Issue in 5.8 where we have the below symptoms:
- One or more ESXi/ESX hosts are no longer present in the vCenter Operations Manager inventory.
- ESXi/ESX hosts are missing from the vCenter Operations Manager inventory.
- Child objects of missing ESXi/ESX hosts such as virtual machines and datastores are present in the vCenter Operations Manager inventory.
- This issue occurs when you place the ESXi/ESX hosts into Maintenance Mode in vCenter Server and then take the hosts out of Maintenance Mode.
Review the Below KB:
So, at the moment there is no resolution or fix and the workarounds are pretty nasty! …basically you will be modding database entries and taking snapshots of the Analytic VM.
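Before touching the database it helps to know exactly which hosts are affected. The symptoms from support boil down to "host missing, children still present", which can be checked with a simple set difference. This is only a minimal sketch: it assumes you have already exported the host list from vCenter and the object list from vCOPs by whatever means you prefer, and the names `InventoryRow` and `find_missing_hosts`, the `kind` labels and the sample data are all made up for illustration and are not part of any VMware tool.

```python
from typing import Iterable, NamedTuple, Optional


class InventoryRow(NamedTuple):
    """One object exported from the vCOPs inventory (hypothetical layout)."""
    name: str
    kind: str              # "host", "vm" or "datastore" (assumed labels)
    parent: Optional[str]  # name of the parent host, if any


def find_missing_hosts(vcenter_hosts: Iterable[str],
                       vcops_rows: Iterable[InventoryRow]) -> dict:
    """Map each host known to vCenter but absent from vCOPs
    to the child objects vCOPs still lists under it."""
    rows = list(vcops_rows)
    vcops_hosts = {r.name for r in rows if r.kind == "host"}
    missing = sorted(set(vcenter_hosts) - vcops_hosts)
    return {
        host: sorted(r.name for r in rows if r.parent == host)
        for host in missing
    }


if __name__ == "__main__":
    # Made-up sample data, not taken from a real environment.
    vcenter_hosts = ["esx01", "esx02", "esx03"]
    vcops_rows = [
        InventoryRow("esx01", "host", None),
        InventoryRow("vm-a", "vm", "esx01"),
        InventoryRow("vm-b", "vm", "esx02"),
        InventoryRow("ds-1", "datastore", "esx02"),
    ]
    for host, children in find_missing_hosts(vcenter_hosts, vcops_rows).items():
        print(host, children)
```

A host that comes back with a non-empty list of children matches the symptoms above: the host itself is gone from the inventory but its virtual machines and datastores are still there.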
I just got off the phone with VMware support in Palo Alto and was told that a hotfix is being worked on and should be available soon. Once it is released to me, I’ll update this post with the final resolution.