Leap Second Bug: Worth a Double Check...

I vividly remember, back in 2008, the impact that leap years, leap days and leap seconds can have on systems that are not prepared to handle changes in time or date. It was the 29th of February and at the time I was working for a Service Provider offering Hosted Exchange services based on Exchange 2007. All of a sudden my provisioning scripts stopped working and we could not add, remove or modify Exchange Mailboxes. After a day of frustration working with MS Support and dreading a full system rebuild, the problem seemed to disappear the following day…the 1st of March. In the end, after a couple of days of Microsoft scratching their heads, the Exchange Engineering team realised that they hadn’t allowed for the leap day somewhere deep in the bowels of their code, which resulted in all account modifications failing during the 24 hours of the leap day.

Fast forward seven years and the Earth’s rotation continues to slow, and we have a situation where system administrators and operations teams need to be aware of another out of the norm event that could affect systems and platforms. This time it’s due to a leap second (http://www.cl.cam.ac.uk/~mgk25/time/leap/) adjustment which is scheduled for the 30th of June 2015 at 23:59:60 UTC, and it may cause issues for devices and operating systems that are NTP synchronised. Older Linux kernels seem to be the most affected by the leap second, with most vendors releasing KB articles on the leap second impact and how to work around it.

While this is not something that will bring down the internet, it’s still something that all infrastructure IT professionals should be aware of, and they should be double checking all systems to ensure there are no embarrassing time related incidents come the 30th of June.
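To get a feel for why a timestamp like 23:59:60 trips software up, here is a minimal Python sketch. It is only an illustration and is not taken from any of the vendor KBs; the function names are made up for this example. It shows that two parts of the same standard library disagree on whether second 60 is even a legal value.

```python
import time
from datetime import datetime

# The leap second instant discussed above, written as a plain string.
LEAP_STAMP = "2015-06-30 23:59:60"


def parse_with_time_module(stamp):
    """time.strptime documents %S as covering 0-61, so second 60 parses."""
    return time.strptime(stamp, "%Y-%m-%d %H:%M:%S")


def build_datetime_with_second_60():
    """datetime only allows seconds 0-59, so second 60 raises ValueError."""
    try:
        return datetime(2015, 6, 30, 23, 59, 60)
    except ValueError:
        return None  # the leap second simply cannot be represented


if __name__ == "__main__":
    print("time.strptime tm_sec:", parse_with_time_module(LEAP_STAMP).tm_sec)
    print("datetime result:", build_datetime_with_second_60())
```

Any code path that hands a leap second value from one representation to the other has to decide what to do with it, which is the kind of edge case that tends to go untested.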

ESXi and Other VMware Products:

As per this KB, ESXi is not impacted by the leap second bug…but other appliance based solutions (mostly SUSE based) look to require the Slew Mode for NTP workaround (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2121016).

ESX/ESXi utilizes the RFC-1589 clock model, appropriately handling leap seconds. It is not necessary to enable Slew Mode for NTP in ESX/ESXi’s NTP client, or to otherwise work around leap seconds by disabling and re-enabling the NTP client before and after the leap second’s occurrence. For more information, see Enabling Slew Mode for NTP (2121016). However, while ESX/ESXi server is not expected to experience negative impact from a leap second taking place, it remains possible for Guest Operating Systems and/or running applications to experience an impact, independent of ESX/ESXi, if it is not designed to handle one. VMware recommends customers to test their complete solutions.
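For anyone weighing up the Slew Mode workaround on guests or appliances, the rough arithmetic below may help. It is a sketch under one stated assumption that does not come from the KB: that the clock is slewed at no more than 500 ppm (0.5 ms per second), which is the limit described in the ntpd documentation. The function name is hypothetical.

```python
# Assumption (from ntpd documentation, not from the VMware KB):
# the clock can be slewed by at most 500 parts per million.
MAX_SLEW_PPM = 500


def slew_duration_seconds(offset_seconds, slew_ppm=MAX_SLEW_PPM):
    """Return how long it takes to absorb an offset by slewing.

    Slewing gradually speeds up or slows down the clock instead of
    stepping it, so a 1 second offset is spread over a long interval.
    """
    return offset_seconds * 1_000_000 / slew_ppm


if __name__ == "__main__":
    seconds = slew_duration_seconds(1)
    print(f"Absorbing 1 s takes about {seconds:.0f} s ({seconds / 60:.1f} minutes)")
```

Under that assumption the single leap second is absorbed over roughly half an hour, during which the guest clock is slightly off from UTC. Whether that is acceptable depends on how sensitive the application is to small offsets.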

This KB lists all the affected platforms and the suggested fixes for them. For vCloud SPs running (http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2121201)… as most Cells run off Red Hat Enterprise there should be no impact. However, it’s worth double checking, as time skew is the number one enemy of vCloud Director IaaS platforms.

Service Providers:

While most Cloud providers don’t manage client Operating Systems directly it would be a good move to put out some form of advisory so that clients protect their VMs before the leap second hits…if not there could be a lot of angry service desk calls relating to increased and unexplained CPU usage, application slowdowns, application crashes, and failures on startup.

UPDATE: vCNS Appliance Warning:

Thanks to @egrigson for pointing out that hidden away in the above KBs was another KB specifically referencing the vCNS appliances (http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2120837).

Details

When a leap second is added (for example, on June 30, 2015 at 23:59:60 UTC), the VMware vCloud Networking and Security (vCNS) Manager/App/Edge/Data Security appliances may become non-responsive. This issue can arise only at the time of the addition of the leap second. The issue occurs due to errors in leap second handling in the vCNS appliance kernels.

Symptoms of this non-responsive state include the following errors when attempting to connect to an affected vCNS appliance:

API/SSH(TCP) connection not possible

CLI input/output not possible

All versions of the vCNS appliances (vCNS Manager/App/Edge/Data Security) are affected. VMware NSX for vSphere 6.x installations may be configured to operate in a backward-compatibility mode that includes VCNS appliances. The vCNS appliances in such installations may also be affected by this issue.

Solution

If the non-responsive condition lasts longer than 30 minutes, restart the affected appliance.

This condition may render Edge Appliances unmanageable from the NSX vCenter Web Client or vCloud Director, requiring a reboot of the edge. This doesn’t affect NSX ESGs as mentioned above…only legacy vCNS Edges.
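The "wait 30 minutes, then restart" guidance can be turned into a simple watch script. The following is a minimal sketch, not a VMware tool: it assumes that a failed TCP connection to the appliance's SSH port is a good enough sign of the non-responsive state, and the host names, port and function names are all hypothetical. It only reports which appliances are overdue for a restart; the restart itself is left to the operator.

```python
import socket
import time

# From the KB: restart if the appliance is non-responsive for over 30 minutes.
RESTART_THRESHOLD_SECONDS = 30 * 60


def tcp_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds (port 22 is an assumption)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_once(hosts, first_failure, probe=tcp_reachable, now=None):
    """Probe each host once and return those unreachable for longer than the threshold.

    first_failure is a dict kept between calls, mapping host -> time the
    current outage was first seen.
    """
    now = time.monotonic() if now is None else now
    overdue = []
    for host in hosts:
        if probe(host):
            first_failure.pop(host, None)  # recovered, forget the outage
            continue
        started = first_failure.setdefault(host, now)
        if now - started > RESTART_THRESHOLD_SECONDS:
            overdue.append(host)
    return overdue


if __name__ == "__main__":
    edges = ["edge-01.example.local", "edge-02.example.local"]  # hypothetical names
    state = {}
    while True:
        for host in check_once(edges, state):
            print(f"{host} unreachable for over 30 minutes, consider a restart")
        time.sleep(60)
```

Keeping the outage state in a dictionary means an appliance that recovers on its own inside the 30 minute window is never flagged.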

For all those with vShield Edge Deployments, you might want to prepare for appliance reboots and/or give clients warning of possible disruption of services. I know I’ll be getting out an advisory first thing tomorrow morning for our clients.

1 Comment

egrigson, 29 June 2015
Good article, although VMware KB2120837 (http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2120837), referenced in your links above, implies that vCNS may fail for up to 30 mins, and worst case need a restart. Not a great workaround!
Anthony Spiteri, 29 June 2015
  Yeah...that one is kind of nasty...might need to point this out to more people...doesn't look like there is a way around it