After our recent Exchange 2013 rollout, we noticed a problem with the Exchange 2013 servers (virtual guests on a Hyper-V cluster): they were experiencing clock drift and ultimately bugchecking (a.k.a. blue screening) with 0x000000ef errors.
The crashes and clock drift occurred once every couple of days, and quite apart from the crashes, clock drift is a serious issue on any kind of server these days. While the crashes were disruptive in their own right, the inconvenience of someone waiting a little longer to access their mailbox or receive a message is nothing compared to the problems that could be caused by the timestamp on an email being a day or two out.
With three servers, we were seeing at least one bugcheck per day, always the same error (0x000000ef) and always with clock drift in the event logs.
Digging into the problem:
If you are a professional Windows sysadmin, I encourage you to read the MSDN articles on interpreting bug check codes and crash dump analysis. While the idea of reading a memory dump like this may seem a little daunting, it’s an essential component of supporting complex server environments where problems sometimes happen.
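If you want to try this yourself, a reasonable first pass is to open the memory dump in the command-line kernel debugger (kd.exe, part of the Debugging Tools for Windows) and let it summarise the crash. The paths below are placeholders for illustration only; point them at your own dump file and symbol store:
kd.exe -z C:\Windows\MEMORY.DMP -y "srv*C:\symbols*https://msdl.microsoft.com/download/symbols"
Then, at the kd prompt, the usual starting point is the automated analysis command: !analyze -v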
After some work by both us and Microsoft PSS, we made a couple of discoveries:
- The clock drift seemed to be caused by I/O issues.
- The Microsoft Exchange Health Management Service was failing to start properly (quite a common issue), or starting and then failing, and this was causing wininit.exe to crash, which generated our bugcheck.
Solving the problems:
Please only make either of the changes below if you're actually experiencing the problems we had.
The simplest problem to solve was the I/O issue. We dug into this in case it was a Hyper-V cluster issue, but Microsoft support concluded that the cluster was fine and that the error was caused by a problem with Windows Server 2012. This was resolved by installing hotfix KB 2870270 on all the Exchange server guest VMs and on all the hosts in the cluster.
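If you want to check whether the hotfix is already present on a particular guest or host before rolling it out, a quick standard PowerShell query is enough (the KB number is the one above):
Get-HotFix -Id KB2870270
This returns the installed hotfix entry, or an error if it isn't installed.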
Secondly, the Exchange Health Management service issue. There are two parts to this solution. First, I changed the service's startup type from Automatic to Automatic (Delayed Start). This seemed to resolve the errors with the service struggling to start in the first place.
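If you'd rather script that change than click through the Services console, something along these lines should work from an elevated PowerShell prompt. I'm assuming MSExchangeHM is the short service name on your build, so confirm it with Get-Service first:
Get-Service | Where-Object { $_.DisplayName -like "*Health*" }
sc.exe config MSExchangeHM start= delayed-auto
(The space after "start=" is required by sc.exe.)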
To improve the service's reliability once running, Microsoft PSS made the following change to the system monitoring configuration from the Exchange Management Shell:
Add-GlobalMonitoringOverride -Identity exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 60.00:00:00
(Note that this is a single-line PowerShell command.)
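To check that the override has actually been recorded, you should be able to list the current overrides afterwards with the matching Get- cmdlet (exact output varies by CU):
Get-GlobalMonitoringOverride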
I experienced a similar problem on our Exchange 2013 servers and was given slightly different parameters to try, but still the same cmdlet:
Add-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.712.24"
This problem occurs in multi-domain forests. Here’s the KB article describing the problem and recommending the fix: http://support.microsoft.com/kb/2883203
Thanks @joesuffceren.
I'm having this problem and found this blog article, thanks for the info! So there are two versions of the command we can run. One version sets a duration of 60 days … does that mean that in 60 days I need to run the command again? The other version applies the change to a specific version of Exchange … does that mean that when I update Exchange I need to run the command again with the new version number?
To remove this override, use the following PowerShell:
Remove-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled
@nnellans The thought process behind the version-limited command is that the next version of Exchange will fix the bug, so you won't need to run the command again because the problem will no longer be there. If the problem is still there then, yes, you'd need to run the command again with the new version number.
Or use the timed command, and remove and re-add it every 60 days. That's what I just had to do, and why I thought I'd add the removal command above to make it easy for others to do it too.
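For anyone doing that 60-day refresh, the two commands back to back look like this (just the commands from this page combined, run from the Exchange Management Shell):
Remove-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled
Add-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 60.00:00:00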
I am having the same error, NOT related at all to VMs (physical servers only).
Also, no clock drift and, as far as I know, no problems with the Health service starting.
We have a Dell R900 server with all the latest firmware, BIOS, etc.
About every 24 hours, we get that 0x000000ef bugcheck reboot, CRITICAL_PROCESS_DIED. We are running Windows Server 2012 R2 SP1 (Standard edition) and Exchange 2013 SP1 with CU8 (Enterprise).
We are in an UNSUPPORTED configuration, since we have a SAS 5/E card hooked to an MD3000 (not the "I" model; the regular SAS MD3000). We have all Dell-certified drives. I am not seeing any useful errors, just a couple of ghost errors in the OMSA ESM log during reboot about 2 bad drives. We did have at least one bad drive, and I replaced it, and MDSM showed that it rebuilt fine and everything is 'optimal.' I used "dumpview" or a similar tool and was able to gather at least the following info from the crash:
CRITICAL_PROCESS_DIED, 0x000000ef, Parameter 1: ffffe004`de6b7080, then Params 2-4 are all zeroes. Processors 24 (cores, really), Major version 15, Minor: 9600, Dump size 3,091,662,555. I did save a copy of the dump.
It has nothing to do with load; there is no load and only one test mailbox.
I am on the latest MDSM and have tested multi-pathing; it failed over from one path to the other, so all that is working fine.
Currently, it is a “one-node-DAG” and I point to a non-Exchange server as a file-share-witness, with proper permissions, etc.
There are no useful errors in any of the logs (ESM, event logs, etc.); it's just as if the power is removed. The power supplies seem okay, and the building power is great, conditioned, etc. I have no idea how to get further details from the dump, but I'm happy to send it to someone to look at. I know the server was stable when first installed, but once I introduced the DAG + cluster (the Exchange install adds the clustering components), it started this 24-hour rebooting.
Any help and/or further clues would be greatly appreciated.
Thanks!
Did you ever figure this out? I have the exact same problem: a crash 24 hours after boot, every day for the past two weeks. I just noticed the pattern.
Hi folks,
same as Robosleuth: we have VMware with Windows Server 2012 R2 and Exchange 2013 CU6 and get the same BSOD. Any clues?
Thanks
@Robosleuth
Why in the world would you run a one-node DAG?
I have exactly the same setup as m05lim5h4d0w, with 4x Exchange 2013 CU9 servers in different locations around the globe. Three of them are crashing daily with this bugcheck, while one has an uptime of 162 days and has not been touched by this; it's so strange. We added the GlobalMonitoringOverride 30 days ago, but it has not improved things at all. HELP!