After our recent Exchange 2013 rollout, we noticed a problem with the Exchange 2013 servers (virtual guests on a Hyper-V cluster): they were experiencing clock drift and ultimately bugchecking (a.k.a. blue screening) with 0x000000ef errors.
The crashes and clock drift occurred once every couple of days, and quite apart from the crashes, clock drift is a serious issue on any kind of server these days. While the crashes were disruptive in their own right, the inconvenience of someone waiting a little longer to access their mailbox or receive a message is nothing compared to the problems that could be caused by the timestamp on an email being a day or two out.
With three servers, we were seeing at least one bugcheck per day, always the same error (0x000000ef) and always with clock drift in the event logs.
Digging into the problem:
If you are a professional Windows sysadmin, I encourage you to read the MSDN articles on interpreting bug check codes and crash dump analysis. While the idea of reading a memory dump like this may seem a little daunting, it’s an essential component of supporting complex server environments where problems sometimes happen.
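If you want to try this yourself, a reasonable first pass is to open the memory dump in the command-line kernel debugger (kd.exe, part of the Debugging Tools for Windows) and let it summarise the crash. The paths below are placeholders for illustration only; point them at your own dump file and symbol store:
kd.exe -z C:\Windows\MEMORY.DMP -y "srv*C:\symbols*https://msdl.microsoft.com/download/symbols"
Then, at the kd prompt, the usual starting point is the automated analysis command: !analyze -v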
After some work by both us and Microsoft PSS, we made a couple of discoveries:
- The clock drift seemed to be caused by I/O issues.
- The Microsoft Exchange Health Management Service was failing to start properly (quite a common issue), or starting and then failing, and this was causing wininit.exe to crash, which generated our bugcheck.
Solving the problems:
Please only make either of the changes below if you're actually experiencing the problems we had.
The simplest problem to solve was the I/O issue. We dug into this in case it was a Hyper-V cluster issue, but Microsoft support concluded that the cluster was fine and that the error was caused by a problem with Windows Server 2012. This was resolved by installing hotfix KB 2870270 on all the Exchange server guest VMs and on all the hosts in the cluster.
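If you want to check whether the hotfix is already present on a particular guest or host before rolling it out, a quick standard PowerShell query is enough (the KB number is the one above):
Get-HotFix -Id KB2870270
This returns the installed hotfix entry, or an error if it isn't installed.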
Secondly, the Exchange Health Management service issue. There are two parts to this solution. First, I changed the service's startup type from Automatic to Automatic (Delayed Start). This seemed to resolve the errors with the service struggling to start in the first place.
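If you'd rather script that change than click through the Services console, something along these lines should work from an elevated PowerShell prompt. I'm assuming MSExchangeHM is the short service name on your build, so confirm it with Get-Service first:
Get-Service | Where-Object { $_.DisplayName -like "*Health*" }
sc.exe config MSExchangeHM start= delayed-auto
(The space after "start=" is required by sc.exe.)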
To improve the service's reliability once running, Microsoft PSS made the following change to the system monitoring configuration from the Exchange Management Shell:
Add-GlobalMonitoringOverride -Identity exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 60.00:00:00
(Note that this is a single-line PowerShell command.)
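To check that the override has actually been recorded, you should be able to list the current overrides afterwards with the matching Get- cmdlet (exact output varies by CU):
Get-GlobalMonitoringOverride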
I experienced a similar problem on our Exchange 2013 servers and was given slightly different parameters to try, but still the same cmdlet:
Add-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.712.24"
This problem occurs in multi-domain forests. Here’s the KB article describing the problem and recommending the fix: http://support.microsoft.com/kb/2883203
Thanks @joesuffceren.
I'm having this problem and found this blog article, thanks for the info! So there are two versions of the command we can run. One version sets a duration of 60 days … does that mean that in 60 days I need to run the command again? The other version applies the change to a specific version of Exchange … does that mean that when I update Exchange I need to run the command again with the new version number?
To remove this override, use the following PowerShell:
Remove-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled
@nnellans The thought process behind the version-limited command is that the next version of Exchange will fix the bug, so you won't need to run the command again because the problem will no longer be there. If the problem is still there then, yes, you'd need to run the command again with the new version number.
Or use the timed command, and remove and re-add it every 60 days. That's what I just had to do, and why I thought I'd add the removal command above to make it easy for others to do it too.
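For anyone doing that 60-day refresh, the two commands back to back look like this (just the commands from this page combined, run from the Exchange Management Shell):
Remove-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled
Add-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 60.00:00:00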
I am having the same error, NOT related at all to VMs (physical servers only).
Also, no clock drift and, as far as I know, no problems with the Health service starting.
We have a Dell R900 server with all the latest firmware, BIOS, etc.
About every 24 hours, we get that 0x000000ef bugcheck reboot, CRITICAL_PROCESS_DIED. We are running Windows Server 2012 R2 SP1 (Standard edition) and Exchange 2013 SP1 with CU8 (Enterprise).
We are in an UNSUPPORTED configuration, since we have a SAS 5/E card hooked to an MD3000 (not the "I" model; the regular SAS MD3000). We have all Dell-certified drives. I am not seeing any useful errors, just a couple of ghost errors in the OMSA ESM log during reboot about 2 bad drives. We did have at least one bad drive, and I replaced it, and MDSM showed that it rebuilt fine and everything is 'optimal.' I used "dumpview" or a similar tool and was able to gather at least the following info from the crash:
CRITICAL_PROCESS_DIED, 0x000000ef, Parameter 1: ffffe004`de6b7080, then Params 2-4 are all zeroes. Processors 24 (cores, really), Major version 15, Minor: 9600, Dump size 3,091,662,555. I did save a copy of the dump.
It has nothing to do with load; there is no load and only one test mailbox.
I am on the latest MDSM and have tested multi-pathing; it failed over from one path to the other, so all that is working fine.
Currently, it is a “one-node-DAG” and I point to a non-Exchange server as a file-share-witness, with proper permissions, etc.
There are no useful errors in any of the logs (ESM, event logs, etc.); it's just as if the power is removed. The power supplies seem okay, and the building power is great, conditioned, etc. I have no idea how to get further details from the dump, but I'm happy to send it to someone to look at. I know the server was stable when first installed, but once I introduced the DAG + cluster (the Exchange install adds the clustering components), it started this 24-hour rebooting.
Any help and/or further clues would be greatly appreciated.
Thanks!
Did you ever figure this out? I have the exact same problem: a crash 24 hours after boot, every day for the past two weeks. I just noticed the pattern.
Hi folks,
same as Robosleuth: we have VMware with Windows Server 2012 R2 and Exchange 2013 CU6 and get the same BSOD. Any clues?
Thanks
@Robosleuth
Why in the world would you run a one-node DAG?
I have exactly the same setup as m05lim5h4d0w, with 4x Exchange 2013 CU9 servers in different locations around the globe. Three of them are crashing daily with this bugcheck, while one has an uptime of 162 days and has not been touched by this; it's so strange. We added the GlobalMonitoringOverride 30 days ago, but it has not improved things at all. HELP!