This may be another example of my recent comment about SCVMM making harder work of things than it should. For all that, I also want to say that the root cause of this error was very likely a mistake on our part. I'm sharing it in case someone else has a similar problem.
On one of our clusters, I noticed that one or two guests were failing to migrate to a particular host. The host showed no errors, and neither did the guests. Nor could I find any configuration differences between those guests and the other machines that migrated to the suspect host without difficulty.
The only error shown in SCVMM is “Error (10698)”. I didn’t notice any errors in either the cluster or individual host event logs, so this was all I had to go on.
After some investigation I found the problem: The network card drivers on the ‘faulty’ host were a newer version than those on the other members of the cluster.
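A check like this can be scripted once the driver versions have been collected from each host. The following is only a minimal sketch: the host names and version strings are made up for illustration, and how the versions are gathered (for example, from an exported driver inventory per host) is assumed to happen elsewhere. It simply flags any host whose version differs from the most common one in the cluster.

```python
from collections import Counter


def find_driver_outliers(versions_by_host):
    """Return {host: version} for hosts that differ from the most common version."""
    if not versions_by_host:
        return {}
    # Count how many hosts report each version string.
    counts = Counter(versions_by_host.values())
    # Treat the most frequently reported version as the cluster baseline.
    baseline, _ = counts.most_common(1)[0]
    return {
        host: version
        for host, version in versions_by_host.items()
        if version != baseline
    }


if __name__ == "__main__":
    # Hypothetical inventory: names and versions are illustrative only.
    inventory = {
        "HOST01": "1.0.0.1",
        "HOST02": "1.0.0.1",
        "HOST03": "2.0.0.0",
        "HOST04": "1.0.0.1",
    }
    print(find_driver_outliers(inventory))
```

Using the majority version as the baseline is a design choice that suits a cluster where one host has drifted; if several hosts differ, the result only shows who disagrees with the majority, not which version is correct.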
I’m at a loss to explain this. I don’t doubt that we did something wrong to allow this to happen, but I’m not sure what. Automatic updates aren’t enabled on the hosts. I know that the cluster members were built from the same image and had the same patching regime. I also know that the cluster passed verification (both during cluster creation and today while troubleshooting) and showed no errors in the failover cluster manager tool.
I also know that there are only two people who would make this change intentionally. One of them has been away for two weeks, and the other is me, and I haven’t changed drivers on any server lately… But equally, I can’t argue with the evidence in the second screenshot. I can be a little disappointed, however, that the error in the first screenshot was all that SCVMM gave me to go on.
I solved the error easily enough by putting the ‘faulty’ host into maintenance mode and rolling back the network card drivers to the same version as on all the other hosts. With that done, live migration was able to complete between the problem host and the guest machines.