In my last post, I talked about why it's worth optimising virtual machine builds for the virtual environment. In this post, I shall talk about how to do just that.
My employer implemented a VMware ESX farm in 2008, using the "enterprise" level of their product, which includes tools like vCenter and vMotion. We imported a large number of physical machines as virtual machines and created a number of new virtual machines to deal with various projects.
We're probably between stage 3 and stage 4 on my list; many of the physical machines we imported into VMware have been "tweaked" for a virtual environment but are not optimised in the way they would be if they were re-created from scratch right now. If your virtualisation implementation includes importing any physical machines at all, then I think it's probably impossible to avoid passing through stage 3.
As we create replacements for imported virtual machines as part of the lifecycle of the OS and application installs they represent, we are both moving to a more virtualisation-friendly selection of guest OSes and applications and learning more about guest optimisation. It's probably fair to say that adding a small amount of memory (8GB) to each host and doing a lot of work on the CPU and memory resources allocated to our guests has given the performance of our current virtual farm a serious boost. Certainly we are very happy with the performance we are seeing on a system that is about halfway through what we plan to be a six-year hardware life-cycle.
General notes on building Virtual Guests
First of all, a note on scaling servers to deal with workload. There are two ways to do this: scaling up or scaling out. Scaling up means expanding your ability to handle higher workloads by putting as much workload onto one system as possible and upgrading that one system to its highest limits; scaling out means creating many systems that each hold a piece of the pie.
In non-virtual systems it has been traditional to "scale up" to at least some degree, to make the best use of the hardware the system is running on and to minimise the number of licences you might need to buy if you're using, say, Windows.
This is why you might find systems that were designed to be file servers which also turn out to be running DNS, DHCP and RADIUS: who wants to buy new server hardware to do those jobs when the file server is sitting there 60% idle?
When you first go virtual it's tempting to continue these practices. After all, it's what you've been used to doing in the past, and it works. Certainly this is likely to be the case with machines you've imported into the virtual environment with a physical-to-virtual converter.
When setting up physical hosts, it can become common practice to "over-allocate" resources: RAM is cheap to buy and it can be difficult to upgrade a server's processor midway through its life-cycle, so it's easier to over-estimate from the start (and besides, who knows what other roles you may find you need to install on this new box too!).
However, this approach might not always be the best method for new virtual machines. If you want to load balance your virtual guest workload over your virtual hosts effectively then it makes sense to do two things.
- Minimise the number of roles each guest performs, to reduce the amount of resources each guest needs.
- Monitor each guest's resource use reasonably tightly (remember you can automate this against a set of alerts – I'm not talking about having an employee sitting staring at a screen here!) to ensure that it's using resources effectively – this might mean reducing or increasing resource allocations to make better use of memory, processor or disk capacity.
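To sketch the kind of automated check I mean – the guest names, metric figures and thresholds below are all made-up examples, and in practice you would pull the real numbers from vCenter or your monitoring system rather than hard-coding them:

```python
# Sketch: flag guests whose measured resource use suggests their
# allocation should change. All figures here are illustrative.

GUESTS = {
    # name: (allocated_mb, avg_used_mb, avg_cpu_pct)
    "dns01":  (4096,  900, 10),
    "file01": (8192, 7600, 35),
    "db01":   (8192, 5000, 92),
}

def review(guests, mem_low=0.4, mem_high=0.9, cpu_high=85):
    """Return alert strings for guests outside the comfort bands."""
    alerts = []
    for name, (alloc, used, cpu) in guests.items():
        ratio = used / alloc
        if ratio < mem_low:
            alerts.append(f"{name}: consider reducing RAM ({used}/{alloc} MB used)")
        elif ratio > mem_high:
            alerts.append(f"{name}: consider adding RAM ({used}/{alloc} MB used)")
        if cpu > cpu_high:
            alerts.append(f"{name}: sustained CPU {cpu}% - investigate or add vCPU")
    return alerts

for line in review(GUESTS):
    print(line)
```

Run on a schedule and fed into whatever alerting you already have, a check like this is all the "tight monitoring" needs to be.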
It does make sense to group some services together (e.g. on an internal Windows LAN it's perfectly common and reasonable to have the AD, DNS and DHCP roles on the same box in all but very large or complex deployments), and this still holds true in a virtualised environment. But for virtualised environments, rather than "scale up", do try to think about "scale out".
Not only does scaling out improve the flexibility of resource allocation (for example, using vMotion/live migration to ensure all your virtual hosts are sharing workload reasonably well), but it also improves availability – if you keep your AD and file server roles separate, then a patch installation or other fault that requires taking one of your AD servers offline won't also affect the availability of shares on the file server.
You can also use more resource-efficient operating systems where the jobs being administered are kept simple (even within the Windows world, we have started using Windows Server 2008 R2 Core for AD, external DNS and the like).
Virtual Machines are not magic
One thing to remember when doing capacity planning is that virtualising a system doesn't fundamentally change its needs. Most people understand this statement by itself but don't grasp its implications for capacity planning of virtual machines.
Common problems I have seen that spring from this include:
- Network throughput issues – if you have two servers on your network that are each capable of saturating a 1Gbit connection, and you virtualise them both onto the same host and connect that host to your LAN at 1Gbit, then you will have performance issues.
- Disk throughput issues – similar to the network issue: if you virtualise a number of servers with high disk I/O onto the same shared SAN, be sure that the connection between the SAN and the virtual hosts can handle the combined I/O requirements. Too many people consider only the amount of disk space they need when buying a SAN for this job and then complain about the poor performance of their virtual machines.
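Both of these checks come down to the same back-of-the-envelope sum: add up the peak throughput of the guests you plan to consolidate and compare it against the shared link. A sketch of that arithmetic, with entirely made-up guest names and figures:

```python
# Sketch: compare combined peak guest I/O against shared host links.
# All guest names and throughput figures are illustrative examples.

guest_peak_net_mbit = {"web01": 600, "backup01": 950, "file01": 700}
guest_peak_disk_mbps = {"web01": 40, "backup01": 180, "file01": 120}

host_net_mbit = 1000   # a single 1Gbit uplink
san_path_mbps = 400    # effective throughput of the host-to-SAN path

def combined(peaks, capacity):
    """Return (total demand, True if it exceeds the shared capacity)."""
    total = sum(peaks.values())
    return total, total > capacity

net_total, net_over = combined(guest_peak_net_mbit, host_net_mbit)
disk_total, disk_over = combined(guest_peak_disk_mbps, san_path_mbps)

print(f"Network: {net_total} Mbit/s vs {host_net_mbit} -> "
      f"{'oversubscribed' if net_over else 'ok'}")
print(f"Disk:    {disk_total} MB/s vs {san_path_mbps} -> "
      f"{'oversubscribed' if disk_over else 'ok'}")
```

Using worst-case peaks is deliberately pessimistic – peaks rarely coincide – but if even the pessimistic sum fits, you know the consolidation is safe.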
Optimising guest memory
It's been common practice with conventional server builds to stuff the system full of as much RAM as possible. In fact it's been something of a well-worn truism that Windows Server should be given as much RAM as possible, almost to the point that some people consider this the default fix for any performance issue with a server.
In fact, while some systems (e.g. Exchange, SQL Server) will use as much memory as you give them, it can actually pay to manage memory quite carefully on a virtual system – rather than just allocating 8GB (or whatever) to everything, really think about what you need. We have some systems working with much less RAM as virtual machines than they would have been given as physical machines, with no noticeable drop in the performance of the service they provide. This saving allows us to squeeze more servers onto our VMware farm, not to mention making more memory available for those systems that really will make proper use of it.
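One simple way to "really think about what you need" is to size from the guest's observed peak working set plus some headroom, rather than from a habitual default. A sketch of that calculation – the headroom, rounding granularity and sample guests are all assumptions for illustration:

```python
# Sketch: size a guest's RAM from its observed peak working set plus a
# headroom margin, rounded up to a sensible granularity. Example figures.

def right_size_mb(peak_used_mb, headroom=0.25, granularity_mb=512):
    """Peak usage plus headroom, rounded UP to the next granularity step."""
    target = peak_used_mb * (1 + headroom)
    return int(-(-target // granularity_mb) * granularity_mb)

# Hypothetical observed peaks (MB) from a monitoring period:
observed = {"dns01": 700, "print01": 1100, "intranet01": 2900}
for name, peak in observed.items():
    print(name, right_size_mb(peak), "MB")
```

A single-role DNS guest that peaks at 700MB ends up with 1GB rather than the 4GB or 8GB its physical predecessor might have been given, and the difference goes back into the pool for the guests that genuinely need it.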
Optimising guest CPU
With CPU resources, as with memory, there's always an assumption that more is better. The more megahertz you have, the faster things will get done (hopefully a myth well debunked, considering how long ago I last ranted about it). The more cores you have, the faster things will get done (only true if the "thing" getting done is multithreaded). But there comes a point where adding more CPU resource (whether clock speed or processor cores) doesn't help, either because the application is already going at full speed or because the CPU is no longer the bottleneck.
This is an issue with conventional capacity planning (there's no point buying a quad-core server for a system that will only ever use or need two cores), and it becomes a bigger issue, with a few more subtle gotchas, on a virtual platform.
Firstly, and most obviously, if a core is allocated to virtual machine A at a given moment then it can't also be allocated to virtual machine B. You can (and indeed should) run plenty of guest machines on the same host, because they can all take turns using the same processors, but processor time is a finite resource, and the more virtual machines you have sharing the same hardware, the smaller the slice of resources available to each. So far, so obvious.
Now this is the subtle part…
As explained in far greater detail in a great blog post here, the problem arises when you allocate a large number of cores to a virtual machine – it can only run when all those physical cores are available, because the guest system expects to have control of those processors even if it has no work to schedule onto them. This means that if you have a virtual host with 8 cores available and create 10 guests with 4 cores allocated to each, you have greatly limited the ability of the virtual host to schedule guest use of the processors effectively (and it's even worse than you think – remember the host itself has to grab some CPU time of its own to process all the system overhead required to keep the virtual environment up and running).
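The arithmetic behind that example is worth spelling out. Under the strict co-scheduling model described above (real hypervisors relax this somewhat, but the cost remains), a 4-vCPU guest can only be dispatched when 4 physical cores are free at once:

```python
# Sketch: the 8-core host / ten 4-vCPU guests example from the text,
# assuming the strict "all vCPUs scheduled together" model.

physical_cores = 8
guest_vcpus = [4] * 10       # ten guests, four vCPUs each

total_vcpus = sum(guest_vcpus)
overcommit = total_vcpus / physical_cores
print(f"{total_vcpus} vCPUs on {physical_cores} cores: "
      f"{overcommit:.1f}:1 overcommit")

# Only this many 4-vCPU guests can occupy the cores at the same moment:
concurrent = physical_cores // guest_vcpus[0]
print(f"At most {concurrent} of the {len(guest_vcpus)} guests "
      f"can run simultaneously")
```

A 5:1 vCPU overcommit isn't necessarily fatal in itself; the killer is that only two of the ten guests can be on the cores at any instant, so the other eight queue even when they each have only a single thread's worth of work to do.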
This doesn't mean you shouldn't create multi-core virtual guests. It does mean that there is a cost to doing so, and you need to take that cost into account when setting up your virtual guests. The large number of virtual guests we imported into our VMware farm from hardware systems had all been sized according to our old "physical" resource planning rules – both on the hardware platforms they used to live on and again on the virtual machines we created while importing them. Going over these again with these new thoughts in mind – dropping the few quad-core guests we had back to dual-core, and dropping quite a few dual-core guests back to single-core – actually gave us an overall performance improvement of about 15% to 20%.