
Understanding the Top 4 Performance Metrics in vSphere

vSphere environments can be complex beasts that require a lot of attention and ongoing tuning to get “just right.”  When it comes to ensuring the ongoing acceptable performance of a vSphere environment, there are hundreds of different performance metrics to choose from.  vSphere administrators must traverse this performance metric minefield to identify the metrics that truly matter and that have the most significant positive impact on the environment.

From a high level perspective, a typical virtual environment has four primary resource areas:

  • Compute.  The compute tier is responsible for executing operations and runs the code that provides an environment for the virtual machines.  The compute layer is made up of servers, each with one or more processors, and each of those processors bearing one or more processing cores.
  • Memory.  Memory is where programs run and is often one of the first resources to be exhausted in a virtual environment.
  • Storage.  When programs aren’t running, they are stored on disk along with the data that is required by said programs.  In addition, in many modern data centers, servers are configured to boot directly from storage, so there are often files related to the hypervisor’s operating system (ESXi) located on storage as well.
  • Network.  The network gives the virtual environment the ability to communicate with the outside world.

Now, suppose you were tasked with identifying the one performance metric for each of these areas that is absolutely critical to ongoing operation.  The sections below describe the primary performance metric in each of the areas mentioned above.  In future articles, I will dive much deeper into each of these areas.

CPU usage

I have to admit that I was torn between whether to list CPU usage or CPU ready as the primary metric in the compute resource space.  CPU usage provides an administrator with an at-a-glance look at how hard the virtual environment is being pushed, while the CPU ready metric is a measure of how long a virtual machine has to wait for enough physical processor resources to be freed up to service its needs.  As this article is focused on overarching performance metrics, I decided to focus on the most obvious performance-related metric.
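For reference, vSphere’s real-time charts report CPU ready as a summation value in milliseconds accumulated over each 20-second sample interval, so the raw number has to be converted to a percentage to be meaningful.  Below is a minimal Python sketch of that standard conversion; the sample value is purely illustrative.

```python
# Convert a CPU ready summation value (milliseconds accumulated during a
# sample interval) into a percentage. vSphere's real-time charts use a
# 20-second (20,000 ms) sample interval.
def cpu_ready_percent(ready_ms: float, interval_seconds: int = 20) -> float:
    return (ready_ms / (interval_seconds * 1000)) * 100

# Hypothetical example: 1,500 ms of ready time in a 20-second sample
# works out to 7.5%, enough contention to warrant a closer look.
print(cpu_ready_percent(1500))  # 7.5
```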

With that said, what is good and what is bad with regard to CPU usage in a virtual environment?  Many agree that physical CPU utilization that consistently sits above 75% – 80% is the point at which an administrator should begin to consider adding processor resources.  Even though that still leaves 20% – 25% of headroom, bear in mind that many workloads need room to “peak” at certain times, so it’s important to leave a little slack.
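To make that rule of thumb concrete, here is a short Python sketch that flags sustained utilization.  The sample readings and the exact threshold are illustrative assumptions rather than vSphere defaults.

```python
# Flag a host whose average CPU utilization across recent samples stays
# above the threshold, leaving headroom for workloads to peak.
def needs_more_cpu(samples_pct: list[float], threshold: float = 75.0) -> bool:
    return sum(samples_pct) / len(samples_pct) > threshold

# Hypothetical utilization readings (percent) from a single host.
host_cpu_samples = [72.0, 81.5, 78.0, 84.0, 79.5]
if needs_more_cpu(host_cpu_samples):
    print("Sustained CPU usage above 75%; consider adding processor resources.")
```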

Disk latency

Disk latency is the amount of time that it takes for a read or write operation sent to a storage device to actually be processed.  This metric takes into account delays that might be introduced by RAID and other storage configuration decisions.

If you’re seeing average disk latency exceeding 20 milliseconds or so, or if you’re seeing major spikes on an ongoing basis, you may be running into issues that could affect workloads.  For example, I’ve seen disk latency issues break Microsoft clusters.
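As an illustration of that guideline, the brief Python sketch below checks both conditions.  The 20-millisecond average comes from the guideline above, while the spike test (three times the average) is simply an assumption for demonstration purposes.

```python
# Warn when average device latency exceeds the limit, or when individual
# samples spike to several times the running average.
def check_latency(samples_ms: list[float], avg_limit: float = 20.0,
                  spike_factor: float = 3.0) -> list[str]:
    warnings = []
    avg = sum(samples_ms) / len(samples_ms)
    if avg > avg_limit:
        warnings.append(f"Average latency {avg:.1f} ms exceeds {avg_limit} ms")
    spikes = [s for s in samples_ms if s > avg * spike_factor]
    if spikes:
        warnings.append(f"{len(spikes)} latency spike(s), worst {max(spikes):.1f} ms")
    return warnings

# Hypothetical latency samples (milliseconds) from one datastore.
for warning in check_latency([12.0, 18.5, 110.0, 14.0, 22.5]):
    print(warning)
```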

Obviously, every workload has different levels of tolerance, so do a bit of study before jumping to too many conclusions.  Further, when it comes to disk latency, there can be any number of causes, each of which calls for a different resolution.

  • Perhaps you have too few disk spindles serving too many I/Os.  Solutions may include adding additional spindles or moving to solid state disks.  You may also need to investigate the storage head end to make sure it can keep up with the load.
  • Sometimes, adding additional RAM to a virtual machine can be helpful in these cases, but only if adding that RAM has the potential to lead to less I/O hitting storage.
  • If you have antimalware software on your virtual machines, make sure it’s not running at the same time on all of your VMs.  That activity can spike I/O, increasing overall latency.
  • Make sure no virtual machines are swapping to disk.  See the next item for more information.

Memory swapping on host or virtual machines

In many virtual environments, RAM is the first resource to be exhausted.  Memory swapping – the activity that takes place when RAM is depleted and the system turns to disk as overflow memory – has a profoundly negative impact on performance.  If you see even a single virtual machine swapping to disk, you’re going to see the impact.  As such, make sure that you have assigned enough RAM to each of your virtual machines and that the host has enough RAM to serve all of the running VMs.  In a vSphere environment, if you see swapping taking place, your environment is probably massively overcommitted, as swapping to disk is the last of the memory management techniques that vSphere has at its disposal, after transparent page sharing, ballooning, and memory compression.

As I mentioned before, memory swapping can also have an impact on disk I/O and latency since swapping adds additional I/O load to storage.

If you see any swapping going on, it’s too much.
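For administrators who script against vCenter, below is a minimal sketch using the open-source pyVmomi SDK that walks the inventory and flags any virtual machine reporting swapped (or ballooned) memory in its quickStats.  The vCenter hostname and credentials are placeholders, and connection options can vary between pyVmomi versions.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Lab-only shortcut: skip certificate verification. Verify certs in production.
context = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com",          # placeholder hostname
                  user="administrator@vsphere.local",  # placeholder account
                  pwd="password",
                  sslContext=context)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        stats = vm.summary.quickStats  # values reported in MB
        if stats.swappedMemory or stats.balloonedMemory:
            print(f"{vm.name}: {stats.swappedMemory} MB swapped, "
                  f"{stats.balloonedMemory} MB ballooned")
    view.DestroyView()
finally:
    Disconnect(si)
```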

Network utilization

As is the case with CPU, the most obvious networking performance metric is also the most important.  If you see constant network link utilization above 80% or so, it’s time to add more network capacity.  This can be accomplished by moving to 10 GbE or by bonding additional network adapters to the existing system.
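To make the math explicit, here is a quick Python sketch that converts observed throughput into a percentage of link speed.  The adapter names, link speeds, and throughput figures are illustrative assumptions.

```python
# Express observed throughput as a percentage of the link's capacity.
def link_utilization_pct(throughput_mbps: float, link_speed_mbps: float) -> float:
    return (throughput_mbps / link_speed_mbps) * 100

# Hypothetical adapters: (observed throughput, link speed), both in Mbps.
links = {"vmnic0": (850.0, 1000.0), "vmnic1": (3200.0, 10000.0)}
for name, (observed, capacity) in links.items():
    pct = link_utilization_pct(observed, capacity)
    if pct > 80.0:
        print(f"{name}: {pct:.0f}% utilized; time to add network capacity")
```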

Summary

It’s important to note that the information presented in this article is just one way to look at the virtual environment.  Every environment is different, so make sure you understand how your workloads operate and use appropriate metrics.  In future articles, I’ll dig deeper into metrics and show you various ways by which you can monitor the health of your vSphere environment.