Fundamentals of vSphere Performance Management
Performance monitoring is a critical aspect of vSphere administration. This article introduces the basic concepts and terminology of vSphere performance management, such as performance counters, performance metrics, and real-time versus historical statistics. Much of the content is based on my book VMware VI and vSphere SDK, published by Prentice Hall.
Once you understand these basics, the related tools and APIs should be relatively easy to learn. If you are already familiar with vSphere Client performance monitoring or esxtop, that experience helps as well.
A performance counter is a unit of information that can be collected about a managed entity. The PerfCounterInfo data object, shown in Figure 1, represents a performance counter. Its key property is an integer that uniquely identifies a performance counter, much like a primary key of a table in a SQL database, and nothing more. There is no guarantee that a performance counter keeps a fixed key. In fact, the same performance counter can have different keys in ESX and VirtualCenter. Even for the same type of server, the key could change from version to version. Do not use it outside the context of the server you connect to.
Figure 1 PerfCounterInfo data object
The performance counter can be represented by the following dotted string notation:
[group].[counter].[rollup type]
One sample of such an expression is the following:
disk.usage.average
This is the performance counter for the average usage of a disk.
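As a rough illustration of how client code might work with dotted names instead of raw keys, the Python sketch below splits a name of the assumed form group.counter.rollup and builds a name-to-key lookup per server. The CounterInfo tuple, the helper functions, and the key values 125 and 131 are hypothetical stand-ins invented for this example; they are not part of the vSphere API, and real keys must always be read from the server you are connected to.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class CounterInfo(NamedTuple):
    """Hypothetical, simplified stand-in for a PerfCounterInfo object."""
    key: int     # server-specific integer, not stable across servers
    group: str   # e.g. "disk"
    name: str    # e.g. "usage"
    rollup: str  # e.g. "average"


def dotted_name(counter: CounterInfo) -> str:
    """Format a counter as group.counter.rollup."""
    return f"{counter.group}.{counter.name}.{counter.rollup}"


def parse_dotted_name(text: str) -> Tuple[str, str, str]:
    """Split group.counter.rollup into its three parts."""
    parts = text.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected group.counter.rollup, got {text!r}")
    return parts[0], parts[1], parts[2]


def build_key_lookup(counters: Iterable[CounterInfo]) -> Dict[str, int]:
    """Map dotted names to the keys of ONE server."""
    return {dotted_name(c): c.key for c in counters}


# Made-up keys: the same counter carries a different key on each server.
esx_lookup = build_key_lookup([CounterInfo(125, "disk", "usage", "average")])
vc_lookup = build_key_lookup([CounterInfo(131, "disk", "usage", "average")])
```

Keeping one lookup per connection, and addressing counters by their dotted name everywhere else, avoids carrying a key from one server to another.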
In VI SDK 2.5, there are seven pre-defined groups of performance counters: CPU, ResCpu, Memory, Network, Disk, System, and ClusterServices. Each group has its own set of counters. For example, the system group has uptime, resourceCpuUsage, and heartbeat counters.
Rollup refers to the process of aggregating statistics so that they can be used at a later time. There are six rollup types in total, as defined in the PerfSummaryType enumeration type: average, latest, maximum, minimum, none, and summation. Each rollup type represents a different mathematical aspect of the same performance data. You can choose the rollups based on your interests. For example, if you are developing a charge-back solution, you might be more interested in the summation than in any other type.
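To make the rollup types concrete, here is a minimal Python sketch that applies each of them to one batch of samples. It is an illustration of the arithmetic only, not how the server implements rollups; the function name and the sample values are made up, and the "none" type is left out because it means that no rollup is applied.

```python
def rollup(samples, rollup_type):
    """Aggregate a non-empty list of samples (oldest first) into one value."""
    if not samples:
        raise ValueError("need at least one sample")
    if rollup_type == "average":
        return sum(samples) / len(samples)
    if rollup_type == "latest":
        return samples[-1]          # most recent sample
    if rollup_type == "maximum":
        return max(samples)
    if rollup_type == "minimum":
        return min(samples)
    if rollup_type == "summation":
        return sum(samples)
    raise ValueError(f"unsupported rollup type: {rollup_type!r}")


# Four made-up samples, oldest first.
samples = [10, 30, 20, 40]
results = {t: rollup(samples, t)
           for t in ("average", "latest", "maximum", "minimum", "summation")}
```

The same four samples yield five different numbers, which is why a charge-back tool and a capacity-planning tool may pick different rollups of the same counter.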
Performance counters are not a simple permutation of these three dimensions (group, counter, and rollup type). Some counters do not have all the rollup types. For example, system.uptime has only the summation type and none of the other five rollup types.
Moreover, a performance counter also contains other information, such as the unit, the type of statistics, a description, and the level. The available units are listed in the PerformanceManagerUnit enumeration type. The types of statistics are listed in the PerfStatsType enumeration type. Both enumeration types are included in Figure 1.
The level of a performance counter is an integer valued from 1 to 4, indicating its importance. The lower the level, the more important it is, the more likely it is collected, and the longer it is kept in the VirtualCenter database.
Here is a list of the four levels and the counters each one includes:
- Level 1: includes basic metrics: average usage for CPU, memory, disk, and network; system uptime, system heartbeat, and DRS metrics. It does not include statistics for any device.
- Level 2: includes all counters with rollup types of average, summation, and latest for CPU, memory, disk, and network; system uptime, system heartbeat, and DRS metrics. It does not include statistics for any device either.
- Level 3: includes all metrics (including device metrics) for all counter groups, except those with maximum and minimum rollup types.
- Level 4: includes all metrics supported by VirtualCenter, including those with maximum and minimum rollup types.
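Because each level includes everything in the levels below it, deciding whether a counter is collected comes down to a single comparison. The Python sketch below shows that comparison; the function name and the level assignments in COUNTER_LEVELS are illustrative assumptions, not values read from a real server.

```python
# Illustrative level assignments (hypothetical, not from a real server).
COUNTER_LEVELS = {
    "cpu.usage.average": 1,
    "mem.usage.average": 1,
    "cpu.used.summation": 2,
    "cpu.usage.maximum": 4,
    "cpu.usage.minimum": 4,
}


def counters_collected_at(statistics_level, counter_levels=COUNTER_LEVELS):
    """Return the counters whose level is at or below the given level."""
    if statistics_level not in (1, 2, 3, 4):
        raise ValueError("statistics level must be between 1 and 4")
    return sorted(name for name, level in counter_levels.items()
                  if level <= statistics_level)
```

With these sample assignments, level 1 yields only the two average usage counters, while level 4 yields all five.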
Invoking the queryPerfCounterByLevel() method gets you a list of the performance counters at a specific level. On an ESX server, the level is not set for performance counters; accordingly, queryPerfCounterByLevel() is not supported there.
All performance counters have their own meanings. When used effectively, they can provide good insight into system performance. For example, when CPU used time is approximately equal to ready time, it may signal contention and possible overcommitment due to workload variability. The vSphere API does not help you interpret the performance statistics; check out the VMware technotes on performance for more details.
The performance metric represents the actual information being collected. The counter defines only the type of performance statistic and does not take the target device instances into account. There might be multiple instances of the device for which the same performance counter can be used. Each combination of performance counter and device instance is a performance metric. The relationship between the performance counter and the performance metric is very much like that between a class and an object instance in object-oriented programming.
Let us take a look at a quick example. The cpu.usage.average is a performance counter for average CPU utilization. When the counter is collected on CPU No. 1 of a host, a performance metric is formed.
The performance metric is represented by the PerfMetricId data object, which consists of two parts:
- counterId: The integer that identifies the performance counter.
- instance: The name of the instance, such as "vmnic1" or "vmhba0:0:0".
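The class and object analogy can be sketched in a few lines of Python: one counter combined with each device instance gives one metric. MetricId below is a hypothetical stand-in for PerfMetricId with a counter ID and an instance name, and the key 143 is a made-up value; a real key comes from the server you query.

```python
from typing import List, NamedTuple


class MetricId(NamedTuple):
    """Hypothetical stand-in for PerfMetricId: counter key plus instance name."""
    counter_id: int
    instance: str


def metrics_for(counter_id: int, instances: List[str]) -> List[MetricId]:
    """Pair one counter with every device instance it is collected on."""
    return [MetricId(counter_id, instance) for instance in instances]


NET_USAGE_KEY = 143  # made-up key for illustration only
metrics = metrics_for(NET_USAGE_KEY, ["vmnic0", "vmnic1"])
```

Here one counter yields two distinct metrics, one per network adapter.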
Once it is clear which aspect of a device you want performance data for, you need to decide the interval at which the data is collected and stored. The interval has to be longer than the sampling interval, which is reported as refreshRate in the PerfProviderSummary data object returned by the queryPerfProviderSummary() method and is normally 20 seconds.
Given the constraints of storage, you don't want to keep every sample as collected, especially as the statistics get older. The more recent statistics are normally stored at a finer grain. As the data ages, it is combined into longer intervals.
PerfInterval is the data object that represents a historical interval as shown in Figure 2.
Figure 2 The PerfInterval data object
Historical intervals are identified by an interval ID, the number of seconds for which the performance statistics are calculated. For example, for the 30 minute interval, the interval ID is 1800 (60×30).
Each configured interval has a name, e.g. “PastDay”, “PastMonth”, provided by users. The name does not affect system behavior. The configuration of historical intervals in vCenter specifies the scheme which is used to aggregate performance statistics data in vCenter.
Table 1 lists the default configuration of historical intervals on the VC server, as documented in the VI SDK API reference. The default levels are subject to change, and you can modify them in the VI Client connected to VirtualCenter by clicking Administration -> VirtualCenter Management Server Configuration. Select Statistics from the list on the left, and change the settings in the panel on the right. Right underneath the configuration is the database size section, which shows how much database space the new settings would use.
Table 1 The default historical intervals defined in the VC server
|Name|Sampling period (seconds)|Length (seconds)|Level|Enabled|
|---|---|---|---|---|
|PastDay|300 (5 min)|86400 (1 day)|4|TRUE|
|PastWeek|1800 (30 min)|604800 (1 week)|4|TRUE|
|PastMonth|7200 (2 hours)|2592000 (30 days)|2|TRUE|
|PastYear|86400 (1 day)|31536000 (365 days)|2|TRUE|
Under the default settings, a vCenter server keeps all performance counters (level 4 and below) at a 5-minute interval for the past day and at a 30-minute interval for the past week. After one week, only counters at level 2 or below are stored, at a 2-hour interval for the past month and at a 1-day interval for the past year. All performance data older than one year is removed from the vCenter database.
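The default intervals translate directly into how many data points each one retains per metric: the length divided by the sampling period. The Python sketch below does that arithmetic using the values from Table 1; the function names are made up for this example, and the level check assumes that an interval keeps every counter at or below its configured level.

```python
# Default intervals from Table 1:
# name -> (sampling period in seconds, length in seconds, level)
DEFAULT_INTERVALS = {
    "PastDay": (300, 86400, 4),
    "PastWeek": (1800, 604800, 4),
    "PastMonth": (7200, 2592000, 2),
    "PastYear": (86400, 31536000, 2),
}


def samples_retained(sampling_period, length):
    """Number of data points one metric keeps in an interval."""
    return length // sampling_period


def intervals_keeping(counter_level, intervals=DEFAULT_INTERVALS):
    """Names of the intervals that store a counter of the given level."""
    return [name for name, (_, _, level) in intervals.items()
            if counter_level <= level]


retained = {name: samples_retained(period, length)
            for name, (period, length, _) in DEFAULT_INTERVALS.items()}
# retained == {"PastDay": 288, "PastWeek": 336, "PastMonth": 360, "PastYear": 365}
```

Each interval therefore holds a few hundred points per metric, even though the time spans range from a day to a year.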
In ESX, there is only one historical interval, "PastDay". It is similar to the one in VirtualCenter, except that its length is 129600 seconds (1.5 days) and its level is not set. Since ESX is the source of many performance statistics for VirtualCenter, the longer history helps to guard against performance data loss caused by various issues.
As of SDK 2.5, you should neither create a new performance interval nor delete an existing one. You can change the existing intervals to some extent. The rule is a little complicated. In general, you should avoid changing the intervals as much as possible except the levels.
Real-time Versus Historical Performance Statistics
There are two categories of performance data in the system. The first is the raw performance samples collected at a fast pace, for example one new sample every 20 seconds. The sampling interval can be found using the queryPerfProviderSummary() method and can vary from managed entity to managed entity. You cannot retrieve performance data more frequently than the real-time samples are taken.
Given the fast pace, real-time samples are kept only within a one-hour time window to limit their total number. When a new sample comes in, the oldest one is removed.
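The sliding one-hour window behaves like a fixed-size ring buffer. As a minimal sketch, assuming the typical 20-second sampling interval, the Python below models it with a bounded deque; the variable names and the dummy sample values are invented for the example and say nothing about how the server stores its data.

```python
from collections import deque

REFRESH_RATE = 20                       # seconds between samples (typical value)
WINDOW = 3600                           # one-hour real-time window in seconds
MAX_SAMPLES = WINDOW // REFRESH_RATE    # 180 samples fit in the window

# A bounded deque silently drops the oldest entry once it is full.
realtime = deque(maxlen=MAX_SAMPLES)
for value in range(200):                # 200 dummy samples: 0, 1, ..., 199
    realtime.append(value)

oldest, newest = realtime[0], realtime[-1]
```

After 200 appends, only the most recent 180 samples remain, so the oldest value still held is 20 and the newest is 199.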
The second category is the historical performance statistics, which are generated by processing the real-time samples on a regular basis at the different intervals defined in PerfInterval. On ESX, only 5-minute interval statistics are supported, while on VirtualCenter four different intervals, as listed in Table 1, are pre-configured.
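To give a feel for what this processing involves, the Python sketch below folds 20-second samples into 5-minute values using the average rollup, so that 15 real-time samples become one historical data point. It is an assumption-laden illustration: the function name is made up, only the average rollup is shown, and dropping an incomplete trailing group is a choice made for this example rather than documented server behavior.

```python
def downsample_average(samples, refresh_rate=20, interval=300):
    """Average consecutive real-time samples into fixed-length intervals."""
    if interval % refresh_rate != 0:
        raise ValueError("interval must be a multiple of the refresh rate")
    per_bucket = interval // refresh_rate          # 15 samples per 5 minutes
    buckets = []
    # Step through complete groups only; a partial trailing group is skipped.
    for start in range(0, len(samples) - per_bucket + 1, per_bucket):
        chunk = samples[start:start + per_bucket]
        buckets.append(sum(chunk) / per_bucket)
    return buckets


# Ten minutes of dummy samples (0..29) become two 5-minute averages.
five_minute_values = downsample_average(list(range(30)))
```

For the dummy input, the two results are 7.0 (the mean of 0 to 14) and 22.0 (the mean of 15 to 29).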
Both the historical statistics and the real-time samples can be retrieved using the same interfaces, but with different combinations of arguments. You can find samples that use these APIs in the vSphere Java API's code repository.