Modius Data Center Blog

Visualize Data Center Site Performance

Posted by Jay Hartley, PhD on Wed, Jul 06, 2011 @ 07:19 PM

There has been plenty of discussion of PUE and related efficiency/effectiveness metrics of late (Modius PUE Blog posts: 1, 2, 3). How to measure them, where to measure, when to measure, and how to indicate which variation was utilized. Improved efficiency can reduce both energy costs and the environmental impact of a data center. Both are excellent goals, but it seems to me that the most common driver for improving efficiency is a capacity problem. Efficiency initiatives are often started, or certainly accelerated, when a facility is approaching its power and/or cooling limits, and the organization is facing a capital expenditure to expand capacity.

When managing a multi-site enterprise, understanding the interaction between capacity and efficiency becomes even more important. Which sites are operating most efficiently? Which sites are nearing capacity? Which sites are candidates for decommissioning, efficiency efforts, or capital expansion?

For now, I will gracefully skip past the thorny questions about efficiency metrics that are comparable across sites. Let’s postulate for a moment that a reasonable solution has been achieved. How do I take advantage of it and utilize it to make management decisions?

Consider looking at your enterprise sites on a “bubble chart,” as in Figure 1. A bubble chart enables visualization of three numeric parameters in a single plot. In this case, the X axis shows utilized capacity. The Y axis shows PUE. The size of each bubble reflects the total IT power load.

Before going into the gory details of the metrics being plotted, just consider in general what this plot tells us about the sites. We can see immediately that three sites are above 80% capacity. Of the three, the Fargo site is clearly the largest, and is operating the most inefficiently. That would be the clear choice for initiating an efficiency program, ahead of even the less-efficient sites at Chicago and Orlando, which are not yet pushing their capacity limits. One might also consider shifting some of the IT load, if possible, to a site with lower PUE and lower utilized capacity, such as Detroit.
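The triage logic just described (flag sites pushing their capacity limit, then rank the flagged sites by inefficiency and size) can be sketched in a few lines of Python. The site figures below are invented for illustration; they are not the data behind Figure 1:

```python
# Illustrative site data: utilized capacity (fraction), PUE, and IT load (kW).
# These figures are invented for the sketch, not taken from Figure 1.
sites = {
    "Fargo":   {"capacity": 0.85, "pue": 2.4, "it_kw": 900},
    "Chicago": {"capacity": 0.60, "pue": 2.6, "it_kw": 400},
    "Orlando": {"capacity": 0.55, "pue": 2.5, "it_kw": 350},
    "Detroit": {"capacity": 0.45, "pue": 1.7, "it_kw": 500},
}

# Flag sites above 80% utilized capacity, then rank the flagged sites by
# PUE (least efficient first) and by IT load (largest first).
candidates = [name for name, s in sites.items() if s["capacity"] > 0.80]
candidates.sort(key=lambda n: (sites[n]["pue"], sites[n]["it_kw"]), reverse=True)
print(candidates)  # ['Fargo']
```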

Figure 1: Data center site performance bubble chart (efficiency vs. capacity).

In this example, I could have chosen to plot DCiE (Data Center Infrastructure Efficiency) vs. available capacity, rather than the complementary metrics PUE vs. utilized capacity. This simply changes the “bad” quadrant from upper right to lower left; the choice is largely a matter of personal preference.

Efficiency is also generally well-bounded as a numeric parameter, between 0 and 100, while PUE can become arbitrarily large. (Yes, I’m ignoring the theoretical possibility of nominal PUE less than 1 with local renewable generation. Which is more likely in the near future, a solar data center with a DCiE of 200% or a start-up site with a PUE of 20?) Nonetheless, PUE appears to be the metric of choice these days, and it works great for this purpose.
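For reference, the two metrics are simple reciprocals of one another, which is easy to verify in code (the function names here are just illustrative):

```python
def dcie_from_pue(pue: float) -> float:
    """DCiE (%) = 100 * IT power / total power = 100 / PUE."""
    return 100.0 / pue

def pue_from_dcie(dcie_pct: float) -> float:
    """PUE = total power / IT power = 100 / DCiE (%)."""
    return 100.0 / dcie_pct

print(dcie_from_pue(2.0))   # 50.0: a PUE of 2.0 is a DCiE of 50%
print(pue_from_dcie(20.0))  # 5.0: a DCiE of 20% is a PUE of 5
```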

Whenever presenting capacity as a single number for a given site, one should always present the most-constrained resource. When efficiency is measured by PUE or a similar power-related metric, capacity should express either the utilized power or the utilized cooling capacity, whichever is greater. In a system with redundancy, be sure to take that into account.
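A minimal sketch of that rule, assuming a simple N+1 design in which one unit's worth of capacity is held in reserve (the function name and the redundancy model are simplifying assumptions, not a standard):

```python
def utilized_capacity(power_used_kw, power_cap_kw,
                      cooling_used_kw, cooling_cap_kw,
                      redundancy="N+1", n_units=4):
    """Utilized capacity of the most-constrained resource.

    In an N+1 design, one unit's worth of capacity is held in reserve,
    so only (n_units - 1) / n_units of the nameplate total is usable.
    This is a deliberately simplified model of redundancy.
    """
    usable = (n_units - 1) / n_units if redundancy == "N+1" else 1.0
    power_util = power_used_kw / (power_cap_kw * usable)
    cooling_util = cooling_used_kw / (cooling_cap_kw * usable)
    return max(power_util, cooling_util)  # report the tighter constraint

# A site drawing 600 kW against 1,000 kW of N+1 power, and 500 kW of
# cooling against 900 kW: power is the binding constraint at 80%.
print(utilized_capacity(600, 1000, 500, 900))  # 0.8
```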

The size of the bubble can, of course, also be modified to reflect total power, power cost, carbon footprint, or whatever other metric is helpful in evaluating the importance of each site and the impact of changes.

This visualization isn’t limited to comparing across sites. Rooms or zones within a large data center could also be compared, using a variant of the “partial” PUE (pPUE) metrics suggested by the Green Grid. It can also be used to track and understand the evolution of a single site, as shown in Figure 2.

This plot shows an idealized data-center evolution as would be presented on the site-performance bubble chart. New sites begin with a small IT load, low utilized capacity, and a high PUE. As the data center grows, efficiency improves, but eventually it reaches a limit of some kind. Initiating efficiency efforts will regain capacity, moving the bubble down and left. This leaves room for continued growth, hopefully in concert with continuous efficiency improvements.

Finally, when efficiency efforts are no longer providing benefit, capital expenditure is required to add capacity, pushing the bubble back to the left.

Those of you who took Astronomy 101 might view Figure 2 as almost a Hertzsprung-Russell diagram for data centers!

Whether tracking the evolution of a single data center, or evaluating the status of all data centers across the enterprise, the Data Center Performance bubble chart can help you understand and manage the interplay between efficiency and capacity.


Topics: Capacity, PUE, data center capacity, data center management, data center operations, DCIM

Illuminating DCIM tools: Asset Management vs. Real-time Monitoring

Posted by Donald Klein on Wed, Dec 15, 2010 @ 11:26 AM

In the news recently, there has been a lot of discussion around a new category of software tools focused on unified facilities and IT management in the data center. These tools have been labeled by Gartner as Data Center Infrastructure Management (DCIM), of which Modius OpenData is a leading example (according to Gartner).

In reality, there are multiple types of tools in this category - Asset Management systems and Real-time Monitoring systems like Modius. The easiest way to understand the differences is to reflect on two key questions:

  • How do the tools get their data?
  • How time-critical is the data?

Generally speaking, data center Asset Management systems, like nlyte, Vista, Asset-Point, Alphapoint, etc., all rely on 3rd-party sources, which either hand-enter IT device 'face plate' specs or feed in collected data for post-process integration.

The data processing part is what these systems do very effectively, in that they can build a virtual model of the data center and can often predict what will happen to the model based on equipment 'move, add or change' (MAC). These products are also strong at utilizing that model to build capacity plans for physical infrastructure, specifically power, cooling, space, ports, and weight. 

To ensure that the data used is as reliable as possible, the higher-priced systems contain full work-flow and ticketing engines. The theory is that by putting repeatable processes in place and adhering to them, each MAC will be entered correctly in the system. To this day, I have not seen a single deployed system that is 100% accurate. But for the purposes they are designed for (capacity and change management), these systems work quite well.

However, these systems are typically not used for real-time alarm processing and notification, because they are 1) not real-time, and 2) not always accurate.

Modius takes a different approach. As compared with Asset Management tools, Modius gets its data DIRECTLY from the source (i.e., the device) by communicating in its native protocol (like Modbus, BACnet, and SNMP), rather than relying on theoretical 'face plate' data from 3rd-party sources. The frequency of data collection can vary from 1 poll per minute, to 4 polls per minute (standard), all the way down to ½-second intervals. This data is then collected, correlated, alarmed, and stored, and it can be reported over minutes, hours, days, weeks, months, or years. The main outputs of this data are twofold:

  • Centralized alarm management across all categories of equipment (power, cooling, environmental sensors, IT devices, etc.)
  • Correlated performance measurement and reporting across various categories (e.g. rack, row, zone, site, business unit, etc.)

Modius has pioneered real-time, multi-protocol data collection because the system has to be accurate 100% of the time. Any issue in data center infrastructure performance could lead to a failure affecting the entire infrastructure. This data is also essential for optimizing the infrastructure in order to lower cooling costs, increase capacity, and better manage equipment.

Both types of tools -- Asset Management tools and Real-time Monitoring systems -- offer high value to data center operators through different capabilities. The Asset tools are great for planning, documenting, and determining the impacts of changes in the data center. Modius real-time monitoring interrogates the critical infrastructure to make sure systems are operating correctly, within environmental tolerances, and with established redundancies intact. They are complementary tools for maintaining optimal data center performance.

Because of this inherent synergy, Modius actively integrates with as many Asset Management tools as possible, and supports a robust web services interface for bi-directional data integration. To find out more, please feel free to contact Modius directly at info@modius.com.

Topics: Data-Collection-and-Analysis, data center capacity, data center operations, real-time metrics, Data-Collection-Processing, data center infrastructure, IT Asset Management

Measuring Available Redundant Capacity (ARC) in the Data Center

Posted by Jay Hartley, PhD on Fri, Dec 18, 2009 @ 07:00 AM

One of the key power usage metrics that I often find our customers requesting is  Available Redundant Capacity (ARC). This metric can mean different things to different people, but in simple terms, we at Modius like to define it as the amount of IT load that can be added to a data center system as a whole without sacrificing redundancy.

When viewed from the rack, row, room, or building level (or even across a network of data centers at the enterprise level), ARC provides a simple way to answer the question: “Where can I safely add new IT equipment without overloading and potentially bringing down my facility?”

Typically, most data centers don’t calculate ARC. Instead, operators set a simple alarm threshold on the Actual Load of each device. For example, if the power load reaches 50% on a device (or, more often, 40% when de-rating), then the device or the monitoring system will throw an alarm.

However, this simple approach to thresholding based on device power usage doesn’t effectively capture all the conditions of the broader power distribution system. There can be hidden capacity that allows for safe failover, even though simple device-level thresholding suggests otherwise.

The goal of system ARC is to identify where you can handle additional load without sacrificing system redundancy. For a device in a dual-feed configuration, the power ARC is simply:

ARC = {Device Capacity}/2 – {Actual Load}

In most cases, the Device Capacity will be de-rated to allow for some margin. In the case of power capacity, it is common to de-rate apparent power (kVA) capacity to 80% of nameplate. ARC can also be expressed in real power (kW) if you know or can estimate the power factor of the load. It is even more important to de-rate the capacity in the case of kW measurements, to allow for potential load problems that could degrade the power factor.
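Putting the formula and the de-rating together, a minimal sketch in Python (the function name and the default 80% de-rating follow the discussion above; they are illustrative, not a standard API):

```python
def device_arc_kva(capacity_kva, actual_load_kva, derate=0.8):
    """ARC of one device in a dual-feed pair:
    (de-rated capacity) / 2 - actual load.
    The default de-rates the nameplate kVA to 80%."""
    return (capacity_kva * derate) / 2.0 - actual_load_kva

# A 100 kVA UPS de-rated to 80 kVA, carrying 30 kVA, can safely
# accept 10 kVA more load without sacrificing redundancy.
print(device_arc_kva(100.0, 30.0))  # 10.0
```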

Below is an ARC-based dashboard in action:

Here, the top panel shows how ARC has been calculated for 6 different data centers, along with a measure of cooling overhead. The lower panel shows the drill down for one of the sites.

When calculating the overall ARC for devices in parallel, you can add the ARCs of the individual units. For instance:

  • UPS A has 10 kVA of ARC
  • UPS B has 8 kVA of ARC
  • Together, they have 18 kVA of ARC

Interestingly, it is possible to have a safely redundant system even though one of the individual devices has a negative ARC. For example:

  • UPS A has 3 kVA of ARC
  • UPS B has −2 kVA of ARC
  • The net ARC of the system is a small but safely positive 1 kVA

In this case, even though one UPS is nominally overloaded according to the simple one-device threshold, either UPS can fail without dropping any load.

Calculating system ARC from the individual device ARCs in this way assumes that the capacities of both parallel components are the same. This is most often the case, but in the rare instance that it is not, then you have to total the actual load across the devices, and compare it to the (de-rated) capacity of the smaller device. This ensures that the most-limited device can handle the entire load.
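That aggregation rule, including the unequal-capacity case, can be captured in one small function: "smallest de-rated capacity minus total load" reduces exactly to the sum of the per-device ARCs when the capacities match, so a single expression covers both cases (the names and numbers below are illustrative):

```python
def system_arc_kva(devices, derate=0.8):
    """System ARC for parallel devices feeding the same redundant load.

    devices: list of (capacity_kva, actual_load_kva) tuples.
    Smallest de-rated capacity minus total load; when capacities match,
    this equals the sum of the per-device ARCs.
    """
    total_load = sum(load for _, load in devices)
    limiting_capacity = min(cap * derate for cap, _ in devices)
    return limiting_capacity - total_load

# Reproducing the second example above: two 100 kVA units de-rated to
# 80 kVA, loaded at 37 kVA (ARC +3) and 42 kVA (ARC -2). Either unit
# can still carry the full 79 kVA, so the system ARC is 1 kVA.
print(system_arc_kva([(100.0, 37.0), (100.0, 42.0)]))  # 1.0
```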

Some questions may arise when the load is imbalanced, as in the examples above. Such imbalances may arise because some of the load is not configured redundantly. Some loads also do not balance themselves between the two power paths. The ARC calculation doesn’t depend on knowing such details. Of course, any non-redundant load will be dropped if it loses its power source; however, as long as the system ARC is positive you know that any redundant load will be protected regardless of which power source is lost.

In summary, the goal of system ARC is to identify where you can handle additional load without sacrificing system redundancy. With parallel equipment, you can total the ARC of all components if they have the same capacity rating. When looking at ARC along the power chain, the correct system value will be the minimum ARC of any one set of components.

Kind regards,

Jay H. Hartley, PhD
Director of Professional Services
Jay.Hartley@Modius.com

Topics: Data-Center-Best-Practices, data center monitoring, Dr-Jay, data center capacity, data center energy efficiency, Measurements-Metrics, Capacity-Management
