Changing Cooling Requirements Leave Many Data Centers at Risk
Provided by: W. Pitt Turner IV, PE and Edward C. Koplin, PE, January 30, 2003
Based on a number of recent mechanical system audits, we find many data centers are misusing the cooling equipment already installed, have more equipment than is necessary to provide redundancy, and waste significant amounts of energy. The most pressing mechanical issue we see facing data centers is a lack of integration between heating, ventilating, and air conditioning (HVAC) components and a related lack of understanding among HVAC professionals about the unique and changing thermal-dynamic requirements of computer hardware environments.
Most data center cooling problems we observe, including outright failures, would have been avoided if the HVAC hardware had been properly integrated into an effective system. You can expect this issue will have even greater impact as the new all air-cooled mainframe hardware incorporating CMOS technology arrives on the raised floor and requires substantially higher densities of fan-side cooling.
Well over half the operating data centers we have visited have the same general set of operational deficiencies. Although collectively these deficiencies can be labeled poor cooling, they actually include inadequate maintenance, the inability to manage relative humidity, and hot spots or uneven temperatures across the raised floor.
Inadequate maintenance is evidenced by a host of indicators. Preventive maintenance regimens are often nonexistent. If there is a program in place, the work tasks are typically not completed as called for, or they are partly or improperly done. Work done by contractors is often shoddy with no quality control or independent verification of results. These statements may seem blasphemous to many facility managers, but they are the hard facts we have observed in multiple sites.
Common examples include:
- Dirty or blocked coils choking airflow
- Undercharged DX systems becoming unstable
- Control points which are installed in the wrong locations and cannot possibly control a system accurately
- Uncalibrated or damaged sensors (40% of the sensing points in one data center were found to be inoperative)
- Reversed supply and return piping
- System capacity unintentionally restricted by partially closed valves
- Solenoid-operated throttling valves which are inoperative or fail frequently due to high system pressures
- Primary/secondary pumping systems which starve either the chillers or the load and result in poor reliability and high operating costs
- Unnecessary pumps running
- Winter free-cooling systems which are not used due to temperature swings while converting to or from the economizer mode
We often see otherwise robust mechanical plants supported from a single electrical panel or feeder. Sometimes there are diverse electrical feeds to the DX units, but all the dry-cooler fans are controlled by a single 20-ampere circuit breaker. And we see systems designed so future components can only be added by shutting down the entire system. These simple yet profound design and operational errors are repeated in old and brand new facilities, as well as in large and small projects all over the country.
Based on Mechanical Systems Diagnostic Reviews (MSDR) conducted at ten sites, we feel the performance results presented in the chart below are fairly typical. (All figures are projected to the hottest day of the summer.) Without the precision portable diagnostic instrumentation brought to the site by our MSDR team, it would have been impossible for site management to identify problems except through failures or cumulative anecdotal observations of unusual plant operation.
In the chart, the left bar for each system component represents 90% of redundant production capacity. That is, one module of capacity can fail and the load on the remaining equipment still does not exceed 90% of the manufacturer's rating. This allows for deterioration of equipment capacity with age and does not push equipment beyond the point where failure frequency rises rapidly.
The middle bar indicates the effective usable capacity the equipment is capable of delivering based on performance measurements. For an optimal plant, the left and middle bars should be identical, as any difference indicates lost or wasted capacity. In this example, effective usable computer room air conditioner (CRAC) capacity was far less than management had assumed. Unit-by-unit analysis identified reversals of supply and return piping among a host of other problems.
The right bar represents the actual measured cooling load. In this example, the measured load was much higher than the client had imagined. Plans for additional computer load could not be achieved without the immediate installation of additional chiller capacity. Also note the lack of margin for CPU mainframe water cooling. This was the result of a piping problem and was another major surprise to the client.
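The relationship among the three bars described above is simple arithmetic, and a short sketch may make it concrete. The function names and the example figures (six 20-ton units, 70 tons measured usable capacity, 65 tons measured load) are hypothetical and are not taken from the audited sites; the 90% derating and the single failed module follow the description of the left bar.

```python
def redundant_capacity(unit_rating, units, spares=1, derate=0.90):
    """Left bar: capacity with `spares` units out of service and the
    remaining units loaded to at most `derate` of their rating."""
    if units <= spares:
        raise ValueError("need more installed units than spares")
    return (units - spares) * unit_rating * derate


def capacity_summary(unit_rating, units, measured_capacity, measured_load):
    """Compare the three bars for one component (all values in the same unit)."""
    design = redundant_capacity(unit_rating, units)
    return {
        "redundant_capacity": design,                 # left bar
        "lost_capacity": design - measured_capacity,  # left bar minus middle bar
        "margin": measured_capacity - measured_load,  # middle bar minus right bar
    }


# Hypothetical component: six 20-ton CRAC units, 70 tons measured usable
# capacity, 65 tons measured load.
summary = capacity_summary(unit_rating=20, units=6,
                           measured_capacity=70, measured_load=65)
for name, tons in summary.items():
    print(f"{name}: {tons:.1f} tons")
```

A positive lost capacity points to equipment that is installed but not delivering, and a small or negative margin means additional load cannot be accepted without new capacity.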
Stable humidity is critical, yet often elusive
For many sites, achieving stable relative humidity (RH) seems to be just as elusive as obtaining an adequate maintenance budget. Many centers experience dramatic RH swings, which can be dangerous for sensitive computer hardware.
The situation is compounded when each of the computer room air handlers (CRAH) or CRAC on the raised floor is equipped with a humidifier. A slight drift in the sensor calibration causes one unit to add humidity, while the adjacent unit is simultaneously trying to dry out the air. This not only fails to provide a stable environment, it pours significant energy down the condensate drain while increasing risk, maintenance, repair, and capital costs.
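How sensor drift sets adjacent units against each other can be illustrated with a toy model. This is only a sketch under assumed values: the 45% RH setpoint, the 3-point deadband, and the drift figures are hypothetical, and real CRAH/CRAC controllers use more elaborate logic.

```python
def unit_mode(true_rh, sensor_drift, setpoint=45.0, deadband=3.0):
    """Return what one unit's humidity control would do, given its sensor error."""
    reading = true_rh + sensor_drift  # what the unit believes the RH to be
    if reading < setpoint - deadband:
        return "humidify"
    if reading > setpoint + deadband:
        return "dehumidify"
    return "idle"


def units_fighting(true_rh, drifts, **control):
    """True if at least one unit humidifies while another dehumidifies."""
    modes = {unit_mode(true_rh, drift, **control) for drift in drifts}
    return {"humidify", "dehumidify"} <= modes


# The room is actually at the 45% RH setpoint, but two adjacent units have
# sensors that drifted in opposite directions.
print(unit_mode(45.0, -4.0))                # humidify
print(unit_mode(45.0, +4.0))                # dehumidify
print(units_fighting(45.0, [-4.0, +4.0]))   # True
print(units_fighting(45.0, [-1.0, +1.0]))   # False
```

In this model the room needs no humidity correction at all, yet one unit adds moisture while its neighbor removes it, which is the wasted energy described above.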
Unfortunately, the humidity problems we found in one retailer's 30,000 sq. ft. data center are not uncommon. After a careful analysis of the entire HVAC system, our recommendation was to disconnect the decentralized humidification units. Stability was achieved by simplifying the control system and relying on the building's centralized humidification capability. The resulting system saved energy and better met the humidity requirement of the computer equipment.
Data center cooling is unique
Data centers are usually low-people-density areas where there is little latent heat rejection. A central system for makeup air should be provided to keep the data center slightly pressurized relative to adjacent spaces. If the central system humidifies the makeup air required for ventilation and pressurization, there is no need to humidify at each CRAH. Investment in multiple humidification units cannot be justified nor can their associated high maintenance and repair costs, energy consumption, and human factor/water leakage risk.
In a data center, the presence of any more than a trace of water in condensate drains is an indication of wasted energy and reduced cooling capacity. Condensate water is produced by dehumidification, which requires that the air be overcooled as it passes through the CRAH cooling coil. Since the air emerges too cold to be used, it must then be reheated. Both steps cost energy and reduce capacity. We often find chilled water temperatures set too low, which shifts cooling coil performance toward dehumidification and lowers the available sensible cooling capacity. Out-of-calibration controls are also a common problem. Raising the chilled water temperature may actually increase usable capacity.
The data center for an East Coast financial firm went from wet cooling coils (continuous dehumidification) to dry coils merely by raising the chilled water temperature and managing the moisture content in the makeup air. Dry coils equate to increased available capacity, energy savings, and simplified maintenance. As a good rule of thumb, if dehumidification is not required, don’t do it.
Hot spots are another common problem. Room temperature is typically used as the primary indication of capacity requirements. Therefore, logic might dictate that if the room is too warm, then more cooling units are required. But, adding capacity works only some of the time and often seriously compounds the temperature problem rather than solving it. Additional capacity necessarily increases airflow, and unless the addition is carefully engineered, the increased air velocity can create a wind tunnel under the raised floor.
Recall the principle that static pressure is required to force air out from under the raised floor plenum. Increased air velocity reduces potential static pressure. In an under floor area, the air can be moving so fast that sufficient static pressure to deliver an adequate volume of cooling air up through the floor may not develop for 30 or 40 feet beyond the point of fan discharge.
The result of excess air velocity is shown above and represents a condition often found in troubled data centers. Close to the CRAH air discharge, there is not enough static pressure to move available cooling air up through the perforated tiles or into the computer hardware cutouts. We have seen many cases where air is actually being sucked from above the raised floor down into the supply plenum under the floor. Hot spots are the inevitable result.
We have also seen air-cooled mainframe installations where a surrounding wall of CRAHs could not maintain proper temperatures. Reducing the amount of air delivered to the under floor plenum may actually solve hot spot problems. Adding more capacity can make existing problems worse.
Symptoms of a high-velocity pressure problem are low air movement through perforated tiles or floor openings and the need for air foils to scoop air from under the floor and divert it into hardware cabinets.
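The trade-off between velocity and static pressure can be sketched with the standard-air relation for velocity pressure, VP = (V / 4005)^2, with V in feet per minute and VP in inches of water gauge. The example below is a simplified single-point picture that ignores friction and turbulence, and the 0.10 in. w.g. total pressure and the three velocities are hypothetical values chosen for illustration.

```python
STANDARD_AIR_CONSTANT = 4005.0  # ft/min per sqrt(in. w.g.), standard air density


def velocity_pressure(velocity_fpm):
    """Velocity pressure in inches of water gauge for standard air."""
    return (velocity_fpm / STANDARD_AIR_CONSTANT) ** 2


def static_pressure(total_pressure_inwg, velocity_fpm):
    """Static pressure remaining once velocity pressure is subtracted."""
    return total_pressure_inwg - velocity_pressure(velocity_fpm)


# Hypothetical plenum with 0.10 in. w.g. total pressure relative to the room.
for fpm in (400, 1200, 1500):
    sp = static_pressure(0.10, fpm)
    direction = "up through the tiles" if sp > 0 else "down into the plenum"
    print(f"{fpm:>5} fpm: static {sp:+.3f} in. w.g., air moves {direction}")
```

At the same total pressure, the faster the air travels under the floor, the less static pressure is left to push it through perforated tiles; once velocity pressure exceeds total pressure, the static pressure under the floor falls below room pressure and air is drawn downward.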
New generation of IBM mainframes requires all air-side cooling
Starting with the CMOS processors introduced in 1995, IBM mainframe computers will no longer require a direct connection to chilled water. All heat will be rejected to air usually supplied from the raised floor plenum.
It is important to note the physical size of the new CMOS computers is substantially smaller than previous mainframes, and total power consumption is also reduced. As a result, the density of power consumed drops from 230 watts/sq. ft. to 95 watts/sq. ft. including the space required for service clearance within the mainframe footprint. But, the shift from a combination of both air and water heat rejection to exclusively air-side heat rejection will effectively triple the volume of cooling air required per square foot of raised floor. Initially this should not be a problem for most centers.
However, if the white space vacated by the former mainframe(s) is occupied by additional equipment, severe cooling capacity problems could develop. Installation of additional air-cooling capacity may not be fully successful, especially with raised floors less than 18″. Any existing air velocity problems will clearly be made substantially worse.
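To put the air-side load in terms of airflow, the usual sensible-heat relation for standard air can be used: CFM = BTU/hr / (1.08 x temperature rise in °F), with 1 watt equal to about 3.412 BTU/hr. The sketch below assumes a 20°F air temperature rise across the equipment, which is an illustrative value and not a figure from the article.

```python
BTU_PER_HR_PER_WATT = 3.412
SENSIBLE_FACTOR = 1.08  # BTU/hr per CFM per °F, standard air


def cfm_per_sqft(air_side_watts_per_sqft, delta_t_f=20.0):
    """Cooling airflow per square foot needed to carry away an air-side load."""
    btu_per_hr = air_side_watts_per_sqft * BTU_PER_HR_PER_WATT
    return btu_per_hr / (SENSIBLE_FACTOR * delta_t_f)


# 95 watts/sq. ft. rejected entirely to air, with an assumed 20°F rise.
print(f"{cfm_per_sqft(95.0):.1f} CFM per sq. ft.")
```

Because airflow scales linearly with the air-side load at a fixed temperature rise, tripling the heat rejected to air triples the volume of air that must come up through the floor.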
File server farms
Until recently, conventional wisdom was that the mainframe would be replaced by distributed end-user computing, and data centers would be a thing of the past. What we are actually seeing is wholesale relocation of file servers to the data center. While still atypical, we have been in several centers which already have 500 to 1,000 servers with an additional 25 to 50 servers arriving weekly. In order to best utilize available space, data center managers are stacking servers five and six high in special racks. This dense packing results in file server farms which have cooling requirements very similar to current IBM 3390 DASD farms.
This trend results in a vertical data center, as opposed to the former horizontal data center in which each computer hardware manufacturer had total responsibility for the packaging and internal air movement within its own footprint, as well as direct access to cooling from under the raised floor. Without similar engineering and attention to detail (which can only be done by the end user in a home-grown, mix-and-match vertical environment), dense packing is going to get many data centers into very serious reliability problems that will ultimately destroy their computer hardware investment.
Many of the closed-cabinet vertical racking systems we have seen are derived from bakery bread racks or industrial shelving. The shelves in such systems can block the convection flow of air, resulting in extreme temperature conditions at the top. Muffin fans are typically used to push air out the top. None of the systems have redundant fans or any type of alarming for when the single fan fails. Even with fans, we have measured temperatures above 100°F, which dramatically reduces long-term reliability. As a general rule of thumb, for every increase of 18°F above 70°F, long-term electronics reliability is reduced by 50%.
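The rule of thumb above can be turned into a quick estimate. The sketch below assumes the halving applies continuously between the 18°F steps, which is an interpretation for illustration; the rule itself only states the 50% reduction per 18°F above 70°F.

```python
def relative_reliability(temp_f, baseline_f=70.0, halving_step_f=18.0):
    """Fraction of baseline long-term reliability remaining at temp_f."""
    if temp_f <= baseline_f:
        return 1.0
    return 0.5 ** ((temp_f - baseline_f) / halving_step_f)


for temp in (70, 88, 100, 106):
    print(f"{temp}°F: {relative_reliability(temp):.0%} of baseline reliability")
```

Under this reading, equipment at the top of a rack running at 100°F retains only about a third of the long-term reliability it would have at 70°F.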
Planning for the future now
As companies consolidate their information assets into fewer data centers, it becomes increasingly important to make what remains as robust and fault tolerant as possible. Continuous information availability means the old practice of scheduled downtime to work on HVAC equipment will become a distant memory. All air-side cooling will put new stresses on mechanical systems revealing deficiencies which may have always existed, but were not show stoppers. Data center facility managers should now begin planning to meet increasing performance requirements.
A holistic approach is required to integrate data center HVAC equipment with other mechanical systems, with the electrical system, and with every other environmental infrastructure support system required to allow continuous, uninterrupted information availability. Management commitment to a higher level of infrastructure performance needs to begin early. Every piece of equipment, every system, and all the integrated systems that comprise a new data center's environmental infrastructure need to be designed to a common reliability, availability, and maintainability (RAM) standard. Systems must incorporate fault-tolerance capabilities which allow failures to occur without affecting the computer load. All maintenance and repair activities must be accomplished without an interruption in the delivery of cooling to the data center.
We will see more companies following the path of Pacific Bell and United Parcel Service in using new approaches to increase overall center reliability. One critical step is to introduce thermal storage to provide sufficient cooling ridethrough to match the electrical ridethrough of the uninterruptible power system’s batteries. We will no longer be able to assume the ambient air by itself can store sufficient cooling to ride through an extended power failure.
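A first-pass estimate of how much chilled water storage such a ridethrough implies can be made from the heat capacity of water (about 8.34 lb per gallon, 1 BTU per lb per °F) and the definition of a ton of cooling (12,000 BTU/hr). The sketch below is only a rough sizing aid: the 100-ton load, 15-minute ridethrough, and 10°F usable temperature rise are hypothetical, and it ignores tank mixing losses and the need to keep pumps and fans running on backup power.

```python
BTU_PER_TON_HOUR = 12_000   # one ton of cooling delivered for one hour
LB_PER_GALLON = 8.34        # weight of water
SPECIFIC_HEAT = 1.0         # BTU per lb per °F for water


def storage_gallons(load_tons, ridethrough_minutes, usable_delta_t_f):
    """Gallons of chilled water needed to absorb the load for the ridethrough."""
    btu_needed = load_tons * BTU_PER_TON_HOUR * ridethrough_minutes / 60.0
    btu_per_gallon = LB_PER_GALLON * SPECIFIC_HEAT * usable_delta_t_f
    return btu_needed / btu_per_gallon


# Hypothetical site: 100-ton load, 15 minutes of UPS battery time to match,
# and a 10°F usable rise in stored water temperature.
print(f"{storage_gallons(100, 15, 10):,.0f} gallons")
```

Doubling the ridethrough time or halving the usable temperature rise doubles the required volume, which is why the storage should be sized against the actual battery time of the uninterruptible power system.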
Every new HVAC system should be thoroughly commissioned to validate original design intent and demonstrate the actual delivered performance. And the work should be done by a third party independent of the installing contractor. For existing sites, the first step is diagnosing the actual operation of the existing mechanical plant. In case after case, we have found 10% to 30% of the existing capacity is either wasted or cannot be realized because of design deficiencies or operational problems.
The time to act is before problems occur. Certainly any data center which has had problems controlling temperature and humidity with water-cooled mainframes is going to encounter problems as the all air-cooled equipment is brought on line.
Diagnosing and then correcting deficiencies via a midlife tune-up can generate major savings (which can fund needed modifications) and significant performance improvements.
2000 Computersite Engineering, Inc.