What’s Your Downtime Worth?
By Matt Stansberry

If you’re looking at high availability for your hardware, chances are you’re considering a clustering option — using multiple standard boxes to form a highly available system. But according to fault-tolerant server vendors, companies need to take a closer look at redundant hardware before buying into potentially more expensive and more complicated clustered options.

A fault-tolerant server has redundant power supplies, processors and storage. Both processors execute every transaction simultaneously in lockstep, so if a hardware component fails, the system continues running without interruption. Technicians can then replace the malfunctioning hardware while the application keeps running on the redundant side.

Fault-tolerant servers are manufactured by specialty companies such as Santa Clara, Calif.-based NEC Solutions America Inc. and Maynard, Mass.-based Stratus Technologies, as well as by Hewlett-Packard with its NonStop line.

According to Brad Lightner, director of product and solution integration for NEC, fault-tolerant servers provide five nines of uptime (99.999%), which equates to roughly five minutes of downtime per year. He said clustering averages only three nines of uptime (99.9%), which works out to roughly eight hours and 46 minutes of downtime per year.
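The arithmetic behind those "nines" is straightforward: downtime is the complement of availability multiplied by the minutes in a year. The following sketch (function and variable names are illustrative, not from any vendor tool) reproduces the figures above:

```python
# Convert an availability percentage into expected annual downtime.
# A hypothetical helper for back-of-the-envelope planning, not a vendor SLA tool.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def annual_downtime_minutes(availability_pct: float) -> float:
    """Expected minutes of downtime per year at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for label, pct in [("five nines", 99.999), ("three nines", 99.9)]:
    minutes = annual_downtime_minutes(pct)
    hours, rem = divmod(minutes, 60)
    print(f"{label} ({pct}%): {int(hours)} h {rem:.1f} min of downtime per year")
```

Five nines comes out to about 5.3 minutes a year; three nines to about 8 hours and 46 minutes.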

Clustering has lower availability than fault-tolerant servers because of the time it takes for the clustered network to notice that a node has stopped functioning properly and begin a failover. The failover involves restarting applications and databases, which takes time and risks losing in-flight transactions.

Redundant hardware, on the other hand, runs mirrored applications on duplicate data. If one side fails, the other continues processing immediately, with no failover step.

Tony Iams, senior analyst with Port Chester, N.Y.-based Ideas International, agreed that fault-tolerant servers offer better uptime than clustering. “Clustering ensures your workload will be available, but doesn’t promise when it will come back online again. The downtime varies with the clustering tool, how the storage is managed and other factors,” Iams said.

Lightner said software applications aren’t optimized for clustering and the connecting middleware itself can be a point of failure. According to Lightner, vendors have pushed clustering because it sells more hardware and more operating system licenses, which can work out to cost more than the premium you pay for a fault-tolerant server.

Companies will need to work out their own cost equations based on what their downtime is worth, the premium they would pay for fault-tolerant hardware and other factors. But one hidden cost of clustering is maintenance.
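That cost equation can be sketched in a few lines. Every dollar figure and downtime estimate below is a hypothetical placeholder for illustration; readers would substitute their own hardware quotes, staffing costs and per-hour downtime value:

```python
# Back-of-the-envelope annual cost comparison for two availability options.
# All figures are hypothetical assumptions, not vendor pricing.

def annual_cost(hardware_premium: float, maintenance: float,
                downtime_hours: float, cost_per_hour: float) -> float:
    """Yearly total: extra hardware cost + upkeep + expected downtime losses."""
    return hardware_premium + maintenance + downtime_hours * cost_per_hour

COST_PER_HOUR = 10_000  # assumed business value of one hour of uptime

# Fault-tolerant server: pricier box, lower upkeep, ~5 minutes downtime/year.
ft = annual_cost(hardware_premium=30_000, maintenance=5_000,
                 downtime_hours=5 / 60, cost_per_hour=COST_PER_HOUR)

# Cluster: cheaper nodes, heavier management burden, ~8.75 hours downtime/year.
cluster = annual_cost(hardware_premium=15_000, maintenance=20_000,
                      downtime_hours=8.75, cost_per_hour=COST_PER_HOUR)

print(f"fault-tolerant: ${ft:,.0f}/yr  cluster: ${cluster:,.0f}/yr")
```

With these made-up inputs the fault-tolerant option wins; flip the per-hour downtime value low enough and the cluster does. The point is the structure of the comparison, not the numbers.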

“Managing a cluster is one of the biggest challenges to using them,” Iams said. “Clusters are notoriously difficult to install and manage, and multiple operating systems need to be updated and patched. The more images you have, the greater that cost is. You have to train your staff on them. Fault-tolerant vendors can claim to reduce that by offering a single image to users.”

But not all experts are quite as keen on fault-tolerant hardware. Each approach to high availability has its pros and cons, according to Gordon Haff, analyst with Nashua, N.H.-based Illuminata.

Haff said the big downside of fault-tolerant servers is that they don’t protect you from software faults, because they run a single operating system image. And because modern standard servers are already reasonably reliable, a growing share of failures originates in software.

“If you put in a fault-tolerant server, you’re eliminating hardware faults, but it’s not a magic bullet,” Haff said. “If you put an arbitrary mix of Microsoft applications on a fault-tolerant server, it’s not clear to what degree you’ve reduced your overall downtime.”

Nonetheless, Haff admits that fault-tolerant servers do offer increased uptime, even if it is incremental. And that’s where the market for the product thrives.

“Fault-tolerant servers sell in vertical markets where people are willing to pay a premium for incrementally improved uptime,” Haff said.

The bottom line for IT pros is determining what downtime is worth. Some companies can absorb downtime, and clustering wins out in those situations. But if you operate an emergency response department, a bank or some other mission-critical organization, the minutes spent waiting for your cluster to fail over may be more than you can afford.