Definitions for downtime vary from gentle to tough, and from simple to complex. Easy definitions are often given in terms of failed components, such as the server itself, disks, the network, the operating system, or key applications. Stricter definitions may include slow server or network performance, the inability to restore backups, or simple data inaccessibility. We prefer a very strict definition for downtime: If a user cannot get her job done on time, the system is down. A computer system is provided to its users for one purpose: to allow them to complete their work in an efficient and timely way. When circumstances prevent a user from doing this work, regardless of the reason, the system is down.
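The difference between a component-based definition and the strict, user-centered one can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `ProbeResult` record, the `is_down` and `availability` helpers, the sample probes, and the two-second deadline are all hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ProbeResult:
    """One hypothetical attempt by a user (or a probe) to complete a task."""
    succeeded: bool                   # did the task complete correctly?
    elapsed_seconds: Optional[float]  # None if the task never completed


def is_down(result: ProbeResult, deadline_seconds: float) -> bool:
    """Strict definition: down if the work failed or was not done on time."""
    if not result.succeeded or result.elapsed_seconds is None:
        return True
    return result.elapsed_seconds > deadline_seconds


def availability(results: Sequence[ProbeResult], deadline_seconds: float) -> float:
    """Fraction of attempts in which the user got the job done on time."""
    if not results:
        raise ValueError("need at least one probe result")
    up = sum(1 for r in results if not is_down(r, deadline_seconds))
    return up / len(results)


# Hypothetical sample: one fast success, one slow success, one failure,
# and one success that finishes inside the deadline.
samples = [
    ProbeResult(True, 0.2),
    ProbeResult(True, 5.0),
    ProbeResult(False, None),
    ProbeResult(True, 1.0),
]

strict = availability(samples, deadline_seconds=2.0)
lenient = sum(1 for r in samples if r.succeeded) / len(samples)
print(f"strict: {strict:.2f}, component-only: {lenient:.2f}")
```

Under the lenient view, the slow request counts as "up" because nothing actually failed. Under the strict view it counts as downtime, because the user did not finish on time.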
Failure Modes
In this section, we take a quick look at the things that can go wrong with computer systems and that can cause downtime. Some of them, especially the hardware ones, may seem incredibly obvious, but others will not.
Hardware
Hardware points of failure are the most obvious ones: the failures that people think of first when asked to provide such a list. And yet, as we saw in Figure 2.2 and Figure 2.3, they make up less than half (possibly just a little more than a quarter, depending on whose numbers you like better) of all system outages. However, when you have a hardware outage, you may be down for a long time if you don’t have redundancy built in. Waiting for parts and service people makes you a captive to the hardware failure.
The components that will cause the most failures are those with moving parts, especially those associated with high speeds, low tolerances, and complexity. Having all of those characteristics, disks are prime candidates for failures. Disks also have controller boards and cabling that can break or fail. Many hardware disk arrays have additional failure-prone components such as memory for caching, or hardware for mirroring or striping.
Tape drives and libraries, especially digital linear tape (DLT) libraries, have many moving parts, motors that stop and start, and extremely low tolerances. They also have controller boards and many of the same internal components that disk drives have, including memory for caching.
Fans are the other components with moving parts. The failure of a fan may not cause immediate system failure the way a disk drive failure will, but when a machine’s cooling fails, the effects can be most unpredictable. When CPUs and memory chips overheat, systems can malfunction in subtle ways. Many systems do not have any sort of monitoring for their cooling, so cooling failures can catch many system administrators by surprise.
It turns out that fans and power supplies have the worst MTBFs of all system components. Power supplies can fail hard and fast, resulting in simple downtime, or they can fail gradually.
The gradual failure of a power supply can be a very nasty problem, causing subtle, sporadic failures in the CPU, memory, or backplane. Power supply failures are caused by many factors, including varying line voltage and the stress of being turned on and off.
To cover for these shortcomings, modern systems have extra fans, extra power supplies, and superior hardware diagnostics that provide for problem detection and identification as quickly as possible. Many systems can also “call home.” When a component fails, the system can automatically call the service center and request maintenance. In some cases, repair people arrive on-site to the complete surprise of the local staff.
Of course, failures can also occur in system memory and in the CPU. Increasing numbers of modern systems are able to configure a failed component out of the system without a reboot. This may or may not help with intermittent failures in memory or the CPU, but it will definitely help availability when a true failure occurs.
There are other hardware components that can fail, although they do so very infrequently. These include the backplane, the various system boards, the cabinet, the mounting rack, and the system packaging.
Environmental and Physical Failures
Failures can be external to the system as well as internal. There are many components in the environment that can cause system downtime, yet these are rarely considered as potential points of failure. Most of these are data center-related, but many of them can impact your servers regardless of their placement. In many of these situations, having a standby server will not suffice, because the entire environment may be affected.
The most obvious environmental problem is a power failure. Power failures (and brownouts) can come from your electric utility, or they can occur much more locally. A car can run into the light pole in front of your building. The failure of a circuit breaker or fuse, or even a power strip, can shut your systems down.
The night cleaning crew might unplug some vital system in order to plug in a vacuum cleaner, or their plugging in the vacuum cleaner may overload a critical circuit. The environmental cooling system can fail, causing massive overheating in all of the systems in the room.
Similarly, the dehumidifying system can fail (although that failure is not going to be as damaging to the systems in the room as a cooling failure).
Most data centers contain rats’ nests of cables, under the floor and out the back of the racks and cabinets. Cables can break, and they can be pulled out. And, of course, a sudden change in the laws of physics could result in copper no longer conducting electricity. (If that happens, you probably have bigger problems.)
Most data centers have fire protection systems. Halon is still being removed from data centers (apparently they get one more Halon incident, and that’s it; Halon systems cannot be refilled), but the setting off of one of these fire protection systems can still be a very disruptive event. One set of problems ensues when the fire is real, and the protection systems work properly and put the fire out. The water or other extinguishing agent can leave a great deal of residue and can leave the servers in the room unfit for operation. Halon works by displacing the oxygen in the room, which effectively chokes off the fire. Of course, displaced oxygen could be an issue for any human beings unfortunate enough to be in the room at the time. Inergen Systems (www.inergen.com) makes newer systems that are more environmentally sound and friendlier to oxygen-breathing life, and that can be dropped directly into Halon systems (there are competing systems as well). The fire itself can also cause significant damage to the environment.
One certainly hopes that when a fire protection system is put into action, the fire is real. But sometimes it isn’t, and the fire protection system goes off when no emergency exists. This can leave the data center with real damage caused solely by a mistake.
The other end of the spectrum is when a fire event is missed by the protection system. The good news is that there will be no water or other fire protection system residue.
The bad news is that your once-beautiful data center may now be an empty, smoldering shell. Or worse.
Another potential environmental problem is the structural failure of a supporting component, such as a computer rack or cabinet. Racks can collapse or topple when not properly constructed. If shelves are not properly fastened, they can come loose and crash down on the shelves beneath them.
Looming above the cabinets in most data centers are dusty ceilings, usually with cables running through them. Ceilings can come tumbling down, raining dust and other debris onto your systems, where it is sucked in by the cooling fans.
Many data centers have some construction underway while active systems are operating nearby.
Construction workers in work boots bring heavy-duty equipment in with them and may not have any respect for the production systems that are in their way. Cables get kicked or cut, and cabinets get pushed slightly (or not so slightly) and can topple. While construction workers are constructing, they are also stirring up dust and possibly cutting power to various parts of the room. If they lay plastic tarps over your equipment to protect it from dust, the equipment may not receive proper ventilation and may overheat.
And then there are the true disasters: earthquakes, tornadoes, floods, bombs and other acts of war and terrorism, or even locusts.
It is important to note that some high-end fault-tolerant systems may be impacted by environmental and power issues just as badly as regular availability systems.
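A rough calculation shows why redundancy inside the data center cannot protect against failures of the shared environment. The sketch below uses the standard series and parallel availability formulas, which assume that component failures are independent. The availability figures for the server and for the room (power, cooling, and so on) are invented for illustration only.

```python
def parallel(*availabilities: float) -> float:
    """Redundant components: the group is up if at least one member is up.

    Assumes the members fail independently of one another.
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def series(*availabilities: float) -> float:
    """Dependent chain: the whole is up only if every member is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figures, chosen only to make the arithmetic visible.
server = 0.99   # one server, considered by itself
room = 0.999    # the shared environment: power, cooling, the building

single = series(server, room)                         # one server in the room
redundant = series(parallel(server, server), room)    # two servers, same room

print(f"single server:  {single:.6f}")
print(f"redundant pair: {redundant:.6f}")
print(f"ceiling set by the room: {room:.6f}")
```

Adding the second server removes most of the server-level downtime, but the result can never exceed the availability of the room that both servers share. Under these assumptions, doing better than that ceiling requires improving the environment itself or placing the standby somewhere that does not share it.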
