Cost of Downtime Is Not a Constant
Further complicating matters is the fact that the cost of downtime is not a constant. We will assume it to be constant for the purposes of our calculations (it makes them much, much simpler), but in reality, the cost of downtime increases as the duration of an outage increases. Consider again the effects of downtime on an e-commerce site. If the site suffers a brief outage (a few seconds), the cost will be minimal, perhaps even negligible. An outage of a minute or less probably will not affect business too badly: All but the most disloyal users will simply hit their browser’s reload button and try again. A 30-minute outage will cause some customers to take their business to a competitor’s site; others will be patient and keep trying. An outage of several hours will likely cause all but the most loyal customers to take their business elsewhere and will cause some of them to never return.
An outage that lasts days could result in the total failure of the business. Once a customer is lost and that customer has a more pleasant experience on a competitor’s web site, the customer will bookmark the other site and likely not return to yours. Depending on the nature of your business, an outage at 1:00 A.M. may cost less than an outage at 1:00 P.M. (Then again, it could cost more.) An outage in mid-December may cost a lot more than an outage in mid-August. Repeated, intermittent failures of 15 minutes apiece that total two hours will likely cost you more than a single two-hour outage, because multiple outages can cause more user frustration than a one-time outage.
For the purposes of the calculations in this article, we will keep things simple and assume that the cost of downtime is a constant. The added complexity of a formula for a variable cost of downtime would not help make our points any better. As a rule of thumb, use the cost of downtime for an outage of roughly an hour. (If you have calculated more precise values, then by all means use them.)
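To make the simplification concrete, the following minimal sketch compares the constant-cost assumption with one hypothetical duration-dependent model. The hourly cost, the exponent, and the function names are all assumptions invented for illustration; none of them come from this article. The escalating model is deliberately calibrated to agree with the constant model at one hour, in line with the rule of thumb above.

```python
# Illustrative only: every number and name below is an assumption made up
# for this sketch, not a figure taken from the article.

HOURLY_COST = 10_000.0  # assumed cost of a one-hour outage, in dollars


def constant_cost(minutes: float) -> float:
    """Cost under the simplifying assumption: a fixed rate per hour."""
    return HOURLY_COST * minutes / 60.0


def escalating_cost(minutes: float, exponent: float = 1.5) -> float:
    """One hypothetical model in which cost grows faster than duration.

    With an exponent above 1, short outages cost less than the constant
    model predicts and long outages cost more. Both models give the same
    result for an outage of exactly one hour.
    """
    hours = minutes / 60.0
    return HOURLY_COST * hours ** exponent


# Compare the two models for a brief, a medium, and a long outage.
for outage_minutes in (1, 30, 60, 240):
    print(
        outage_minutes,
        round(constant_cost(outage_minutes), 2),
        round(escalating_cost(outage_minutes), 2),
    )
```

Under these assumed numbers the constant model overstates the cost of very short outages and understates the cost of long ones, which is the trade-off accepted in exchange for simpler calculations. The sketch does not attempt to model time of day, season, or the extra frustration caused by repeated intermittent failures.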
The Availability Continuum
Taken to its logical limit, the definition of high availability cited in the preceding text implies that every system can potentially have a different threshold of availability before it achieves high availability. This is absolutely correct. Computer systems vary widely in their tolerance of downtime. Some systems cannot handle even the briefest interruption in service without catastrophic results, whereas others can handle brief interruptions, but not extended ones, and others can even handle extended outages while still delivering on their required returns on investment.
Consider the following examples:
- Computers whose failures cause a loss of human life, such as hospital life support systems or avionics systems, generally have the highest requirements for availability.
- Slightly less critical computers can be found running e-commerce web sites such as amazon.com and ebay.com, or managing and performing equities trading at brokerage institutions around the world.
- Systems that operate an assembly line or other important production activities might idle hundreds of workers when they fail, while other parts of the business can continue.
- Computers in a university’s computer science department may be able to stand a week’s downtime, while professors postpone assignment deadlines and teach other material, without a huge impact.
- The computer that manages the billing function at a small company could be down through a whole billing cycle if backup procedures to permit manual processing are in place.
- A computer that has been retired from active service and sits idle in a closet has zero availability.
In fact, there is a whole range of possible availability levels, from an absolute requirement of 100 percent down to a low level where it just doesn’t matter if the computer is running or not, as in the last bullet. We call this range the Availability Continuum, and it is depicted graphically in Figure 3.1. Every computer (in fact, every system of any kind, but let’s not overextend) in the world has availability requirements that place it somewhere on the Continuum. The hard part is figuring out where your system’s availability requirements place it on the Continuum, and then matching those requirements to a set of protective technologies.
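One way to get a feel for positions on the Continuum is to translate an availability percentage into the downtime it permits per year. The sketch below is a minimal illustration of that arithmetic only; the function name is hypothetical, and it assumes a 365-day year and that availability is measured around the clock rather than only during business hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


# Walk from the top of the Continuum toward the bottom.
for percent in (100.0, 99.99, 99.9, 99.0, 95.0, 0.0):
    print(percent, round(allowed_downtime_hours(percent), 2))
```

A requirement of 100 percent leaves no room for any outage, while each step down the range permits roughly ten times more downtime than the step above it. A percentage alone says nothing about whether that downtime arrives as one long outage or many short ones.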
Although it is best to determine the appropriate level of availability for each system once and not change it, in practice just about all systems tend to drift higher on the Continuum over time. In Figure 3.1, we chose a few different types of systems and indicated where on the Continuum they might fall. Fixing a critical system's point on the Continuum up front matters because it is simpler and more straightforward to design availability into a system when it is first deployed than to add incremental improvements later. The incremental improvements that follow reevaluations of availability needs invariably cause additional downtime, and in many cases the enhancements are not as reliable as they might have been if they had been installed at the system’s initial deployment.
Systems drift higher on the Continuum over time because cost of implementation is an important aspect of determining exactly how well protected a system needs to be. Less enlightened budget holders tend to give less money for needed protective measures than they should. When a failure occurs that falls outside the set of failures that the system has been designed to survive, the system will go down, probably for an extended period. When it becomes apparent to the budget holder that this outage is costing his business a lot of money, he will likely approve additional spending to protect against future failures. When he does that, he is nudging the system up the Continuum. The higher a system needs to be on the Continuum, the more it costs to get it there. The higher cost is necessary because in order to achieve higher levels of availability, you need to protect against more varieties of more complicated and less frequently occurring outages.
