System uptime availability
When you discuss availability requirements with a user or project leader, he will invariably tell you that 100 percent availability is required: “Our project is so important that we can’t have any downtime at all.” But the tune usually changes when the project leader finds out how much 100 percent availability would cost. Then the discussion becomes a matter of money, and more of a negotiation process. As you can see in Table 2.1, for many applications, 99 percent uptime is adequate. If the systems average an hour and a half of downtime per week, that may be satisfactory. Of course, a lot of that depends on when the hour and a half occurs. If it falls between 3:00 A.M. and 4:30 A.M. on Sunday, that is going to be a lot more tolerable on many systems than if it occurs between 10:00 A.M. and 11:30 A.M. on Thursday, or every weekday at 2:00 P.M. for 15 or 20 minutes.
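As a rough illustration of the arithmetic behind figures like these, the following Python sketch converts an availability percentage into the average downtime it permits. The function name `allowed_downtime_hours` and the constants are illustrative, not taken from the text, and the result is only an average: it says nothing about when the downtime falls.

```python
HOURS_PER_WEEK = 7 * 24      # 168 hours
HOURS_PER_YEAR = 365 * 24    # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours):
    """Average downtime, in hours, that an availability level permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100 - availability_percent) / 100 * period_hours


if __name__ == "__main__":
    for level in (99, 99.9, 99.99):
        weekly = allowed_downtime_hours(level, HOURS_PER_WEEK)
        yearly = allowed_downtime_hours(level, HOURS_PER_YEAR)
        print(f"{level}% uptime: {weekly * 60:.1f} min/week, {yearly:.1f} h/year")
```

At 99 percent this works out to about 1.7 hours per week, which is in line with the hour-and-a-half figure used above.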
One point of negotiation is the hours during which 100 percent uptime may be required. If it is only needed for a few hours a day, that goal is quite achievable. For example, when brokerage houses trade between the hours of 9:30 A.M. and 4:00 P.M., then during those hours, plus perhaps 3 or 4 hours on either side, 100 percent uptime is required. A newspaper might require 100 percent uptime during production hours, but not the rest of the day. If, however, 100 percent uptime is required 7 × 24 × 365, the costs become so prohibitive that only the most profitable applications and large enterprises can consider it, and even if they do, 100 percent availability is almost impossible to achieve over the long term.
As you move progressively to higher levels of availability, costs increase very rapidly. Consider a server (abbott) that, with no special protective measures taken except for disk mirrors and backups, delivers 99 percent availability. If you couple that server with another identically configured server (costello) that is configured to take over from abbott when it fails, and that server also offers 99 percent availability, then theoretically, you can achieve a combined availability of 99.99 percent. Mathematically, you multiply the downtime on abbott (1 percent) by the uptime on costello (99 percent); costello will only be in use during abbott’s 1 percent of downtime. The result is 0.99 percent. Add the original 99 to 0.99, and you get 99.99 percent, the theoretical uptime for the combined pair.
Of course, in reality 99.99 percent will not occur simply by combining two servers. The increase in availability is not purely linear. It takes time for the switchover (usually called a failover) to occur, and during that period, the combined server is down. In addition, there are external failures that will affect access to both servers, such as network connectivity or power outages. These failures will undoubtedly decrease the overall availability figures below 99.99 percent.
However, we only use the “nines” for modeling purposes. In reality, we believe that the nines have become an easy crutch for system and operating system vendors, allowing them to set unrealistic expectations for uptime.
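The abbott and costello arithmetic above can be sketched in a few lines of Python. The function names, and the failover count and duration used in the example, are hypothetical values chosen for illustration; the model assumes the two servers fail independently and ignores the external failures mentioned above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def failover_pair_availability(primary, standby):
    """Theoretical availability (as a fraction) of a primary with a standby.

    The standby only matters during the primary's downtime, so its uptime
    is weighted by the primary's downtime fraction.
    """
    return primary + (1 - primary) * standby


def subtract_failover_time(availability, failovers_per_year, minutes_per_failover):
    """Reduce availability by the time spent switching over (assumed values)."""
    lost_fraction = failovers_per_year * minutes_per_failover / MINUTES_PER_YEAR
    return max(0.0, availability - lost_fraction)


if __name__ == "__main__":
    ideal = failover_pair_availability(0.99, 0.99)
    print(f"theoretical pair: {ideal:.2%}")
    # Hypothetical: four failovers a year, five minutes each.
    realistic = subtract_failover_time(ideal, 4, 5)
    print(f"with failover time: {realistic:.4%}")
```

Even a handful of short failovers pulls the pair below its theoretical 99.99 percent, before any network or power failures are counted.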
The Myth of the Nines
We’ve seen a number of advertisements proclaiming “five nines” or more of availability. This is a nice generalization to make for marketing materials, because we can measure the mean time between failures (MTBF) of a hardware system and project its downtime over the course of a year. System availability, however, is based on software configurations, load, user expectations, and the time to repair a failure. Before you aim for a target number of nines, or judge systems based on their relative proclaimed availability, make sure you can match the advertised number against your requirements. The following are considerations to take into account when evaluating the desired availability:
Nines are an average. Maximum outages, in terms of the maximum time to repair, are more important than the average uptime.
Nines only measure that which can be modeled. Load and software are hard to model in an average case; you will need to measure your actual availability and repair intervals for real systems, running real software loads.
Nines usually reflect a single system view of the world. Quick: Think of a system that’s not networked but important. Reliability has to be based on networks of computers, and the top-to-bottom stack of components that make up the network. The most reliable, fault-tolerant system in the world is useless if it sits behind a misconfigured router.
Computer system vendors talk about “nines of availability,” and although nines are an interesting way to express availability, they miss some essential points. All downtime is not created equal. If an outage drives away customers or users, then it is much more costly than an outage that merely inconveniences those users. But an outage that causes inconvenience is more costly to an enterprise than an outage that is not detected by users.
Consider the cost of downtime at a retail e-commerce web site such as amazon.com or ebay.com. If, during the course of a year, a single 30-minute outage is suffered, the system has an apparently respectable uptime of 99.994 percent. If, however, the outage occurs on a Friday evening in early December, it costs a lot more in lost business than the same outage would if it occurred on a Sunday at 4:00 A.M. local time in July. Availability statistics do not make a distinction between the two. Similarly, if an equities trading firm experiences a 30-minute outage 5 minutes before the Federal Reserve announces a surprise change in interest rates, it would cost the firm considerably more than the same outage would on a Tuesday evening at 8 P.M., when no rate change, and indeed, little activity of any kind, was in the offing.
Consider the frustration level of a customer or user who wants to use a critical system. If the 30-minute outage comes all at once, then a user might leave and return later or the next night, and upon returning, stay if everything is OK. However, if the 30 minutes of downtime is spread over three consecutive evenings at the same time, users who try to gain access each of those three nights and find systems that are down will be very frustrated. Some of them will go elsewhere, never to return. (Remember the rule of thumb that says it costs 10 times more to find a new customer than it does to retain an old one.)
Many system vendors offer uptime guarantees, where they claim to guarantee specific uptime percentages. If customers do not achieve those levels, then the vendor is contractually bound to pay their customers money or provide some other form of giveback.
There are so many factors that are out of the control of system vendors, and are therefore disallowed in the contracts, that those contracts seldom have any teeth, and even more seldom pay off. Compare, for instance, the potential reliability of a server located in a northern California data center where, in early 2001, rolling power blackouts were a way of life, with a server in, say, Minnesota, where heavy winter snow is expected and does not usually affect electric utility service. Despite those geographical differences, system vendors offer the same contractual uptime guarantees in both places. A system vendor cannot reasonably be expected to guarantee the performance of a local electric power utility, wide area network provider, or the data center cooling equipment. Usually, those external factors are specifically excluded from any guarantees.
The other problem with the nines is that availability is a chain, and any failed link in the chain will cause the whole chain to fail. Consider the diagram in Figure 2.1, which shows a simple representation of a user sitting at a client station and connected to a network over which he is working. If the seven components in the figure (client station, network, file server and its storage, and the application server, its application, and its storage) have 99.99 percent availability each, that does not translate to an end user seeing 99.99 percent availability. To keep the math simple, let’s assume that all seven components have exactly the same level of expected availability, 99.99 percent. In reality, of course, different components have different levels of expected availability, and more complex components such as networks will often have lower levels.
The other assumption is that multiple failures do not occur at the same time (although they can, of course, in real life); that would needlessly complicate the math.
Availability of 99.99 percent over each of seven components yields a simple formula of 0.9999 to the seventh power, which works out to 99.93 percent. That may not sound like a huge difference, but the difference is actually quite significant:
Availability of 99.99 percent spread over a year is about 52 minutes of downtime.
Availability of 99.93 percent spread over a year is over 6 hours of downtime.
Another way to look at the math is to consider that for all practical purposes, the seven components will never be down at the same time. Since each component will be responsible for 52 minutes of downtime per year (based on 99.99 percent availability), 7 times 52 is 364 minutes, or just over 6 hours per year, or 99.93 percent.
The actual path from user to servers is going to be much more complicated than the one in Figure 2.1. For example, the network cloud is made up of routers, hubs, and switches, any of which could fail and thereby lower network availability. If the storage is mirrored, then its availability will likely be higher, but the value will surely vary. The formulas also exclude many other components that could cause additional downtime if they were to fail, such as electric power or the building itself.
Consider another example. Six of the seven components in the chain deliver 99.99 percent availability, but the seventh only achieves 99 percent uptime. The overall availability percentage for that chain of components will be just 98.94 percent. Great returns on investment can be achieved by improving the availability of that weakest link.
So, while some single components may be able to deliver upwards of 99.99 percent availability, it is much more difficult for an entire system, from user to server, to deliver the same level. The more components there are in the chain and the more complex the chain, the lower the overall availability will be. Any bad component in the chain can lower overall availability, but there is no way for one good component to raise it above the level of the weakest link.
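The chain calculations above are easy to reproduce. This Python sketch multiplies component availabilities together, under the same simplifying assumptions the text makes: a failure of any one component takes the whole chain down, and failures are independent. The function name `chain_availability` is illustrative, and the third case uses a made-up improvement to the weakest link.

```python
import math


def chain_availability(components):
    """Availability of components in series, each given as a fraction."""
    return math.prod(components)


if __name__ == "__main__":
    uniform = chain_availability([0.9999] * 7)
    weak_link = chain_availability([0.9999] * 6 + [0.99])
    improved = chain_availability([0.9999] * 6 + [0.999])  # hypothetical upgrade

    for label, value in (("seven at 99.99%", uniform),
                         ("one link at 99%", weak_link),
                         ("weak link raised to 99.9%", improved)):
        downtime_hours = (1 - value) * 365 * 24
        print(f"{label}: {value:.2%} available, {downtime_hours:.1f} h down per year")
```

The first two cases reproduce the 99.93 and 98.94 percent figures from the text; the third suggests how much of the gap closes when only the weakest link is improved.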
