Your system fails because the operating system panics

an article added by: Ben Smeider at 11272007



Renewability

Let’s say your system fails because the operating system panics. It reboots, restarts applications such as web servers and databases, and continues on as before the failure. What’s the probability of another failure due to an operating system panic? In all likelihood, it’s exactly the same as it was before the reboot. There are many cases, however, in which repairing a system changes the MTBF characteristics of the system, increasing the probability of another failure in the near-term future. When you replace a punctured tire on your car with the “doughnut” spare tire, the MTBF for tire problems isn’t the same as when you were running on four original tires; the doughnut has speed and distance restrictions on it that make it less reliable than a new tire. You’ve repaired your car so that it’s functional again, but you haven’t restored the car to its original MTBF specifications.

The concept of repairing a system such that its MTBF remains the same is called renewability. Systems that aren’t renewable degrade over time. Software, in general, may not be renewable because of issues like memory leaks or memory corruption that increase the probability of failure over time. Fixing one failure may not restore the system to its original state; if you’ve ever done a regular reboot or preventative maintenance reboot, you’ve aimed to make a system renewable the brute-force way. In all of the examples and scenarios we describe, we are aiming to make systems renewable. Repairing a failed component, whether hardware or software, shouldn’t affect the expected time before the next failure. When evaluating an availability technique, the key question to ask is “Will this repair restore the system to its original state so that all of my assumptions about failure modes, failure rates, and repair processes are the same as they were before I made the repair?” Answer “yes” and you can have confidence that your MTBF numbers will stand up after a series of failures and associated repairs.
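The difference between a renewable and a non-renewable repair can be sketched in a few lines of Python. This is only an illustration, not a model taken from the article: it assumes a hypothetical system that loses a fixed fraction of its MTBF with every repair, and the function name `mtbf_after_repairs`, the `degradation` parameter, and the 1,000-hour starting figure are all invented for the example.

```python
def mtbf_after_repairs(initial_mtbf_hours, repairs, degradation=0.0):
    """Expected MTBF after a given number of repairs.

    degradation is a hypothetical fraction of MTBF lost per repair.
    A value of 0.0 models a renewable system: every repair restores
    the original MTBF.
    """
    if not 0.0 <= degradation < 1.0:
        raise ValueError("degradation must be in [0, 1)")
    return initial_mtbf_hours * (1.0 - degradation) ** repairs


# Renewable system: the MTBF is unchanged no matter how many repairs occur.
renewable = [mtbf_after_repairs(1000.0, n) for n in range(4)]

# Non-renewable system (assumed 10 percent loss per repair): the MTBF
# shrinks with each repair, like driving on the doughnut spare.
degrading = [mtbf_after_repairs(1000.0, n, degradation=0.1) for n in range(4)]
```

With a degradation of zero the list stays flat, which is the property the key question above is testing for. Any nonzero value means the MTBF figures used for planning stop being valid after the first repair.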

Sigmas and Nines

Six-sigma methodology is another popular trend that drives us to be data-focused and process-intense. The heart of six-sigma methodology is to measure something, find out where defects are being introduced, and then remove the source of the defects so that the resulting process shows fewer than six defects per million opportunities for a defect (six sigmas, or standard deviations, away from the mean; the figure conventionally quoted for six sigma is 3.4 defects per million opportunities). Though this methodology is most commonly used for manufacturing processes and hard goods, it has applicability to reliability of networked systems as well. Instead of thinking about six sigma as a search-and-destroy process for defects, think about it as a way of reducing variation.

  • What are the values that users find most critical in your systems? Response time? Consistent behavior? Correct behavior? These are the critical-to-quality (CTQ) variables that you can measure.
  • Define failures, or defects, based on these CTQs. If a transaction is expected to complete in 10 seconds, and it runs for 30 seconds but eventually completes correctly, is that a failure? Is it a defect?
  • Can you relate these user CTQs to components in the system? Where are the defects introduced? What are the sources of variation, and how can you control those system components by changing their availability characteristics?

Six-sigma methodology can take your thinking about availability from the binary uptime-versus-downtime model to one in which you look at the user experience. It requires that you measure variables that are related to things you can control, reducing variability by removing the cause of defects in the process. If you’ve defined a long-running transaction as a defect, then capacity planning and resource allocation become part of your remediation. If that variability in response time is caused by system behavior during a failover, then you may have to design for a more complex recovery model.
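As a rough sketch of how a response-time CTQ turns into a defect count, the Python below treats each transaction as one opportunity and counts it as a defect when it exceeds a time limit, even if it eventually completes correctly. The 10-second limit comes from the example in the list above; the function name `defects_per_million` and the sample durations are made up for illustration.

```python
def defects_per_million(durations_seconds, limit_seconds=10.0):
    """Defects per million opportunities for a response-time CTQ.

    Each transaction is one opportunity. It counts as a defect when it
    runs longer than limit_seconds, even if it completes correctly.
    """
    if not durations_seconds:
        raise ValueError("need at least one transaction")
    defects = sum(1 for d in durations_seconds if d > limit_seconds)
    return defects / len(durations_seconds) * 1_000_000


# Hypothetical measured transaction times, in seconds.
sample = [2.1, 3.4, 30.0, 4.8, 9.9, 12.5, 1.0, 6.2]
dpmo = defects_per_million(sample)
```

Whether the 30-second transaction should count as a defect is exactly the definitional question raised above; changing `limit_seconds` changes the answer, which is why the CTQ has to be agreed on before anything is measured.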

The Value of Availability

Fundamentally, high availability is a business decision. Computers cost money to operate. They cost additional money should they fail to operate when they are expected to. But the fundamental reason enterprises invest in computers (or anything else, for that matter) is to make them money. Computers enable an organization to perform tasks it could not perform without the computer. Computers can do things that people cannot; they do things faster and more cheaply and more accurately than people can. (Not everything, but many things.) When a computer is not performing the function for which it was purchased, it is not making its owners money; it is, instead, costing them money. Since downtime can, unchecked, go on forever, there is ostensibly no limit to the costs that a down computer might generate.

What Is High Availability?

There was a period of time when your authors debated taking the phrase “high availability” out of the title of this article. The argument for doing so was that the term had become so muddied by vendor marketing organizations that it had lost all meaning. The argument against removing it was that there was no other term that so well summed up what we were trying to accomplish with the article. In the end, we decided that if we took “high availability” out of the title, nobody would ever be able to find the article, and if we let it stay, we would have the opportunity to define it ourselves. So we left it in. Think of it as a marketing decision.

If you ask around, you’ll find that there really is no hard definition for high availability or a firm threshold that determines whether or not a particular system has achieved it. Vendors have molded the term to fit their needs. Just about every system and OS vendor with a marketing department claims to deliver high availability in one form or another. The truth is that despite claims of 7 × 24 × whatever, or some number of nines, those claims mean remarkably little in practical day-to-day system availability.

The Storage Networking Industry Association (SNIA) has an excellent online technical dictionary (www.snia.org/dictionary), in which they define high availability as follows:

The ability of a system to perform its function continuously (without interruption) for a significantly longer period of time than the reliabilities of its individual components would suggest. High availability is most often achieved through failure tolerance. High availability is not an easily quantifiable term. Both the bounds of a system that is called highly available and the degree to which its availability is extraordinary must be clearly understood on a case-by-case basis.

Availability is pretty clearly defined, but it’s high that is the problem. Is a 20-story building high? In Manhattan, Kansas, it would be, but in Manhattan, New York, a 20-story building is lost in the crowd. It’s very much a relative term. How high is up? How up is high? How available does something have to be for it to be highly available? Greater than normal? What is normal, and who defines it? Again, not much help in these definitions.

Developing a practical definition for high availability will require still another approach. Consider why someone implements a computer system. Someone spends money to purchase (or lease) a computer system. The goal, as it is with any business expenditure, is to get some sort of return, or value back, on that spending. Money that is spent with the intent of getting value back is an investment. The goal, then, is to achieve an appropriate return on the investment made to implement the computer system.

The return on that investment need not be directly monetary. In an academic environment, for example, the return may be educational. A computer science department at a university or high school buys computers with the noble goal of teaching its students how to use those computers. Of course, in the long run, a computer science department that develops a good reputation gets a financial return in increased attendance in classes and tuition.

The educational computers at a university would not be considered critical by most commercial enterprises, but if those computers are down for so much of the time during a semester that students are unable to complete their assignments, then the computers are not able to generate an appropriate return on the financial investment placed in them. If these outages occur often enough, and last long enough, the department may develop a reputation for having lousy computers, or lousy computer administration, which, in either case, reflects very poorly on the department and could, over time, affect enrollment. The same is true for any computers at any enterprise; computers that are down are not doing the job for which they were implemented.

Consider, then, that a system is highly available when it is available enough of the time to generate the return for which it was implemented in the first place. To be fair, it requires a clear vision of the future to know whether a system is adequately protected against all possible events, and that is plainly impossible. So, we consider high availability to be a design goal rather than an actual design. When a system is designed, it should be clear to its designers what requirements it has for availability. If the system is truly designed to those requirements, then the system is highly available. Our definition of high availability, therefore, is as follows:

High availability, n. A level of system availability implied by a design that is expected to meet or exceed the business requirements for which the system is implemented.

High availability, then, is a trade-off between the cost of downtime and the cost of the protective measures that are available to avoid or reduce downtime.
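The trade-off in that last sentence can be made concrete with a back-of-the-envelope calculation. The Python sketch below is illustrative only: it assumes the cost of downtime is a constant hourly figure, and the three design options, their availability percentages, their protection costs, and the 1,000-per-hour downtime cost are all hypothetical numbers chosen for the example.

```python
HOURS_PER_YEAR = 24 * 365


def annual_downtime_hours(availability_percent):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


def total_annual_cost(availability_percent, protection_cost, downtime_cost_per_hour):
    """Cost of protective measures plus the cost of the remaining downtime."""
    downtime = annual_downtime_hours(availability_percent)
    return protection_cost + downtime * downtime_cost_per_hour


# Hypothetical design options: (availability percent, annual protection cost).
options = {
    "no protection": (99.0, 0.0),
    "failover pair": (99.9, 50_000.0),
    "multi-site": (99.99, 400_000.0),
}

# Assumed constant downtime cost of 1,000 per hour.
costs = {
    name: total_annual_cost(avail, cost, 1_000.0)
    for name, (avail, cost) in options.items()
}
cheapest = min(costs, key=costs.get)
```

Under these assumed numbers the middle option has the lowest total cost: the most available design spends more on protection than it saves in downtime. With a different downtime cost the ranking changes, which is the sense in which high availability is a business decision and not a fixed number of nines.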
