Your system fails because the operating system panics
Let’s say your system fails because the operating system panics. It reboots, restarts applications such as web servers and databases, and continues on as before the failure. What’s the probability of another failure due to an operating system panic? In all likelihood, it’s exactly the same as it was before the reboot. There are many cases, however, in which repairing a system changes the MTBF characteristics of the system, increasing the probability of another failure in the near-term future. When you replace a punctured tire on your car with the “doughnut” spare tire, the MTBF for tire problems isn’t the same as when you were running on four original tires; the doughnut has speed and distance restrictions on it that make it less reliable than a new tire. You’ve repaired your car so that it’s functional again, but you haven’t restored the car to its original MTBF specifications.
The concept of repairing a system such that its MTBF remains the same is called renewability. Systems that aren’t renewable degrade over time. Software, in general, may not be renewable because of issues like memory leaks or memory corruption that increase the probability of failure over time. Fixing one failure may not restore the system to its original state; if you’ve ever done a regular reboot or preventative maintenance reboot, you’ve aimed to make a system renewable the brute force way. In all of the examples and scenarios we describe, we are aiming to make systems renewable. Repairing a failed component, whether hardware or software, shouldn’t affect the expected time before the next failure. When evaluating an availability technique, the key question to ask is “Will this repair restore the system to its original state so that all of my assumptions about failure modes, failure rates, and repair processes are the same as they were before I made the repair?” Answer “yes” and you can have confidence that your MTBF numbers will stand up after a series of failures and associated repairs.
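To make the difference between a renewable and a non-renewable repair concrete, here is a minimal sketch. It assumes a constant failure rate (an exponential failure model), which the text does not specify, and the MTBF figures and the 30-day window are hypothetical numbers chosen only for illustration.

```python
import math


def failure_probability(mtbf_hours: float, window_hours: float) -> float:
    """Probability of at least one failure within the window.

    Assumes a constant failure rate (exponential model); this is an
    illustrative simplification, not something the article prescribes.
    """
    return 1.0 - math.exp(-window_hours / mtbf_hours)


# Hypothetical figures for illustration only.
original_mtbf = 10_000.0   # MTBF before the failure, in hours
degraded_mtbf = 2_500.0    # MTBF after a "doughnut spare" style repair
window = 720.0             # look ahead 30 days

# Renewable repair: the MTBF is unchanged, so the risk is unchanged.
p_renewable = failure_probability(original_mtbf, window)

# Non-renewable repair: the system works again, but the risk has grown.
p_degraded = failure_probability(degraded_mtbf, window)

print(f"Renewable repair:     {p_renewable:.1%} chance of failure in 30 days")
print(f"Non-renewable repair: {p_degraded:.1%} chance of failure in 30 days")
```

If the two probabilities differ after a repair, the assumptions behind your original MTBF numbers no longer hold.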
Sigmas and Nines
Six-sigma methodology is another popular trend that drives us to be data-focused and process-intense. The heart of six-sigma methodology is to measure something, find out where defects are being introduced, and then remove the source of the defects so that the resulting process shows no more than 3.4 defects per million opportunities for a defect (the conventional six-sigma target, nominally six standard deviations away from the mean). Though this methodology is most commonly used for manufacturing processes and hard goods, it has applicability to reliability of networked systems as well. Instead of thinking about six sigma as a search-and-destroy process for defects, think about it as a way of reducing variation.
- What are the values that users find most critical in your systems? Response time? Consistent behavior? Correct behavior? These are the critical-to-quality (CTQ) variables that you can measure.
- Define failures, or defects, based on these CTQs. If a transaction is expected to complete in 10 seconds, and it runs for 30 seconds but eventually completes correctly, is that a failure? Is it a defect?
- Can you relate these user CTQs to components in the system? Where are the defects introduced? What are the sources of variation, and how can you control those system components by changing their availability characteristics?
Six-sigma methodology can take your thinking about availability from the binary uptime-versus-downtime model to one in which you look at the user experience. It requires that you measure variables that are related to things you can control, reducing variability by removing the cause of defects in the process. If you’ve defined a long-running transaction as a defect, then capacity planning and resource allocation become part of your remediation. If that variability in response time is caused by system behavior during a failover, then you may have to design for a more complex recovery model.
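As a rough illustration of measuring defects against a CTQ, the sketch below counts a transaction as a defect if it returns a wrong result or exceeds a response-time limit, then expresses the result as defects per million opportunities. The `Transaction` type, the function names, the sample data, and the choice of one opportunity per transaction are all assumptions made for this example; the 10-second limit echoes the example in the list above.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    duration_s: float   # observed response time in seconds
    correct: bool       # did the transaction produce the right result?


def is_defect(tx: Transaction, max_duration_s: float = 10.0) -> bool:
    """A transaction is a defect if it is wrong or slower than the CTQ limit."""
    return (not tx.correct) or tx.duration_s > max_duration_s


def dpmo(transactions: list[Transaction], max_duration_s: float = 10.0) -> float:
    """Defects per million opportunities, one opportunity per transaction."""
    if not transactions:
        raise ValueError("need at least one transaction to measure")
    defects = sum(is_defect(tx, max_duration_s) for tx in transactions)
    return defects / len(transactions) * 1_000_000


# Hypothetical sample: one slow-but-correct transaction, one incorrect one.
sample = [
    Transaction(duration_s=2.0, correct=True),
    Transaction(duration_s=30.0, correct=True),   # completes, but too slowly
    Transaction(duration_s=4.0, correct=False),   # fast, but wrong
    Transaction(duration_s=8.0, correct=True),
]

print(f"DPMO: {dpmo(sample):,.0f}")
```

Whether the 30-second transaction counts as a defect depends entirely on how you define your CTQs; changing `max_duration_s` changes the answer.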
The Value of Availability
Fundamentally, high availability is a business decision. Computers cost money to operate. They cost additional money should they fail to operate when they are expected to. But the fundamental reason enterprises invest in computers (or anything else, for that matter) is to make them money. Computers enable an organization to perform tasks it could not perform without the computer. Computers can do things that people cannot; they do things faster and more cheaply and more accurately than people can. (Not everything, but many things.) When a computer is not performing the function for which it was purchased, it is not making its owners money; it is, instead, costing them money. Since downtime can, unchecked, go on forever, there is ostensibly no limit to the costs that a down computer might generate.
What Is High Availability?
There was a period of time when your authors debated taking the phrase “high availability” out of the title of this article. The argument for doing so was that the term had become so muddied by vendor marketing organizations it had lost all meaning. The argument against removing it was that there was no other term that so well summed up what we were trying to accomplish with the article. In the end, we decided that if we took “high availability” out of the title, nobody would ever be able to find the article, and if we let it stay, we would have the opportunity to define it ourselves. So we left it in. Think of it as a marketing decision.
If you ask around, you’ll find that there really is no hard definition for high availability or a firm threshold that determines whether or not a particular system has achieved it. Vendors have molded the term to fit their needs. Just about every system and OS vendor with a marketing department claims to deliver high availability in one form or another. The truth is that despite claims of 7 × 24 × whatever, or some number of nines, those claims mean remarkably little in practical day-to-day system availability.
The Storage Networking Industry Association (SNIA) has an excellent online technical dictionary (www.snia.org/dictionary), in which they define high availability as follows:
The ability of a system to perform its function continuously (without interruption) for a significantly longer period of time than the reliabilities of its individual components would suggest.
High availability is most often achieved through failure tolerance. High availability is not an easily quantifiable term. Both the bounds of a system that is called highly available and the degree to which its availability is extraordinary must be clearly understood on a case-by-case basis.
Availability is pretty clearly defined, but it’s high that is the problem. Is a 20-story building high? In Manhattan, Kansas, it would be, but in Manhattan, New York, a 20-story building is lost in the crowd. It’s very much a relative term. How high is up? How up is high? How available does something have to be for it to be highly available? Greater than normal? What is normal, and who defines it? Again, not much help in these definitions. Developing a practical definition for high availability will require still another approach.
Consider why someone implements a computer system. Someone spends money to purchase (or lease) a computer system. The goal, as it is with any business expenditure, is to get some sort of return, or value back, on that spending. Money that is spent with the intent of getting value back is an investment. The goal, then, is to achieve an appropriate return on the investment made to implement the computer system.
The return on that investment need not be directly monetary. In an academic environment, for example, the return may be educational. A computer science department at a university or high school buys computers with the noble goal of teaching its students how to use those computers. Of course, in the long run, a computer science department that develops a good reputation gets a financial return in increased attendance in classes and tuition.
The educational computers at a university would not be considered critical by most commercial enterprises, but if those computers are down for so much of the time during a semester that students are unable to complete their assignments, then the computers are not able to generate an appropriate return on the financial investment placed in them. If these outages occur often enough, and last long enough, the department may develop a reputation for having lousy computers, or lousy computer administration, which, in either case, reflects very poorly on the department and could, over time, affect enrollment. The same is true for any computers at any enterprise; computers that are down are not doing the job for which they were implemented.
Consider, then, that a system is highly available when it is available enough of the time to generate the return for which it was implemented in the first place. To be fair, it requires a clear vision of the future to know whether a system is adequately protected against all possible events, and that is plainly impossible. So, we consider high availability to be a design goal rather than an actual design. When a system is designed, it should be clear to its designers what requirements it has for availability. If the system is truly designed to those requirements, then the system is highly available. Our definition of high availability, therefore, is as follows:
High availability, n. A level of system availability implied by a design that is expected to meet or exceed the business requirements for which the system is implemented.
High availability, then, is a trade-off between the cost of downtime and the cost of the protective measures that are available to avoid or reduce downtime.
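The trade-off can be sketched as simple arithmetic. The example below is only an illustration under stated assumptions: it treats the cost of downtime as a constant hourly figure, expresses availability as a fraction of the year, and uses hypothetical dollar amounts and function names that do not come from the article.

```python
HOURS_PER_YEAR = 24 * 365


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def net_annual_benefit(
    availability_before: float,
    availability_after: float,
    downtime_cost_per_hour: float,
    measure_cost_per_year: float,
) -> float:
    """Downtime cost avoided minus the yearly cost of the protective measure.

    Assumes a constant hourly downtime cost, which is a simplification.
    """
    hours_saved = (
        annual_downtime_hours(availability_before)
        - annual_downtime_hours(availability_after)
    )
    return hours_saved * downtime_cost_per_hour - measure_cost_per_year


# Hypothetical scenario: moving from 99% to 99.9% availability.
benefit = net_annual_benefit(
    availability_before=0.99,
    availability_after=0.999,
    downtime_cost_per_hour=10_000.0,
    measure_cost_per_year=500_000.0,
)
print(f"Net annual benefit: ${benefit:,.0f}")
```

A positive result suggests the protective measure pays for itself under these assumptions; a negative one suggests the system may already be available enough for the business requirement it serves.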
