20 Key High Availability Design Principles 1
One of the basic rules of life in the 21st century is that quality costs money. Whether you are buying ice cream (“Do I want the Ben & Jerry’s at $4.00 per pint, or the store brand with the little ice crystals in it for 79 cents a gallon?”), cars (Rolls-Royce or Saturn), or barbecue grills, the higher the quality, the more it costs.
The decision to implement availability is a business decision. It comes down to dollars and cents. If you look at the business decision to implement availability purely as “how much will it cost me,” then you are missing half of the equation, and no solution will appear adequate. Instead, the decision must also consider how much the solution will save the business and balance the savings against its cost.
When considering the return on investment of a particular availability solution, look at how it will increase uptime and increase the value of your systems because they are up more. Consider the increase in productivity among the users of the critical system and how, because the system is up, they won’t have to work overtime or be idled during the workweek. There is no question that implementing a protective measure will cost money; the key is to balance that against the ROI. That model is not perfect because it requires making predictions, but educated guesses will get you figures that are close to reality. Over time, you can revisit the figures and improve them for the future.
It probably doesn’t make sense to spend a million dollars to protect a development system in most shops, although there are surely enterprises where that level of protection is necessary. The extra hardware and software required to implement a redundant data center, clustering, and mirrors will surely cost quite a bit of extra money. Hiring the skilled personnel who will make the right decisions on system implementation and management will cost extra money. But the hardware, software, and people, properly deployed, will save the company money in reduced downtime. The trick is to find the appropriate balance between cost and value.
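The balance between cost and savings described above can be sketched as a short calculation. This is only an illustration: the function name and every figure in it are hypothetical, and it assumes that an hour of downtime costs the same amount no matter how long the outage lasts, which is a simplification.

```python
def availability_roi(hours_saved_per_year, cost_per_hour_down, annual_cost):
    """Return (savings, net benefit, ROI ratio) for one protective measure.

    Assumes a constant cost per hour of downtime (a simplification).
    """
    savings = hours_saved_per_year * cost_per_hour_down
    net = savings - annual_cost
    roi = net / annual_cost
    return savings, net, roi


# Hypothetical educated guesses, to be revisited as real data comes in.
savings, net, roi = availability_roi(
    hours_saved_per_year=20,    # downtime the measure is expected to prevent
    cost_per_hour_down=10_000,  # lost productivity and revenue per hour
    annual_cost=80_000,         # hardware, software, and people
)
print(f"savings=${savings:,} net=${net:,} ROI={roi:.0%}")
```

A positive net benefit suggests the measure pays for itself; a negative one suggests the system may not warrant that level of protection.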
#19: Assume Nothing
Despite marketing claims to the contrary, high availability does not come with computer systems. Achieving production-caliber levels of end-to-end system availability requires effort directed at testing, integration, and application-level assessments. None of these things are done for you by the vendor directly out of the box. Very few products can simply be dropped into an environment and be expected to add quality of availability. In fact, without up-front engineering efforts and costs, the opposite is true: Poorly implemented new products can reduce overall system availability. Don’t expect product features that work in one situation to continue to operate in other, more complex environments. When you add reliability, you add constraints, and you’ll have to test and verify the new bounds of operation. Don’t assume that application developers are aware of or sensitive to your planned production environment or the operational rules that you’ll impose there. Part of the job of availability design is doing the shuttle diplomacy between application developers, operations staff, and network management crews. Beyond that, vendors do not just throw in clustering or replication software; it has to be purchased at additional cost above the systems that require protection. (To be fair, at least one OS vendor does include clustering software for free with their high-end OS. You can be certain that the price is just hidden in the price of the OS.) Without extra work and software, data will not get backed up, systems won’t scale as needs increase, and virus and worm protection will not be implemented. Achieving production-quality availability invariably requires careful planning, extra effort, and additional spending.
#18: Remove Single Points of Failure (SPOFs)
A single point of failure (SPOF) is a single component (hardware, firmware, software, or otherwise) whose failure will cause some degree of downtime. Although it’s a cliché, it’s an apt one: Think of the SPOF as the weakest link in your chain of availability. When that one link breaks, regardless of the quality of the rest of the chain, the chain is broken.
There are obvious potential SPOFs, such as servers, disks, network devices, and cables; most commonly these are protected against failure via redundancy. There are other, equally dangerous, second-order SPOFs that also need attention. Walk through your entire execution chain, from disk to system to application to network and client, and identify everything that could fail: applications, backups, backup tapes, electricity, the physical machine room, the building, interbuilding ways used for network cable runs, wide area networks, and Internet service providers (ISPs). Reliance on external services, such as Dynamic Host Configuration Protocol (DHCP) or Domain Name System (DNS), can also be a SPOF.
After you have identified your SPOFs, make a concerted effort to eliminate as many of them as possible, by making them redundant or by getting them out of the chain, if possible. It is, in fact, not possible to remove every single SPOF. Ultimately, the planet Earth is a SPOF, but if the Earth suffered a catastrophic failure, you probably wouldn’t be thinking about your systems anymore. And no amount of preparation could protect you against that particular failure. On a more realistic level, if you run parallel wide area networks to connect a primary site with a secondary or disaster recovery site, the networks are very likely to run through the same building, conduit, or even shared cable somewhere upstream. Wide area bandwidth is generally leased from a small number of companies. WAN service providers are usually quite reluctant to admit this fact, but that is the way things are.
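The walk through the execution chain can be sketched as a simple inventory check. The component names and instance counts below are hypothetical, and the sketch assumes you have already counted how many instances of each component exist.

```python
# Hypothetical inventory: component -> number of instances in the chain.
chain = {
    "disk": 2,
    "server": 2,
    "application": 1,
    "network switch": 2,
    "WAN link": 2,
    "DNS": 1,
    "machine room power": 1,
}


def find_spofs(inventory):
    """Return the components that have no redundant instance, in chain order."""
    return [name for name, count in inventory.items() if count < 2]


spofs = find_spofs(chain)
print(spofs)  # ['application', 'DNS', 'machine room power']
```

A count of two or more is only a first pass. As with the parallel WAN links that share a conduit upstream, two instances protect you only if they can fail independently, so each apparently redundant component still deserves a closer look.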
#17: Enforce Security
Entire articles have been written, and multiday seminars taught, on maintaining a high level of system security. It is not our intent to replace any of them. However, making your systems secure is still a fundamental element of achieving overall system availability.
The Unix tool sudo can be used to limit privileged access on Unix systems, so that DBAs and other users who need privileged access can have it even while their actions are restricted to a fixed set of commands, and their actions are logged. Similar functionality can be achieved on Windows through several mechanisms, including the Windows Scripting Host (WSH). Some Windows administrative commands have a Run As option on the right-click menu that allows changing users.
Use firewalls. A fundamental way to keep unwelcome visitors out of your network and off of your critical systems is through the use of firewalls. Firewalls are not the be-all and end-all of security, as there are almost always ways to sneak through them, but they are an effective starting point.
Enforce good password selection. Lots of freeware and shareware utilities enforce good password selection. Some of them attempt to crack passwords through brute-force methods, using common words and variations on those words. Others do their work when the passwords are created to ensure that a sufficient variety in characters is used, thus making the password difficult for a guesser to figure out. Another common tool is password aging, where users are forced to change their passwords every 30 or 90 days, and they cannot reuse any password from the last set of 5 or 10. Beware, though, when combining password aging with other requirements; you may unintentionally force users to write down their passwords.
Change default system passwords. Old Digital Equipment Corporation (DEC) Virtual Memory System (VMS) systems had a default password set for the administrator (privileged) account. Since the world was not networked, this was not a big security hole, but with the Internet, it becomes frighteningly easy to access systems on other networks, especially when firewalls are not adequately deployed. Some versions of Microsoft SQL Server have a default password for the administrator account; it is critical that this password be changed to keep hackers out of your systems.
Train your users about basic system security rules. Not long ago, there was a TV commercial where a user proudly says to her system administrator, “I opened that virus just like you told us not to.” Users need to be instructed in the rules of basic security. They should not open email attachments from unknown sources, and they should be very careful opening attachments from known sources. They need to be wary of attachments with certain file extensions. They should never give out or write down their passwords.
Delete logins and files of ex-employees after they leave. Beware of a disgruntled employee who departs her employment on bad terms. If she is clever and really angry, she may leave behind a time bomb (a program that destroys files or other data at a later date) or some sort of back door permitting her to access your systems at a later date. Don’t just delete files from the ex-employee’s home directory or primary server; search the network for other files that she may have owned.
Use virus checkers, and keep them up-to-date. Like firewalls, a virus checker is a good first step in protecting your system, but if you don’t keep the virus definitions up-to-date, a newly released virus can still incapacitate your systems. There was a time when the conventional wisdom was that you needed to update your virus definitions monthly. Today, conventional wisdom says that you should update your virus definitions no less than weekly, and more often when you hear of a new virus being released. Unix and its variants (including Linux) are much more virus-resistant than Windows systems are; generally, Unix administrators do not need to be nearly as concerned with viruses as Windows administrators do.
Check the Web for alerts. Sometimes you can keep up on security issues and new viruses by reading the news, but in general, you need to keep an eye on web sites like www.cert.org where you can get reliable, unbiased, and timely security information.
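As a rough illustration of the password-selection checks described above, the following sketch tests a new password for character variety and for reuse at creation time. The function name, the thresholds, and the plain-text history list are all assumptions made for the example; a real system would compare password hashes rather than store old passwords in the clear.

```python
import string


def password_problems(candidate, recent_passwords, min_length=8, min_classes=3):
    """Return a list of reasons to reject a new password (empty if acceptable).

    The thresholds are hypothetical defaults, not a recommended policy.
    """
    problems = []
    if len(candidate) < min_length:
        problems.append("too short")

    # Count how many character classes appear at least once.
    classes = [
        any(c.islower() for c in candidate),
        any(c.isupper() for c in candidate),
        any(c.isdigit() for c in candidate),
        any(c in string.punctuation for c in candidate),
    ]
    if sum(classes) < min_classes:
        problems.append("not enough character variety")

    # Password aging: refuse anything from the recent history (last 5 or 10).
    if candidate in recent_passwords:
        problems.append("reuses a recent password")
    return problems


print(password_problems("abc", []))
print(password_problems("Tr1cky-Pass", ["Tr1cky-Pass"]))
```

Keeping the rules few and explainable matters here: the more requirements you stack on top of aging, the more likely users are to write their passwords down.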
