Keep your production and development environments separate
Not only do crisis plans need to be tested; so do all new applications, system software, hardware modifications, and pretty much any change at all. Ideally, testing should take place in an environment that is as similar to the operational one as possible, with as much of the same hardware, networks, and applications as possible. Even better, the same users should perform the tests. The tests need to be performed with the same network configuration and loads as production, and with the same user inputs and application sets. User simulation and performance modeling tools may help to generate the first few tests in a quality assurance environment, but you’re going to want to test an application in an environment that looks and feels live before really turning the system on.
Be sure that you test at the unit level (what if this disk fails?); at the subassembly or subsystem level (what if we unplug this disk array or pull out this network connection?); at the complete system or application layer; and then on a full end-to-end basis.
Testing should always be based on a clear and well-thought-out test plan, with specific goals, and acceptance and failure criteria. All test plans should be documented so that they can be referenced in later tests and so that tests can be researched without having to be duplicated. Significant testing efforts should be followed by postmortem meetings, where the test results, and the test procedures themselves, can be discussed.
When considering the results of tests, always remember what was being tested. A common mistake in testing is the assumption that if the test didn’t go well, it was the people who messed up. People are people; if the test goes badly, then it is the test that has failed and that must be changed.
Tests need to be repeated on a regular basis. Environments change, and the only way to be sure that everything works and works together is to test it. Otherwise problems will occur in production that will inevitably lead to downtime. Adopting a regular testing policy is akin in popularity to regular rereadings of Moby Dick. But no matter how unpopular or distasteful documentation and testing are, they are still preferable to downtime. You’ll catch many problems in the earlier tests that may be buried or harder to find in the more complex, later tests, or that don’t turn up until production rollout.
One way to increase the value of testing is to automate it. Automated testing can provide a level of thoroughness that is difficult to achieve through manual testing. Testers can get bored or tired, and they may not test every permutation of testing parameters, especially after several go-rounds of testing. Many tools are commercially available to help automate testing; these tools do not get bored.
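The point about covering every permutation of testing parameters can be illustrated with a minimal Python sketch. The parameter names, their values, and the `example_test` stand-in are all hypothetical; a real suite would replace them with checks against the actual system.

```python
from itertools import product

# Hypothetical test dimensions; replace with the parameters that matter for your system.
PARAMETERS = {
    "failed_component": ["none", "disk", "network_link"],
    "load": ["idle", "typical", "peak"],
    "client": ["browser", "batch_job"],
}

def all_test_cases(parameters):
    """Return one dict per combination of parameter values."""
    names = list(parameters)
    return [dict(zip(names, values)) for values in product(*parameters.values())]

def run_suite(parameters, run_test):
    """Run run_test on every combination and return the cases that failed."""
    return [case for case in all_test_cases(parameters) if not run_test(case)]

def example_test(case):
    # Stand-in for a real check: pretend the system fails only
    # when a disk is lost while the load is at its peak.
    return not (case["failed_component"] == "disk" and case["load"] == "peak")

failures = run_suite(PARAMETERS, example_test)
for case in failures:
    print("FAILED:", case)
```

Unlike a tired tester, the loop visits every combination on every run, so a failure that appears only under one specific mix of conditions is not skipped.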
#9: Separate Your Environments
Keep your production and development environments separate and independent of each other: not just the servers, but the networks and the users. Development users should never be permitted routine access to production systems, except during the first few days of a new rollout (and even then, only under carefully controlled and monitored conditions) and when a critical problem occurs that requires their attention. Without separate environments, change control cannot be enforced in your production environment. Ideally, a well-designed site should contain as many as six different environments:
Production. In production, changes are made only with significant controls. Everything in production needs to work all the time. If something doesn’t work, there must be a way to roll back to a version that did work. Changes must be made smoothly, and with a minimum of interruption to the production workflow. Any changes should go through a change control committee, as described previously.
Production mirror. The production mirror is more properly called a production time warp. It should be an accurate reflection of the production environment as it was two or three weeks ago. Any updates to production, whether approved or otherwise, should be applied to the mirror after an agreed-upon period of time has passed. This environment permits production to roll back to a working production environment in the event that defective software is installed in production. As noble an idea as this is, it is very seldom used in practice.
Quality assurance (QA). This is a true test environment for applications that are believed to be production-ready. Quality assurance is the last stop for code before it goes live. Changes made to this environment should be as well controlled as changes made to production or to the production mirror.
Development. Clearly this environment is for works in progress. Changes need not be monitored very closely or approved with the same degree of procedure as in production, but at the same time, the environment must be maintained properly and cleanly for developers to work. In development, code can be installed and run that may not work 100 percent. It may be genuinely buggy. That’s okay here, as long as the software is running only in development and its bugginess does not affect the other environments in any way.
Laboratory. Often called a sandbox, the laboratory is a true playground. This is where SAs get to play with new hardware, new technologies, new third-party applications, and whatever else comes along. An interesting side benefit of a good lab environment is that it works as a change of pace for your system administrators. With all the automation and procedures that are in place on many production systems, it’s healthy to allow a bored or burned-out SA to have a place to go where the rules aren’t so strict. The lab may contain new cutting-edge equipment or be used to reproduce and solve thorny production problems in a nonproduction environment. Labs can be a wonderful opportunity for your SAs to develop some real hands-on experience and to exercise their coding or hardware skills.
Disaster recovery/business contingency site. This site is located some distance, possibly thousands of miles, away from the main production site. In the event that some major catastrophe takes out the production site, a reasonably quick switchover can be made and production can resume from the disaster recovery site.
Not every site will need all six environments. If no code is developed in-house, then you may not need a development environment, for example. You still should test externally developed applications before implementing them, but that integration function may be combined with a QA environment in a single staging area.
The production mirror is a luxury that many sites cannot afford, so if it is used at all, it is most often combined with the DR site, and not maintained two or three weeks in the past as discussed previously. Combining these two environments does introduce some risk, of course, and limits the ability of the DR site to recover from a bad change that has been made to production.
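For sites that do keep a time-lagged production mirror, the rule that updates reach the mirror only after an agreed-upon period can be sketched in a few lines of Python. This is only an illustration: the 14-day lag is one reading of "two or three weeks", and the change IDs, dates, and the function name `changes_due_for_mirror` are made up for the example.

```python
from datetime import date, timedelta

# Assumption: the agreed-upon lag is 14 days.
MIRROR_LAG = timedelta(days=14)

def changes_due_for_mirror(production_changes, already_mirrored, today):
    """Return IDs of production changes old enough to apply to the mirror.

    production_changes maps a change ID to the date it went into production.
    Changes are returned oldest first so they can be replayed in order.
    """
    cutoff = today - MIRROR_LAG
    ordered = sorted(production_changes.items(), key=lambda item: item[1])
    return [
        change_id
        for change_id, applied_on in ordered
        if applied_on <= cutoff and change_id not in already_mirrored
    ]

# Hypothetical change log.
production_changes = {
    "CHG-101": date(2024, 3, 1),
    "CHG-102": date(2024, 3, 10),
    "CHG-103": date(2024, 3, 20),
}
due = changes_due_for_mirror(production_changes, {"CHG-101"}, today=date(2024, 3, 25))
print(due)
```

Here only CHG-102 is due: CHG-101 is already on the mirror, and CHG-103 has not yet aged past the lag, so the mirror still reflects production as it was before that change.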
#8: Learn from History
To see what changes will make your system more resilient, you need to look at its recent history. Why does your system go down? What are the most common causes? Don’t just rely on anecdotal information (“Well, it sure does seem to go down a lot on Thursdays”). Keep real records. Check them often. Look for patterns. (Maybe the system really does go down a lot on Thursdays. Now you need to figure out why!)
Maintain availability statistics over time. Account for every incident that causes downtime. Understand the root cause of the failure and the steps needed to fix it. If you’ve invested in failure isolation, you should have an easier time finding root causes. Look closely at the MTTR, and see if something can be done to improve it. Did it take two days for key parts to arrive? Was there a single person whose expertise was critical to the repair process, and was that person unavailable?
Use the evaluations to identify your most common problems, and attack them first. Don’t waste your time protecting the CPU if the most common cause of downtime is an application bug. The old 80/20 rule of thumb works here. Roughly 80 percent of your downtime is likely to be caused by 20 percent of the problems. Fix that 20 percent and you should see a tremendous improvement in availability. Then you should be able to reapply the 80/20 rule to the downtime that remains.
The other quote that we considered for this principle is the one from the mutual fund commercials: “Past performance is no guarantee of future success.” However, that is not quite the message that we wanted to get across. On the other hand, that quote does help prove the adage that for every saying, there is an equal and opposite saying. Consider the combination of “look before you leap” and “he who hesitates is lost” or “out of sight, out of mind” and “absence makes the heart grow fonder.”
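As a rough sketch of what keeping real records might look like, the Python below takes a small incident log, computes a mean time to repair, and ranks root causes by their share of total downtime. The incident data is invented for illustration, and the sketch assumes that each incident's downtime equals its repair time, which will not hold everywhere.

```python
from collections import defaultdict

# Hypothetical incident log: (root cause, hours of downtime).
incidents = [
    ("application bug", 6.0),
    ("application bug", 4.5),
    ("disk failure", 1.5),
    ("application bug", 5.0),
    ("network misconfiguration", 2.0),
    ("operator error", 1.0),
]

def mean_time_to_repair(log):
    """Average hours per incident, treating downtime as repair time."""
    return sum(hours for _, hours in log) / len(log)

def downtime_by_cause(log):
    """Return (cause, hours, cumulative share of downtime), largest cause first."""
    totals = defaultdict(float)
    for cause, hours in log:
        totals[cause] += hours
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    report = []
    cumulative = 0.0
    for cause, hours in ranked:
        cumulative += hours
        report.append((cause, hours, cumulative / grand_total))
    return report

mttr = mean_time_to_repair(incidents)
report = downtime_by_cause(incidents)
print(f"MTTR: {mttr:.2f} hours")
for cause, hours, share in report:
    print(f"{cause:26s} {hours:5.1f} h   cumulative {share:.0%}")
```

In this made-up log a single cause, the application bug, accounts for more than three quarters of the downtime, which is the kind of pattern the 80/20 rule says to look for and attack first.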
#7: Design for Growth
Experience tells us that system use always expands to fill system capacity. Whether it’s CPU, memory, I/O, or disk space, it will all get consumed. This means that the 2TB of disk you just bought for that server will be all used up in a few months. That’s just the way of the world. If you go into the design of a computer system with this experience in mind, you will build systems with room for easy growth and expansion.
If you need 8 CPUs in your server, buy a 16-CPU server, and only put 8 CPUs in it. If you buy and fill an 8-CPU server, when it’s time to add more CPUs, you may have to purchase a whole new server, or at least additional system boards (if there is room for them) to obtain the additional capacity. Some system vendors will even put the extra 8 CPUs in your server and leave them disabled until you need them. When you need them, the vendor will activate them and charge you for them then. (Beware of this practice: You may be buying hardware that will be outmoded by the time you actually need it.) If you buy a large disk array completely full of disks, when it is time to expand your disk capacity, even by a single disk, you will need to buy another array. If you find that you don’t have enough I/O slots in the server, you’re in trouble.
The industry buzzword for this is scalability. Make sure that your systems will scale as your requirements for them grow. The incremental cost of rolling in a new system or storage frame is considerable. The downtime implications of adding new frames or adding boards to the backplane will also be significant. By spending the extra money up front for the additional capacity, you can save yourself both downtime and cost when you need to grow the systems.
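To put a rough number on how quickly capacity gets consumed, the Python sketch below estimates how many months remain before a disk fills. It assumes usage compounds at a constant monthly rate, which is a simplification, and the function name and figures are illustrative rather than measurements.

```python
import math

def months_until_full(capacity_tb, used_tb, monthly_growth_rate):
    """Months until used space reaches capacity under compound monthly growth.

    Assumes usage grows by the same fraction every month; real growth
    is rarely that regular, so treat the result as a planning estimate.
    """
    if used_tb >= capacity_tb:
        return 0
    if used_tb <= 0 or monthly_growth_rate <= 0:
        raise ValueError("used space and growth rate must be positive")
    months = math.log(capacity_tb / used_tb) / math.log(1 + monthly_growth_rate)
    return math.ceil(months)

# Illustrative figures: a 2TB volume, 0.5TB in use, growing 25 percent a month.
print(months_until_full(capacity_tb=2.0, used_tb=0.5, monthly_growth_rate=0.25))
```

Running the same estimate against a larger capacity shows how much time the extra headroom buys, which can be weighed against the cost and downtime of adding a new frame later.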
