Keep your production and development environments separate

An article added by Ben Smeider on 11/27/2007



#10: Test Everything

Not only do crisis plans need to be tested; so do all new applications, system software, hardware modifications, and pretty much any change at all. Ideally, testing should take place in an environment as similar to the operational one as possible, with as much of the same hardware, networks, and applications as possible. Even better, the same users should perform the tests. The tests need to be performed with the same production network configuration and loads, and with the same user inputs and application sets. User simulation and performance modeling tools may help to generate the first few tests in a quality assurance environment, but you’re going to want to test out an application in an environment that looks and feels live before really turning the system on.

Be sure that you test at the unit level (what if this disk fails?); at the subassembly or subsystem level (what if we unplug this disk array or pull out this network connection?); at the complete system or application layer; and then on a full end-to-end basis. You’ll catch many problems in the earlier tests that may be buried or harder to find in the more complex, later tests, or that don’t turn up until production rollout.

Testing should always be based on a clear and well-thought-out test plan, with specific goals and with acceptance and failure criteria. All test plans should be documented so that they can be referenced in later tests and so that earlier tests can be researched without having to be repeated. Significant testing efforts should be followed by postmortem meetings, where the test results, and the test procedures themselves, can be discussed.

When considering the results of tests, always remember what was being tested. A common mistake in testing is the assumption that if the test didn’t go well, it was the people who messed up. People are people; if the test goes badly, then it is the test that has failed and that must be changed.

Tests need to be repeated on a regular basis. Environments change, and the only way to be sure that everything works and works together is to test it. Otherwise problems will occur in production that will inevitably lead to downtime. Adopting a regular testing policy is akin in popularity to regular rereadings of Moby Dick. But no matter how unpopular or distasteful documentation and testing are, they are still preferable to downtime.

One way to increase the value of testing is to automate it. Automated testing can provide a level of thoroughness that is difficult to achieve through manual testing. Testers can get bored or tired, and they may not test every permutation of testing parameters, especially after several go-rounds of testing. Many tools are commercially available to help automate testing; these tools will not have boredom issues.
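The point about testing every permutation of parameters can be illustrated with a short Python sketch. Everything in it is hypothetical: the parameter names and values, and the run_case callback that stands in for whatever actually exercises your system. It is not a substitute for a commercial test tool, only an illustration of why a program does not skip combinations the way a tired tester might.

```python
from itertools import product

# Hypothetical test dimensions; replace them with the ones that matter at your site.
PARAMETERS = {
    "failure": ["disk", "disk_array", "network_link"],
    "load": ["idle", "typical", "peak"],
    "client": ["batch", "interactive"],
}


def all_test_cases(parameters):
    """Return every combination of parameter values as a list of dicts."""
    names = list(parameters)
    return [dict(zip(names, values)) for values in product(*parameters.values())]


def run_all(parameters, run_case):
    """Run run_case on every combination and return the cases that failed.

    run_case is a caller-supplied function (an assumption of this sketch)
    that takes one case dict and returns True when the test passes.
    """
    return [case for case in all_test_cases(parameters) if not run_case(case)]


if __name__ == "__main__":
    # Stand-in for a real test: pretend a network failure at peak load breaks things.
    def fake_run_case(case):
        return not (case["failure"] == "network_link" and case["load"] == "peak")

    print(len(all_test_cases(PARAMETERS)), "cases generated")
    for failed in run_all(PARAMETERS, fake_run_case):
        print("FAILED:", failed)
```

The number of cases grows multiplicatively with each new dimension, which is exactly why manual testers tend to skip some of them.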

#9: Separate Your Environments

Keep your production and development environments separate and independent of each other: not just the servers, but the networks and the users as well. Development users should never be permitted routine access to production systems. The only exceptions are the first few days of a new rollout (and even then, only under carefully controlled and monitored conditions) and a critical problem that requires their attention. Without separate environments, change control cannot be enforced in your production environment. Ideally, a well-designed site should contain as many as six different environments:

Production. In production, changes are made only with significant controls. Everything in production needs to work all the time. If something doesn’t work, there must be a way to roll back to a version that did work. Changes must be made smoothly, and with a minimum of interruption to the production workflow. Any changes should go through a change control committee, as described previously.

Production mirror. The production mirror is more properly called a production time warp. It should be an accurate reflection of the production environment as it was two or three weeks ago. Any updates to production, whether approved or otherwise, should be applied to the mirror after an agreed-upon period of time has passed. This environment permits production to roll back to a working state in the event that defective software is installed in production. As noble an idea as this is, it is very seldom used in practice.

Quality assurance (QA). This is a true test environment for applications that are believed to be production-ready. Quality assurance is the last stop for code before it goes live. Changes made to this environment should be as well controlled as changes made to production or to the production mirror.

Development. Clearly this environment is for works in progress. Changes need not be monitored as closely or approved with the same degree of procedure as in production, but at the same time, the environment must be maintained properly and cleanly so that developers can work. In development, code can be installed and run that may not work 100 percent. It may be genuinely buggy. That’s okay here, as long as the software is running only in development and its bugginess does not affect the other environments in any way.

Laboratory. Often called a sandbox, the laboratory is a true playground. This is where SAs get to play with new hardware, new technologies, new third-party applications, and whatever else comes along. An interesting side benefit of a good lab environment is that it works as a change of pace for your system administrators. With all the automation and procedures that are in place on many production systems, it’s healthy to allow a bored or burned-out SA to have a place to go where the rules aren’t so strict. The lab may contain new cutting-edge equipment, or it may be used to duplicate and solve thorny production problems in a nonproduction environment. Labs can be a wonderful opportunity for your SAs to develop some real hands-on experience and to exercise their coding or hardware skills.

Disaster recovery/business contingency site. This site is located some distance, possibly thousands of miles, away from the main production site. In the event that some major catastrophe takes out the production site, a reasonably quick switchover can be made and production can resume from the disaster recovery (DR) site.

Not every site needs all six environments. If no code is developed in-house, for example, then you may not need a development environment. You should still test externally developed applications before implementing them, but that integration function may be combined with a QA environment in a single staging area.

The production mirror is a luxury that many sites cannot afford, so if it is used at all, it is most often combined with the DR site, and not maintained two or three weeks in the past as discussed previously. Combining these two environments does introduce some risk, of course, and limits the ability of the DR site to recover from a bad change that has been made to production.
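The access rule and the change control rule described in this section can be written down as a small policy check. The following Python sketch is only an illustration: the environment identifiers, function names, and flags are all hypothetical, and the text does not say how development access to the mirror, QA, or DR environments should be handled, so the sketch restricts production only and says so in a comment.

```python
# Hypothetical identifiers for the six environments named in the text.
ENVIRONMENTS = (
    "production",
    "production_mirror",
    "qa",
    "development",
    "laboratory",
    "disaster_recovery",
)

# Environments whose changes the text says must be tightly controlled.
# The text does not state a rule for the DR site, so it is left out here.
CONTROLLED = {"production", "production_mirror", "qa"}


def _check(environment):
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")


def needs_change_control(environment):
    """True when changes to this environment require formal change control."""
    _check(environment)
    return environment in CONTROLLED


def developer_may_access(environment, rollout_window=False, critical_incident=False):
    """Decide whether a development user may log in to an environment.

    Assumption of this sketch: only production itself is restricted.
    rollout_window    -- within the first few days of a new rollout
    critical_incident -- a critical problem requires developer attention
    """
    _check(environment)
    if environment != "production":
        return True
    return rollout_window or critical_incident


if __name__ == "__main__":
    print(developer_may_access("production"))                          # routine access
    print(developer_may_access("production", critical_incident=True))  # emergency access
    print(needs_change_control("laboratory"))
```

Even when an exception applies, the text calls for carefully controlled and monitored conditions, so a real implementation would also log and time-limit any access it grants.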

#8: Learn from History

To see what changes will make your system more resilient, you need to look at its recent history. Why does your system go down? What are the most common causes? Don’t just rely on anecdotal information (“Well, it sure does seem to go down a lot on Thursdays”). Keep real records. Check them often. Look for patterns. (Maybe the system really does go down a lot on Thursdays. Now you need to figure out why!)

Maintain availability statistics over time. Account for every incident that causes downtime. Understand the root cause of the failure and the steps needed to fix it. If you’ve invested in failure isolation, you should have an easier time finding root causes. Look closely at the MTTR, and see if something can be done to improve it. Did it take two days for key parts to arrive? Was there a single person whose expertise was critical to the repair process, and was that person unavailable?

Use the evaluations to identify your most common problems, and attack them first. Don’t waste your time protecting the CPU if the most common cause of downtime is an application bug. The old 80/20 rule of thumb works here. Roughly 80 percent of your downtime is likely to be caused by 20 percent of the problems. Fix that 20 percent and you should see a tremendous improvement in availability. Then you should be able to reapply the 80/20 rule to the downtime that remains.

The other quote that we considered for this principle is the one from the mutual fund commercials: “Past performance is no guarantee of future success.” However, that is not quite the message that we wanted to get across. On the other hand, that quote does help prove the adage that for every saying, there is an equal and opposite saying. Consider the combination of “look before you leap” and “he who hesitates is lost” or “out of sight, out of mind” and “absence makes the heart grow fonder.”
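Keeping real records makes this kind of analysis almost mechanical. The Python sketch below uses a made-up incident log (the causes and hours are invented for illustration) to total downtime by root cause, compute a simple MTTR, and pick out the causes that account for roughly 80 percent of downtime. It assumes, for simplicity, that each incident's downtime equals its repair time.

```python
from collections import defaultdict

# Hypothetical incident log: (root cause, hours of downtime).
INCIDENTS = [
    ("application bug", 6.0),
    ("application bug", 4.0),
    ("disk failure", 1.0),
    ("operator error", 3.0),
    ("application bug", 5.0),
    ("network outage", 1.0),
]


def downtime_by_cause(incidents):
    """Total downtime per root cause, largest first."""
    totals = defaultdict(float)
    for cause, hours in incidents:
        totals[cause] += hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def mean_time_to_repair(incidents):
    """Average hours per incident (assumes downtime equals repair time)."""
    if not incidents:
        raise ValueError("no incidents recorded")
    return sum(hours for _, hours in incidents) / len(incidents)


def causes_covering(incidents, share=0.8):
    """Smallest set of top causes that accounts for at least `share` of downtime."""
    ranked = downtime_by_cause(incidents)
    total = sum(hours for _, hours in ranked)
    selected, covered = [], 0.0
    for cause, hours in ranked:
        if covered >= share * total:
            break
        selected.append(cause)
        covered += hours
    return selected


if __name__ == "__main__":
    for cause, hours in downtime_by_cause(INCIDENTS):
        print(f"{cause:20s} {hours:5.1f} h")
    print("MTTR (hours):", round(mean_time_to_repair(INCIDENTS), 2))
    print("Attack first:", causes_covering(INCIDENTS))
```

With this invented data, two of the four causes account for most of the downtime, so those are the ones to attack first.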

#7: Design for Growth

Experience tells us that system use always expands to fill system capacity. Whether it’s CPU, memory, I/O, or disk space, it will all get consumed. This means that the 2TB of disk you just bought for that server will be all used up in a few months. That’s just the way of the world. If you go into the design of a computer system with this experience in mind, you will build systems with room for easy growth and expansion.

If you need 8 CPUs in your server, buy a 16-CPU server, and only put 8 CPUs in it. If you buy and fill an 8-CPU server, when it’s time to add more CPUs, you may have to purchase a whole new server, or at least additional system boards (if there is room for them) to obtain the additional capacity. Some system vendors will even put the extra 8 CPUs in your server and leave them disabled until you need them. When you need them, the vendor will activate them and charge you for them then. (Beware of this practice: You may be buying hardware that will be outmoded by the time you actually need it.) If you buy a large disk array completely full of disks, when it is time to expand your disk capacity, even by a single disk, you will need to buy another array. If you find that you don’t have enough I/O slots in the server, you’re in trouble.

The industry buzzword for this is scalability. Make sure that your systems will scale as your requirements for those systems scale. The incremental cost of rolling in a new system or storage frame is considerable. The downtime implications of adding new frames or adding boards to the backplane will also be significant. By spending the extra money up front for the additional capacity, you can save yourself both downtime and cost when you need to grow the systems.
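A rough projection can show how quickly headroom disappears. The Python sketch below assumes compound monthly growth, which is an assumption of this example and not a claim from the text; the starting usage and growth rate are invented figures, and real growth is rarely this regular.

```python
import math


def months_until_full(capacity_tb, used_tb, monthly_growth_rate):
    """Whole months until usage reaches capacity under compound growth.

    Assumes usage grows by a fixed fraction each month, so that
    used_tb * (1 + monthly_growth_rate) ** months >= capacity_tb.
    """
    if capacity_tb <= 0 or used_tb <= 0 or monthly_growth_rate <= 0:
        raise ValueError("capacity, usage, and growth rate must be positive")
    if used_tb >= capacity_tb:
        return 0
    months = math.log(capacity_tb / used_tb) / math.log(1 + monthly_growth_rate)
    return math.ceil(months)


if __name__ == "__main__":
    # Hypothetical figures: a 2TB array, 0.5TB in use, growing 25 percent a month.
    print(months_until_full(2.0, 0.5, 0.25), "months until the 2TB array is full")
    # The same workload on an array bought with twice the capacity.
    print(months_until_full(4.0, 0.5, 0.25), "months until a 4TB array is full")
```

Note that under compound growth, doubling the capacity does not double the time until it is full, which is one more reason to revisit the projection regularly.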
