Documentation provides audit trails to work that has been completed

an article added by: Ben Smeider at 11272007



#13: Document Everything

The importance of good, solid documentation cannot be overstated. Documentation provides an audit trail for work that has been completed. It guides future system administrators so that they can take over systems that existed before they arrived. It gives the system administrator and his management a record of accomplishments. (These can be very handy at personnel review time.) Good documentation can also help with problem solving. It serves three audiences:

1. The first audience is the author himself. Documentation aimed at the software author makes it much easier for the author to go back and debug problems in his own code that may occur years later. When you look back at your own two- or three-year-old work (system or network design or programs, for example), it can be almost impossible to remember why you did something a particular way, and not another way, despite having excellent reasons at the time. Write comments in your code. Write manuals, even if they are just a couple of pages long.

2. The second audience is whoever will maintain the systems in the future. The people who maintain the systems and the applications today aren’t going to be around forever, and you can’t assume that any experienced personnel will be around to mentor newcomers. You must prepare for a massive exodus of your experienced people. A popular sign on many system administrators’ desks reads: “Remember, 100 years from now, all new people!”

3. The third audience is management. Keeping good notes and documenting your work helps demonstrate to management that you have been diligent and shows what measures have been taken to keep the systems running and productive. If a system does not meet the requirements set forth in the documentation, it should be easy to figure out why and to determine what must be done to make the system compliant. Even if your management does not understand the value of the services you provide to the organization, they will usually understand the value when it is presented as a thick stack of documentation.

Make sure that documentation is stored on paper too. If the system is down when you need the manuals, you’re not going to be able to get to them. Keep them in binders so that they can be easily changed and updated. After the documentation is written, don’t forget to review it and update it on a regular basis. Bad documentation is worse than none at all. In a crisis, especially if the most knowledgeable people are not around, documentation is likely to be followed verbatim. If it’s wrong, things can go from bad to worse.

A common question about preparing documentation is at what technical level it should be written. One school of thought says that documentation should be written so that anyone in the organization, from a janitor to the CEO, can follow it and, if required, bring the systems up after a disaster. After all, the system administrative staff may not be available at the time, and someone has to do it.

While that is a noble goal, it is an impractical one. The state of the critical systems changes all the time. The more detail that is included in that documentation, the quicker it becomes out-of-date. In order to write documentation so that an untrained smart person could bring systems up or do other critical work on them in time of crisis, it would have to be so detailed as to be unmanageable.

If the task at hand is to edit a file to change one variable to another, for an appropriately trained system administrator, the documentation would only need to say, “Change ABC to XYZ in /directory/file.txt.” For an untrained smart person, the documentation would need to say:

1. Change directory to /directory by typing “cd /directory”.

2. Open the vi text editor by typing “vi file.txt”.

3. Move the cursor down to line 300 by typing “300G”.

and so on (in whatever operating environment is appropriate). Every time the line number in the file changed, the documentation would have to be changed. Many error conditions would need to be handled for the inexperienced person as well; for example, if the terminal type or line count is set wrong, the vi editor may have trouble dealing with it. In that event, what should the inexperienced person do?

The right answer, therefore, is to write documentation so that an experienced system administrator could use it. Don’t expect the administrator to have a great deal of experience in your shop, just general experience. In a Windows environment, target a Microsoft Certified Systems Engineer (MCSE). In Unix environments, target experienced system administrators. If something happens to the current system administrative staff, you can reasonably assume that they will be replaced by experienced people.

Documentation is a lot like a fine piece of literature. Everybody wants to say that they have read Moby Dick, but nobody wants to actually sit down and read Moby Dick.

#12: Employ Service Level Agreements

Before disaster strikes, many organizations put written agreements in place with their user community to define the levels of service that will be provided. Some agreements include penalties for failing to meet, and rewards for greatly exceeding, agreed-to service levels. Service-level agreements might deal with the following areas:

Availability levels. What percentage of the time are the systems actually up? Or, how many outages are acceptable during a given service period? And how long can each outage be?

Hours of service. During what hours are the systems actually critical? On what days of the week? What about major holidays? What about lesser holidays? What about weekend holidays?

Locations. Are there multiple locations? Can all locations expect the same levels of service? What about locations that do not have on-site staff?

Priorities. What if more than one system is down at the same time? Which group of users gets priority?

Escalation policy. What if agreements cannot be met? Who gets called? After how much time has elapsed? What are the ramifications?

Limitations. Some components of service delivery will be outside local control. Do those count against availability guarantees anyway?

These agreements are usually the result of considerable negotiations between (and often significant pain to) both customer and service provider.

When designing SLAs, beware of ones that commit you to deliver nines of availability (for example, 99.99 percent uptime). You are betting the success of your job (if not the job itself) on whether or not the system will go down. Nines are a sensible approach only if you know exactly how many times the system will go down over a given period of time. Unfortunately, though, managers, especially non-technical managers, really like to use the nines as an approach to SLAs because they are simple, straightforward, and easily measurable. They are also easy to explain to other non-technical people. If your performance is being judged on your ability to fulfill SLAs, avoid nines-based SLAs.

A much more sensible way to design an SLA is to discuss specific types of failures and outages, and how long the system will be down as a result of them. When estimating downtimes, be conservative; everything takes longer than you think it will.
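
To make the contrast concrete, here is a minimal Python sketch of both approaches. The first function converts a number of nines into the downtime it permits per year; the second adds up downtime by failure type and pads the total with a safety factor, in the spirit of being conservative. The failure types, their yearly counts, the hours per occurrence, and the 1.5 safety factor are all hypothetical values chosen for illustration, not figures from any real system.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(nines: int) -> float:
    """Hours of downtime per year permitted by an SLA of the given nines.

    Two nines means 99 percent availability, three nines 99.9 percent, etc.
    """
    unavailable_fraction = 10 ** (-nines)
    return HOURS_PER_YEAR * unavailable_fraction


# Hypothetical failure types, for illustration only:
# name -> (expected occurrences per year, hours of downtime per occurrence)
FAILURE_ESTIMATES = {
    "disk failure": (2, 1.5),
    "operating system panic": (4, 0.25),
    "scheduled maintenance overrun": (1, 3.0),
}


def estimated_downtime_hours(estimates, safety_factor=1.5):
    """Sum downtime across failure types, padded by a safety factor.

    The safety factor is an assumption that reflects the advice to be
    conservative, because everything takes longer than expected.
    """
    raw_total = sum(count * hours for count, hours in estimates.values())
    return safety_factor * raw_total


for nines in range(2, 6):
    print(f"{nines} nines allows {allowed_downtime_hours(nines):.2f} hours/year")

print(f"Failure-based estimate: {estimated_downtime_hours(FAILURE_ESTIMATES):.1f} hours/year")
```

Three nines leaves roughly 8.76 hours of downtime per year, so a single long outage can consume the whole allowance. The failure-based estimate, by contrast, tells both parties which events are expected and how long each is assumed to take, which gives the negotiation something specific to work from.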

#11: Plan Ahead

Planning is a vital component of any significant project. When you are dealing with critical systems that have dozens or hundreds of users, their requirements must be taken into account any time a system is changed. Crisis situations, such as disasters, require significant planning so that everyone who is responsible for the recovery knows exactly where to go and what to do.

In addition, no matter how good your automated protection tools are, there will occasionally be events that are outside of their ability to cope. Multiple simultaneous failures will stretch most software and organizations to their limits. Plans that result from calm and rational thinking, made well in advance of the calamity, make a smooth recovery possible. Without a good plan, you’ll find yourself prioritizing and coordinating in real time and leaving yourself open to myriad variations on the theme of “fix it now!”

Any documented recovery plans should be approved by management and key personnel and may be part of a service level agreement. Keep these plans offline, in binders, and in multiple locations so that the acting senior person on-site can execute them when required. Make sure they are kept current. If they contain confidential or personal information, be very careful with the distribution of the plans, and limit access to those who need it.

Planning also includes coordination. If you are planning to bring down some critical systems for scheduled maintenance, be sure you coordinate that downtime with the users of the systems. They may have deadlines of their own, and your scheduled downtime could interfere with those deadlines. In most shops, a wise approach is to give the same level of scrutiny to scheduled downtime as to changes; a committee should bless scheduled downtime periods.
