Networks are naturally susceptible to failures because they contain many components and are affected by the configuration of every component. Where, exactly, is your network? In the switch? The drop cables? Bounded by all of the network interface cards in your systems? Any of those physical components can break, resulting in network outages or, more maddeningly, intermittent network failures.
Networks are also affected by configuration problems. Incorrect routing information, duplicate hostnames or IP addresses, and machines that misinterpret broadcast addresses can lead to misdirected packets. You’ll also have to deal with redundancy in network connections, as you may have several routers connecting networks at multiple points. When that redundancy is broken, or its configuration is misrepresented, the network appears to be down.
When a network that you trust and love is connected to an untrusted or unmanaged network, you run the risk of being subject to a denial-of-service attack or a network penetration attempt from one of those networks. These types of attacks happen within well-run networks as well. Security mogul Bill Cheswick asks the attendees at his talks if they leave their wallets out in the open in their offices. Nary a hand goes up. Then he asks how many leave unprotected network access points like twisted-pair wall jacks in open offices, and you see the tentative hands raised. Access to the network is valuable and has to be protected while still allowing user activity to proceed without onerous overhead.
Finally, networks use a variety of core services or basic information services that we lump into the network fabric. Naming systems like NIS or DNS, security and authentication servers, or host configuration servers for hosts requiring DHCP to boot and join a network will bring down a network if they are not functioning or are giving out wrong answers.
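Duplicate hostnames and IP addresses, one of the configuration problems described above, can sometimes be caught before a host table is loaded into a naming service. The following is only a minimal sketch: it assumes the table is hosts-file-style text (an address followed by one or more names per line), and the function name `find_conflicts` and the sample addresses are made up for illustration.

```python
from collections import defaultdict

def find_conflicts(hosts_text):
    """Scan hosts-file-style text for reused addresses and ambiguous names.

    Returns two dicts:
      - address -> line numbers, for addresses that appear on several lines
      - name -> sorted addresses, for names that map to several addresses
    """
    lines_by_ip = defaultdict(list)
    ips_by_name = defaultdict(set)
    for number, raw in enumerate(hosts_text.splitlines(), start=1):
        # Drop trailing comments, then split into address and names.
        entry = raw.split("#", 1)[0].split()
        if len(entry) < 2:
            continue  # blank line, comment, or address with no name
        ip, names = entry[0], entry[1:]
        lines_by_ip[ip].append(number)
        for name in names:
            ips_by_name[name.lower()].add(ip)  # hostnames are case-insensitive
    dup_ips = {ip: nums for ip, nums in lines_by_ip.items() if len(nums) > 1}
    dup_names = {name: sorted(ips) for name, ips in ips_by_name.items() if len(ips) > 1}
    return dup_ips, dup_names

# Hypothetical sample data, not taken from any real network.
SAMPLE = "\n".join([
    "10.0.0.5  alpha",
    "10.0.0.6  beta",
    "10.0.0.5  gamma   # reuses alpha's address",
    "10.0.0.7  beta    # reuses the name from line 2",
])

if __name__ == "__main__":
    duplicate_ips, duplicate_names = find_conflicts(SAMPLE)
    print("addresses used on several lines:", duplicate_ips)
    print("names with several addresses:", duplicate_names)
```

A check like this only sees what is written in the table. It will not notice a machine that has been configured by hand with an address already in use elsewhere.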
File and Print Server Failures
When file and print servers fail, clients will hang or experience timeouts. A timeout can mean that a print job or a file request fails. The timeout can also lead to wrong answers or data corruption. For example, on a Network File System (NFS) soft mount, a write operation that times out is not retried; the client gives up and returns an error to the application. If the application ignores that error, the data file is left with holes that will only be detected when the file is read.
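One defense on the application side is to never ignore the result of a write. The sketch below is an illustration rather than a complete solution: the function name `write_checked` is hypothetical, a POSIX-style system is assumed, and the `fsync` call is included because network file system clients may cache writes and report a failure only when the data is flushed or the file is closed.

```python
import os

def write_checked(path, data):
    """Write bytes to path and raise OSError if any step fails.

    Nothing here retries a failed write; the goal is only to make sure
    a failure is reported to the caller instead of being silently lost.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may accept fewer bytes than requested, so loop
            # until everything has been handed to the operating system.
            written = os.write(fd, view)
            view = view[written:]
        # Force cached data out to the server; deferred write errors
        # are reported here as OSError.
        os.fsync(fd)
    finally:
        # close can also raise OSError, which propagates to the caller.
        os.close(fd)
```

A caller would wrap `write_checked` in a `try`/`except OSError` block and decide whether to retry, alert an operator, or abandon the operation. What the right response is depends on the application.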
Database System Failures
Like any complex application, database systems contain many moving parts. These moving parts are not found in fans or disk drives, however; they are the interrelated subapplications that make up any large enterprise application.
The heart of a database system is the server process, or database engine: the primary database component that reads from and writes to the disk, manages the placement of data, and responds to queries with (we hope) the correct answers. If this process stops working, all users accessing the database stop working. The database engine may be assisted by reader-writer or block manager processes that handle disk I/O operations for the engine, allowing it to execute database requests while other processes coordinate I/O and manage the disk block cache.
Between the users and the database server sits the listener process. The listener takes the incoming queries from the users and turns them into a form that the database server can process. Then, when the server returns its answer, the listener sends the answer back to the user who requested it. The users, at their client workstations, run their end-user application, which is almost always one level removed from the actual SQL (Structured Query Language) engine.
The end-user application translates the user’s request into SQL, which is then sent across the network to the listener. Well-written end-user applications also shield the user from the dreary complexities of the nearly perfect grammar that SQL requires, and from ordinary problems with the database, such as server crashes and other widespread downtime. Obviously, the failure of any of these processes in the chain will cause the database to be unavailable to its users. Possible failures can include the following:
Application crashes. The application stops running completely, leaving an error message (we hope) that will enable the administrators to determine the nature of the problem.
Application hangs. A more insidious problem with databases or other systems that have significant interaction with the operating system occurs when a component process, such as a listener, reader-writer process manager, or the database kernel, hangs waiting for a system resource to free or gets trapped in a deadlock with another process. Some very long-running database operations (such as a scan and update of every record) may appear to make the system hang when they are really just consuming all available cycles.
Resource shortfalls. The most common resource shortfall to strike most database environments is inadequate disk space. If the space allocated to the database fills up, the database engine may crash, hang, or simply fail to accept new entries. None of these is particularly useful in a production environment. If the database itself doesn’t fill, the logs can overflow. There are logs that are written into the database disk space itself, and others that may be written into regular filesystem space. When data cannot be written to either type of log, the database will not perform as desired; it could hang, crash, stop processing incoming requests, or act in other antisocial ways.
Database index corruption. A database server may manage terabytes of data. To find this data quickly on their disks, database servers (and filesystems, for that matter) use a confusing array of pointers and links. Should these pointers become corrupted, the wrong data can be retrieved, or worse, the attempt to retrieve data from an illegal space can result in the application or the system crashing completely. Data corruption problems are fairly unusual because most good RDBMSs have consistency checkers, which scan the database for corruption on startup.
Buggy software. Almost by definition, software has bugs. (There is an old saw in computing that says all programs have at least one bug in them and can be shortened by at least one line. By extension, that means that all programs can be cut down to one line in length, and that line will have a bug in it.) Software is written by humans, and most of us, from time to time, make mistaks. Bugs can impact the system in various ways, from a simple misspelling in a log entry to a fatal bug that crashes the server and/or system. When trying to solve a problem, always consider the possibility that it was caused by a bug. Don’t just assume that all problems were caused by bugs, but at the same time, don’t strike bugs from the list of possible causes for almost any problem. These bugs can occur at any point in the subapplication chain: in server processes, listener processes, client SQL engines, user applications, or even the user’s keyed input.
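Two of the failure modes listed above, resource shortfalls and index corruption, lend themselves to simple automated checks. The sketch below is illustrative only. It uses SQLite, which ships with Python, as a stand-in for a production RDBMS, whose own consistency checker would be invoked differently. The function names and the 10 percent free-space threshold are assumptions, not recommendations.

```python
import shutil
import sqlite3

def has_disk_headroom(path, min_free_fraction=0.10):
    """Return True if the filesystem holding path has enough free space.

    min_free_fraction is an arbitrary example threshold; choose one that
    reflects how quickly your data and logs actually grow.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False  # cannot compute a fraction; treat as a failure
    return usage.free / usage.total >= min_free_fraction

def integrity_problems(conn):
    """Run SQLite's built-in consistency check on an open connection.

    Returns an empty list when the check reports "ok"; otherwise returns
    the problem descriptions SQLite produced.
    """
    rows = conn.execute("PRAGMA integrity_check").fetchall()
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages

if __name__ == "__main__":
    # Throwaway in-memory database, used only to exercise the checks.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.execute("CREATE INDEX orders_item ON orders (item)")
    print("disk headroom ok:", has_disk_headroom("."))
    print("integrity problems:", integrity_problems(conn))
```

Checks like these are most useful when run on a schedule and before the resource is exhausted, so that an administrator hears about a filling disk while there is still time to act.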
