Fixing HTML Generating Code

an article added by: Albert Lichtblau at 06022007



The Big Clean-Up: Fixing HTML Generating Code (The Hard Part)

Overview

As horrible a task as updating static HTML pages may seem, static documents at least have the advantage of predictability. Once a document is converted to XHTML, it stays XHTML unless someone actually modifies it. Code that generates HTML isn't nearly as predictable. You may think you've found all the glitches in the HTML it generates, and converted it to XHTML, but variations in how the code runs with different data may mean you have to come back for more. If you really want to generate conformant XHTML, you may find it more convenient to change some aspects of your coding style rather than just fix the existing code. XHTML's stricter syntactical rules impose discipline that hasn't existed in Web applications before and they make "best practices" considerably more important now than in the past.

Y2K Revisited?

Programmers recently survived an exercise in code archaeology and repair – cleaning up Year 2000 bugs – but the costs of that project were enormous. While failure to convert to XHTML isn't likely to shut off the world's power, disrupt emergency services, or devastate the world economy, the cleanup process is nearly as complicated as the work that was done on Y2K. Even though Web applications are newer, many of them were written in an immense hurry by developers who have since moved on to other projects. Some developers, especially those fond of writing "obfuscated" code, have created large amounts of code that does what it's supposed to but is difficult to manage or modify. Proper coding practice may have been better understood in the relatively short period the Web has existed than in the period when most of the code that wasn't Y2K-compliant was written – but those practices haven't necessarily been honored. Hack-and-slash code, cut-and-pasted out of various examples, has been popular on the Web (even encouraged). (Mea culpa: I've done plenty of this kind of code hacking myself.) Because HTML browsers have been forgiving devices, there just hasn't been a need to make sure that all the i's are dotted and the t's crossed. When it looks okay in a few dominant browsers and it doesn't crash the server, a project is often ready to go. The distinction between prototype and deliverable has blurred considerably.

Even for situations in which proper coding practice is followed and program behaviors are well understood and manageable, corner-cutting often occurs that makes it difficult to move programs from HTML to XHTML. Designers seeking to shrink their file sizes have figured out HTML's rules (and the rules in browsers) regarding when and where end tags are needed and how best to shrink small amounts of whitespace in a quest for the smallest possible files. These kinds of shortcuts make sense when generating pages on a large scale – and XHTML slightly raises the cost of document generation because it requires the use of end tags – but they make it more difficult to switch an application from generating HTML to generating XHTML. If you or your organization chooses to switch to XHTML, you most likely have a lot of work to do. The difficulty level of that work depends on the kind of code you have to deal with – not so much which language it is written in or what environment it runs in, but how it was structured and documented. You may want to survey the code you have before deciding whether to make the leap in order to have some estimate of the costs compared with the benefits. You may not have a choice, of course, if your customers or your organization want to apply XML-based tools to the documents you create. Then you just have to dig in.

Preliminaries

Some of the changes you need to make to your code are minor, even cosmetic, although the implications of the choices you make may be more complex. For your code to be conformant XHTML, you need to label it as such with a DOCTYPE declaration, an XHTML namespace, and possibly an XML declaration. Adding these pieces to code generation isn't very difficult (usually), but they may determine and modify some of your future paths through the application.

Choosing a DOCTYPE for generated code may be harder than choosing a DOCTYPE for static code because generated HTML may come out differently depending on circumstance. Developers who already have code that produces variations (typically for different brands and versions of browsers) may have to examine the code they produce to see how well (or indeed whether) it can fit any of the W3C XHTML DTDs. Sites built using features that XHTML doesn't support (such as Netscape's LAYER element or Microsoft's XML element) may face a serious choice between a lot of work to produce genuinely conformant documents and less work combined with a prayer that the code never goes near a validating XML parser.

Otherwise, the same rules apply to choosing a DOCTYPE for generated HTML as apply to static HTML. Sites that use frames have to use the frameset DOCTYPE, while sites that use the font element or align attribute have to use the transitional DOCTYPE declaration. Only sites that conform tightly to the strict set of rules should use the strict DOCTYPE declaration. Sites that generate relatively simple HTML (or had designers who were especially picky about writing to the W3C standards) probably have the best luck meeting this requirement.

Adding the XHTML namespace to your html element shouldn't be that difficult, fortunately. You also may want to add language information (using the lang and xml:lang attributes) to the html element if you know what language your generated documents contain.
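As a rough illustration of these preliminaries, here is a minimal Python sketch that writes the opening of a generated document. The function name xhtml_prolog and its parameters are made up for this example; the DOCTYPE strings and the namespace URI are the ones published for XHTML 1.0. It assumes the lang value is a plain language code that needs no escaping.

```python
# DOCTYPE declarations for the three XHTML 1.0 DTDs.
XHTML_DOCTYPES = {
    "strict": '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
              '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
    "transitional": '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
                    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">',
    "frameset": '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" '
                '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">',
}
XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"


def xhtml_prolog(flavor="transitional", lang=None, xml_declaration=False):
    """Return the lines that open an XHTML document (hypothetical helper).

    flavor picks the DOCTYPE, lang fills both lang and xml:lang, and
    xml_declaration controls whether the optional XML declaration is written.
    """
    lines = []
    if xml_declaration:
        lines.append('<?xml version="1.0" encoding="UTF-8"?>')
    lines.append(XHTML_DOCTYPES[flavor])
    html_tag = '<html xmlns="%s"' % XHTML_NAMESPACE
    if lang:
        # Assumption: lang is a simple code such as "en" and needs no escaping.
        html_tag += ' lang="%s" xml:lang="%s"' % (lang, lang)
    lines.append(html_tag + ">")
    return "\n".join(lines)


print(xhtml_prolog("strict", lang="en"))
```

Keeping the choice of DOCTYPE in one place like this makes it easier to change your mind later, for example when a page that was expected to be strict turns out to need the transitional DTD.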

Pitfalls: Case-Sensitivity

One of the easiest and most obvious changes in XHTML – the mandated shift to lowercase element and attribute names – can be one of the most frustrating for developers, at least for those who use uppercase markup in their code. While search-and-replace isn't that difficult when working with static HTML documents, where it's clear what represents markup and what doesn't, it can be a hassle inside of a program. Developers converting code from HTML to XHTML generation need to watch out for a variety of details that can disrupt the transition.

Variable and object names that include HTML element names can become inadvertent victims of quick search-and-replace approaches, possibly disrupting interactions with other program modules that don't generate HTML directly. Many developers rely on case for visual cues to make their code more readable – lowercase for names in the program and uppercase for the generated markup. That approach no longer works, although it can be reversed. Similarly, developers need to make sure that their case changes are thorough and that they modify code using HTML element and attribute names as arguments – not just code that creates elements and attributes. Programs that have gone to extreme lengths to separate markup from code, perhaps even creating tables of markup vocabularies, will be the easiest to change over. Programs that freely mix code and content (like most Active Server Pages and Java Server Pages) will be more difficult. This relatively simple-sounding change can impose some real costs, depending on the style of coding used. (If your code already uses lowercase, you can count yourself as very lucky!)
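One way to avoid the search-and-replace trap is to change case only inside things that look like tags. The sketch below is an assumption-laden illustration, not a complete HTML tokenizer: the names TAG, ATTR, and lowercase_markup are invented here, and the patterns ignore comments, script content, and badly unbalanced quotes. It lowercases element and attribute names while leaving attribute values, text, and program identifiers alone.

```python
import re

# One start or end tag: group 1 is "<" or "</", group 2 the element name,
# group 3 everything up to the closing ">" (quoted values may contain ">").
TAG = re.compile(r'(</?)([A-Za-z][A-Za-z0-9]*)((?:"[^"]*"|\'[^\']*\'|[^>"\'])*)>')

# One attribute: group 1 is the name, group 2 the optional "=value" part.
ATTR = re.compile(
    r'([A-Za-z_:][-A-Za-z0-9_:.]*)(\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s"\'>]+))?'
)


def lowercase_markup(source):
    """Lowercase element and attribute names inside tags only."""

    def fix_attr(match):
        # The value, if any, is copied through untouched.
        return match.group(1).lower() + (match.group(2) or "")

    def fix_tag(match):
        attrs = ATTR.sub(fix_attr, match.group(3))
        return match.group(1) + match.group(2).lower() + attrs + ">"

    return TAG.sub(fix_tag, source)


line = 'TABLE_ROWS = "<TABLE BORDER=1><TR><TD ALIGN=\'Left\'>See TABLE 1</TD></TR></TABLE>"'
print(lowercase_markup(line))
```

The variable name TABLE_ROWS and the words "See TABLE 1" survive because they never appear between angle brackets. Unquoted values such as BORDER=1 are left as they are, so quoting them remains a separate clean-up step.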

Pitfalls: Well-Formedness

Generated markup that passes the "it looks okay in a browser" test may conceal some serious structural problems that prevent its easy use as XHTML. HTML permits a wide variety of syntactical variations that can't parse as XHTML (such as omitted end tags) and browsers allow many more possibilities (such as meaningless or repeated end tags and omitted quotes around attribute values). Depending on how your code was written, sorting out these issues may be extremely easy or frustratingly difficult. "Off-by-one" bugs, where loops end in slightly the wrong place, can complicate matters significantly, especially if those errors only appear in certain contexts.

There are a number of common situations in which these problems occur. The paragraph element p often is used like the line break element br. Many early HTML developers treated p elements like the paragraph marks used in word processors to mark the ends of paragraphs, effectively treating p as a larger line break than br and not a container for paragraphs. The fragment shown here demonstrates this style:

   This is paragraph number one.<p>
   This is paragraph number two.<p>
   This is paragraph number three.<p>
   This is paragraph number four.<p>

Cleaning this problem out of text-generating code may be as simple as adding a slash (/) to the trailing tag, turning it into </p>, and adding a <p> to the start of each paragraph in a template. Or it may mean tracking down the code that creates the paragraph and prefacing it with an extra tag. Another possibility is replacing <p> with </p><p>, although that leaves an extra paragraph start tag at the end of the series and omits the <p> at the beginning.

List items have a similar problem, although typically in reverse. Many developers treat the <li> start tag as a stand-in for a bullet or number, not a container for a list item. As a result, code can look like this:

   <ul>
   <li> Item One
   <li> Item Two
   <li> Item Three
   </ul>
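Both fragments above use a start tag as a separator, so when the text is available as a sequence of strings the same repair applies to each. The following Python sketch shows one possible approach; the helper names split_on_start_tag and wrap_each are hypothetical, and the code assumes the legacy tags carry no attributes and that no end tags are already present.

```python
import re


def split_on_start_tag(text, tag):
    """Split legacy markup that uses <tag> as a separator into text chunks."""
    chunks = re.split(r"<%s>" % re.escape(tag), text, flags=re.IGNORECASE)
    # Drop the empty pieces left before a leading tag or after a trailing one.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def wrap_each(chunks, tag):
    """Emit every chunk as a properly closed <tag>...</tag> container."""
    tag = tag.lower()
    return "\n".join("<%s>%s</%s>" % (tag, chunk, tag) for chunk in chunks)


legacy_paragraphs = (
    "This is paragraph number one.<p>\n"
    "This is paragraph number two.<p>\n"
)
print(wrap_each(split_on_start_tag(legacy_paragraphs, "p"), "p"))

legacy_items = "<li> Item One\n<li> Item Two\n<li> Item Three\n"
print("<ul>")
print(wrap_each(split_on_start_tag(legacy_items, "li"), "li"))
print("</ul>")
```

Because the wrapping happens once per chunk, there is no leftover start tag at the end of the series and no missing one at the beginning.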

In a template-based approach, adding the closing </li> may not be difficult. In code that generates text explicitly, fixing this probably requires adding an extra bit of code that inserts the end tag where appropriate.

Lists and tables share some code-structuring problems. Both are generated by code looping over incoming information; processing comes to an end when the information ends or when the information is no longer appropriate. Because some of the code at the end of the loop can be skipped when the loop exits, it's possible that end tags for one or more elements will never be generated. Developers familiar with the need to close table elements already have some experience with this problem. (Many browsers don't display tables that don't have a clearly marked end.) XHTML conformance requires the table itself to be closed, as well as the cells, headers, and rows – and they must be closed in the reverse of the order in which they were opened. Maintaining a stack of open elements isn't always necessary, but it may become critical if your code takes multiple paths through the same information.

XHTML elements that have no textual content, such as img and br, can cause problems in code that treats these empty elements as start tags and nothing more. If your code uses generic mechanisms to build start tags (taking element and attribute name-value pairs as arguments, for instance), you may need to add logic that either checks the element name to see if it should be written as an empty tag or enables developers to specify that it is an empty tag. The latter approach is more flexible, but also more prone to errors.
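To make the stack idea concrete, here is a small Python sketch of a generic tag builder. XhtmlWriter and its methods are hypothetical names invented for this example; the EMPTY_ELEMENTS set lists the elements declared EMPTY in the XHTML 1.0 DTDs, and the writer takes the first of the two approaches described above by checking the element name itself.

```python
from xml.sax.saxutils import escape, quoteattr

# Elements declared EMPTY across the XHTML 1.0 strict, transitional,
# and frameset DTDs.
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


class XhtmlWriter:
    """Hypothetical helper: builds markup while tracking open elements."""

    def __init__(self):
        self.parts = []          # pieces of generated markup, in order
        self.open_elements = []  # stack of element names awaiting an end tag

    def start(self, name, attrs=None):
        """Write a start tag, or an empty-element tag for EMPTY_ELEMENTS."""
        name = name.lower()
        attr_text = "".join(
            " %s=%s" % (key.lower(), quoteattr(str(value)))
            for key, value in (attrs or {}).items()
        )
        if name in EMPTY_ELEMENTS:
            self.parts.append("<%s%s />" % (name, attr_text))
        else:
            self.parts.append("<%s%s>" % (name, attr_text))
            self.open_elements.append(name)

    def text(self, content):
        """Write character data, escaping &, <, and >."""
        self.parts.append(escape(content))

    def end(self, name=None):
        """Close the innermost element, or everything down to name."""
        if not self.open_elements:
            raise ValueError("no open element to close")
        target = name.lower() if name is not None else self.open_elements[-1]
        if target not in self.open_elements:
            raise ValueError("element %r is not open" % target)
        while True:
            current = self.open_elements.pop()
            self.parts.append("</%s>" % current)
            if current == target:
                break

    def finish(self):
        """Close whatever is still open and return the complete markup."""
        while self.open_elements:
            self.end()
        return "".join(self.parts)


writer = XhtmlWriter()
writer.start("ul")
for item in ["Item One", "Item Two", "Item Three"]:
    writer.start("li")
    writer.text(item)
    if item == "Item Two":
        break  # simulate a loop that exits before its clean-up code runs
    writer.end("li")
print(writer.finish())
```

Even though the loop exits before it closes the second list item, finish() unwinds the stack and writes the missing </li> and </ul> in the right order. The cost is that every piece of generating code has to go through the writer rather than printing tags directly.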

Web developers also use a number of techniques that may not cause problems in the environments they originally were built for, but that may cause problems if your XHTML documents move into a more XML-oriented environment. Server-side includes, for example, use HTML comments to store information that the server processes. Comments are a convenient mechanism for doing this because users don't see extra content should a server fail to include the content, unless they search the source for comments. Server-side includes should continue to work during a transition to XHTML, at least as long as the content they pull in doesn't disrupt well-formedness with unbalanced markup. They may not prove portable, however, if your document templates are parsed as XML before reaching the server-side include engine. XML provides other mechanisms for this kind of content referencing, called entities, but XHTML doesn't support the creation of external entities explicitly. Server-side includes are probably here to stay, but you may want to watch them closely – perhaps parsing content before shipping it to the users – if you perform major surgery on your code-generation architecture.

Pitfalls: Valid XHTML

While making generated HTML well formed is difficult, making it valid XHTML is even more difficult. While some developers may check their results against the HTML 4.0 DTDs (using tools such as the W3C's HTML Validation Service), most don't; and the discipline of conforming to a particular document structure is a fairly new introduction for most Web developers. It isn't difficult to generate valid XHTML in new code, but retrofitting older code can be tricky.

The problems involved in generating valid XHTML do not differ much from the problems involved in converting legacy HTML static documents to XHTML. Like their static document-creating counterparts, a lot of Web developers left out features such as the html, head, and body elements unless they had a particular use for them in their code. Title support and the use of meta elements to identify pages to search engines do mean that the head element commonly appears. Adding the basic structure of an XHTML document to code that left it out typically isn't that difficult – most developers don't try to specify the title element in the body of their document anyway. Starting a document with a DOCTYPE declaration and properly XHTMLized html element is reasonably easy, as is making sure that the document ends with </body></html>.

On the other hand, there's no simple way to ensure that your generated documents comply with the XHTML DTDs. Even code that generates document structures, as described in Article 13, may have a hard time testing the implications for document validity of adding a particular element or attribute. Furthermore, text generators can't do it at all until the document is complete. Generating valid XHTML (actually, any kind of valid XML) requires building the code that writes the documents around the DTD in some form. It doesn't mean that you can use a DTD's description of a document structure to generate code – most likely it means that the developers creating a particular document generation system need to be aware of the DTD and understand how the documents they create relate to that DTD.

Working within the constraints provided by DTDs isn't that difficult for new projects, although it requires programmers to have a much higher-level understanding of the markup vocabulary with which they are working. For legacy projects, however, developers need both that understanding and a thorough knowledge of how the old code worked. Cosmetic cleanups may catch gross errors, and even get past the well-formedness problems associated with old code, but document structure issues may require more cleanup. Making sure that block elements and inline elements mingle properly – or that all of the constraints on form structures are obeyed – may require a lot of testing and an eye for detail.

Note: True XHTML conformance demands valid documents. But HTML parsers and non-validating XML environments may find it acceptable for your documents to be well-formed XHTML that doesn't conform to a particular DTD. Such documents won't pass a test from the W3C's Validation Service or from any validating XML parser, however.

Testing, Testing, Testing

Whatever strategy you use for cleaning up your legacy code, the best way to make sure it works is to test it out against a broad range of possible situations. This does not differ much from the traditional testing of HTML code against as many browsers as possible to determine if the code looks right, but it's a slightly more formal process that probably is better accomplished with XML parsers than with armies of machines running browsers. (Hopefully, XHTML and more standards-compliant browsers will reduce the need for the old-style testing in the long run.)

Most of the tools for checking well-formedness or validity examine a single document at a time – excellent for small-scale work, but not especially useful if you need to check 4,000 pages or even 4,000 variations of a single code generation. Fortunately, options for testing multiple documents are starting to appear. The Web Design Group's WDG HTML Validator, available at http://www.htmlhelp.com/tools/validator/, is a Web-based tool for validating HTML and XHTML documents and sites. It includes a batch mode (http://www.htmlhelp.com/tools/validator/batch.html) that accepts a list of URLs and reports on the conformance of all of those documents. (There seems to be a limit of 60, but it's a start.) You also can build your own tools on top of existing XML parsers, although hopefully more tools of this kind will become widely available as more developers transition to XHTML.
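As a starting point for building your own tool on top of an existing XML parser, the sketch below runs a batch of generated documents through the parser in Python's standard library. The function name check_well_formed and the sample document names are invented for this example. Note the limits: this checks well-formedness only, not validity against a DTD, and because the parser doesn't load the XHTML DTDs, entities defined only there (such as &nbsp;) are likely to be reported as errors.

```python
import xml.etree.ElementTree as ET


def check_well_formed(documents):
    """Map each document name to None if it parses, else to the error text.

    documents is a dict of name -> markup string, for example the output of
    one code path run against many different sets of input data.
    """
    report = {}
    for name, markup in documents.items():
        try:
            ET.fromstring(markup)
            report[name] = None
        except ET.ParseError as err:
            # The message includes the line and column of the first problem.
            report[name] = str(err)
    return report


samples = {
    "closed-list": "<ul><li>Item One</li><li>Item Two</li></ul>",
    "open-list": "<ul><li>Item One<li>Item Two</ul>",
}
for name, problem in check_well_formed(samples).items():
    print(name, "ok" if problem is None else problem)
```

Feeding the same generator many different inputs and collecting the failures in one report is closer to what a code-generation project needs than checking one page at a time.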

Tip: Liam Quinn, the maintainer of the WDG HTML Validator, also maintains a list of other validation tools at http://www.htmlhelp.com/links/validators.htm.

Tools alone are not likely to solve all of your testing problems. Testing for XHTML conformance is a process you usually need to perform at multiple stages of development. In addition, it may require placeholders at various points in the process for incomplete work. Because the validation process doesn't provide any simple means of "validating only this far into the document, but don't worry about the missing pieces," it may be difficult to use automated testing processes on intermediate phases of work. In these cases, at least until someone develops a more controllable set of testing tools, a human being who reads markup and compares it to a thorough understanding of XHTML 1.0's structures probably presents a better approach to testing.

