Moving From HTML to XHTML
Overview
Hypertext Markup Language (HTML) is getting an enormous and overdue cleanup. Much of HTML's early charm, as browsers reached a wide audience, was its ease of use: browsers tolerated a wide variety of syntactical variations and unknown markup. Unfortunately, that charm has worn thin through years of "browser wars" and demands for new features that go beyond presenting documents. The World Wide Web Consortium (W3C) is rebuilding HTML on a new foundation, preserving HTML's well-understood vocabulary while preparing the way for a very different style of processing. In some ways, the W3C is returning HTML to its roots – rebuilding it as an information format that can be processed and reused, and discarding some of the wreckage created during the browser wars between Microsoft, Netscape, and a host of smaller participants.
The new framework is Extensible Markup Language (XML) – a generic syntax for documents that has much stricter rules than HTML. By combining the old HTML vocabulary with the stricter XML syntax, the W3C hopes both to reinvigorate HTML and to open the door to major expansions of the Web's capabilities.
The benefits of XHTML won't all come for free, however. Developers will have to learn a few basic rules in order to take advantage of XHTML, and adoption probably will be fairly slow. While XHTML is mostly compatible with HTML, many older HTML documents decidedly are not compatible with XHTML. Some developers, notably those creating dynamic HTML documents, already have encountered the need for stricter and more consistent structures. It's very hard to create "dynamic" documents if the scripting logic can't consistently reference points in a document. Still, others need to be convinced of the benefits before moving forward. Like almost all standards, XHTML makes more sense and costs less to use as more people adopt it.
While XHTML can fix a lot of problems in some areas, early adopters of XHTML likely will be application developers rather than Web site managers. XHTML brings application developers a new set of (largely free) tools from the XML world that can simplify development and information management more than enough to make up for the minor inconveniences that XHTML imposes on HTML developers.
HTML: Describing Documents
HTML began as a very simple tool that marked up documents for exchange over a network. Because the markup was so simple and the description for it open and available, it was easy to present the same information on different computing platforms. Because browsers were relatively easy to write at first, developers wrote simple browsers and an explosion of new browsers appeared – each with its own slightly different take on how to present the information described by HTML. These different takes were part of the original plan for HTML to describe document structures, such as headlines and paragraphs, rather than describe explicit formatting. By sticking to high-level descriptions, HTML avoided the snarl of formatting that had made many previous systems accessible only to advanced users.
As a result, HTML spread far more quickly than Tim Berners-Lee, its inventor, had dared hope. People around the world began composing HTML directly – something that he never expected would happen. (Originally, HTML was expected to be a format hidden behind tool frameworks; one of the earliest pieces of HTML software was a WYSIWYG editor that Berners-Lee built himself.)
This mostly structural approach led to high hopes for a number of ventures, notably automated agents that would scour the Web for information and present it to users when they wanted it. These tools could interpret the basic structures of HTML documents along with their content to gather information in a more sophisticated way than was possible with simple text-based searches. Some HTML management software also used these structures as the foundation for more efficient handling of large sets of documents. These possibilities were foiled quickly, however, by several developments in the HTML world that made HTML very difficult to process.
The snarl of formatting quickly arrived as HTML spread to more demanding users. While the Web provided the first Internet medium that was used easily (and not especially controversially) as an advertising medium, the simple high-level approach that had spurred its early growth quickly became anachronistic. Users demanded more control over formatting; they objected especially to the variations in appearance that appeared when the same document was opened in different browsers. The tools that were available to generate HTML took wildly divergent approaches to how they built pages; some relied on tables for precise placement of images and text while others used more flexible approaches for different kinds of information. The early hand-coders of HTML, developers close to the markup, soon found their skills in significant demand because they understood how to intervene in a document directly to achieve certain effects across browsers that the automated tools weren't flexible enough to support. Even hand-coded HTML rapidly became more intricate and more commonly aimed at formatting consistency than document structures. As the W3C began to take control of HTML, it tried repeatedly to stamp out these features. The W3C did this by marking key formatting tools such as the FONT tag as "deprecated" and creating more formal descriptions of HTML using document type definitions (DTDs) from HTML's original inspiration, the Standard Generalized Markup Language (SGML). While W3C Recommendations for HTML provided a foundation that application developers could reference, these recommendations have had relatively little effect on the main body of HTML developers. These developers continue to create documents using tools and methods that work in browsers, with little concern for how they might fit the rules of a standard specified using an obscure formal language. 
While the chaos of the browser wars has settled down a little as Netscape and Microsoft have curbed their onslaught of new features, HTML itself is a snarled mess (even if it is one that people are accustomed to).
XML: A Structured Way to Describe Information
XML began with a different set of premises than did HTML. While HTML set out to describe documents, XML set out to create tools that developers could apply in describing any kind of structured information. While XML shares HTML's syntactical inheritance from SGML, such as the use of < and > as markup delimiters, XML provides no vocabulary of its own, thus enabling developers to create their own tag sets, effectively new vocabularies. At the same time that it opens up the vocabulary possibilities, however, XML slams the door on a wide variety of structural variations that were common in HTML and even in SGML. XML simplifies and adds extra rigidity to SGML, while HTML was effectively an application of SGML. The simplifications and extra constraints of XML are designed to make XML documents exceptionally easy to process, providing a level of consistency that HTML didn't guarantee.
You can describe XML documents in terms of containment. A single root element may contain other elements, attributes, and textual content; child elements themselves may contain a similar mix. While HTML developers often talk about tags – the markup used to begin and end elements – they rarely focus on the element structures those tags create. (Significant exceptions exist among dynamic HTML developers and others with more need for structure, certainly.) XML developers use tags that look exactly like HTML tags, but XML developers are more concerned with creating clean structures in which all of those tags produce elements that are organized neatly. Every element has a start tag and an end tag (or something new, a single tag representing an empty element), and those tags are arranged so that element boundaries never overlap. These stricter rules make it very easy for an application to determine the structure of the document and combine it with its understanding of the document content to perform further processing.
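The nesting rules described above can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser as a stand-in for "an XML application"; the helper name is_well_formed and the sample fragments are illustrative and do not come from the original text.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Properly nested: every element is closed, the empty element uses the
# <br/> form, and element boundaries never overlap.
nested = "<p>Some <b>bold and <i>italic</i></b> text.<br/></p>"

# Habits that tolerant HTML browsers accept but XML rejects:
overlapping = "<p>Some <b>bold and <i>italic</b></i> text.</p>"  # b and i overlap
unclosed = "<p>First line<br>second line</p>"                    # br never closed

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
print(is_well_formed(unclosed))     # False
```

A parser that refuses the second and third fragments outright is exactly the trade described here: less tolerance for authors in exchange for a structure every application can rely on.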
HTML + XML = XHTML
By combining the well-understood vocabulary of HTML with the clean structures of XML, the W3C is building something new: XHTML. XHTML at its simplest is just HTML cast into XML syntax, with a few tips for making sure that XML syntax doesn't interfere with older browsers. XHTML should work just fine in older software, although HTML has a harder time getting into XHTML software. Cleaning up HTML in this way has many useful effects. For example, it provides XHTML tool creators with additional development frameworks and document repositories (originally built for XML). It also establishes a firm set of rules for document structure that can simplify projects such as dynamic HTML in which scripts manipulate the document structure. For the average user, nothing is likely to change on the surface, but what happens below that surface will be more efficient.
The long-run implications of the move to XHTML dwarf these simple effects of XHTML 1.0, however. Applying XML's stricter rules to HTML document structures opens up a new range of possibilities, including some significant changes to the HTML vocabulary itself. While version 1.0 of XHTML simply recasts HTML 4.0 into XHTML, versions 1.1 and beyond start to move beyond tried-and-true HTML. XHTML enables developers both to shrink the HTML vocabulary (by declaring that they use only portions of it) and to expand the HTML vocabulary (by supplementing HTML with MathML, SVG, and other markup vocabularies). Instead of leaving HTML as a monolithic set of vocabulary, the W3C is breaking HTML into modules and providing tools for additional modules. Rather than try to shoehorn every piece of information into some HTML representation, developers will create their own vocabularies and integrate them with those of HTML. Then HTML documents could contain that information as XML, or XML documents could contain portions of HTML.
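Because XHTML is XML, a generic XML toolkit can address parts of a page by their position in the element structure. The sketch below assumes Python's standard xml.etree.ElementTree module; the sample page and its link targets are invented for illustration, and the only fixed value is the XHTML namespace URI.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"x": XHTML_NS}  # "x" is an arbitrary local prefix for path queries

# A small, well-formed XHTML page (illustrative content only).
page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample page</title></head>
  <body>
    <h1>Headline</h1>
    <p>See the <a href="http://www.w3.org/">W3C site</a>.</p>
    <p>Second paragraph with <a href="/local.html">a local link</a>.</p>
  </body>
</html>"""

root = ET.fromstring(page)

# Every point in the document can be referenced by its place in the tree.
title = root.findtext("x:head/x:title", namespaces=NS)
links = [a.get("href") for a in root.iterfind(".//x:a", NS)]
second_paragraph = root.find("x:body/x:p[2]", NS)

print(title)                                # Sample page
print(links)                                # ['http://www.w3.org/', '/local.html']
print("".join(second_paragraph.itertext())) # Second paragraph with a local link.
```

The same queries against tolerant, loosely structured HTML would first need a browser-style error-recovery step; with XHTML the parser either produces this tree or reports an error.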
For example, while cellular telephones and personal digital assistants (PDAs) have become increasingly powerful tools, they still have limited processing power and graphic displays that typically are black and white instead of color. These lightweight devices may not support images, ActiveX objects, Java applets, or even (in some cases) forms. While users may enjoy surfing the Web from their cell phones – sometimes meetings do last forever – the phones aren't really up to the tasks demanded by a full Web browser. By implementing a subset of XHTML, however, these phones can tell servers that they can accept information only in that subset, thereby giving the server the chance to build Web pages designed specifically for cell phones with that particular profile. Informative (hopefully) but low-bandwidth text will replace full-color images, enabling cell phone users to retrieve information from the Web without having to throw most of it away.
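As a rough illustration of the server-side step, the sketch below rewrites an XHTML fragment for a hypothetical text-only device profile by swapping each image for its alt text. The function name text_only and the choice of a span element as the replacement are assumptions made for this example; the text does not prescribe how a server should build such pages.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
IMG = f"{{{XHTML_NS}}}img"    # ElementTree writes qualified names as {uri}name
SPAN = f"{{{XHTML_NS}}}span"


def text_only(root):
    """Replace every img element with a span holding its alt text, in place."""
    # Take a snapshot of the elements first so the tree can be edited safely.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == IMG:
                span = ET.Element(SPAN)
                span.text = child.get("alt", "")
                span.tail = child.tail  # keep the text that followed the image
                parent[index] = span
    return root


fragment = ET.fromstring(
    '<body xmlns="http://www.w3.org/1999/xhtml">'
    '<p>Logo: <img src="logo.png" alt="Example Corp"/> welcome</p>'
    "</body>"
)
text_only(fragment)
print("".join(fragment.itertext()))  # Logo: Example Corp welcome
```

The transformation is only this simple because the input is guaranteed to be a clean tree; that guarantee is what the stricter XML syntax provides.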
Going the other direction, XHTML opens the door to the Web's original application of exchanging information among scientists. While HTML and the Web are important to this community, HTML's lack of math support has hindered a lot of work. (It appeared briefly in HTML 3.0, then disappeared with 3.2, without widespread implementation.) The W3C created another specification, MathML, which provides support for a wide variety of mathematical information from simple equations to integrals, square roots, and all kinds of symbols. By adding MathML as a module to XHTML, scientists can integrate these two vocabularies. They also need an application that supports both vocabularies, of course, and those are mostly still in development. The W3C's Amaya browser and the developing Mozilla browser both support XHTML and MathML, but building these modules is difficult.
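To make the idea of integrating two vocabularies concrete, here is a minimal sketch of an XHTML page that embeds a MathML expression, read with Python's standard xml.etree.ElementTree module. The sample document and the helper vocabulary_of are illustrative assumptions; the two namespace URIs are the ones the W3C assigns to XHTML and MathML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XHTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# An XHTML page whose paragraph contains a MathML square root (sample content).
document = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Square roots</title></head>
  <body>
    <p>The length of the diagonal is
      <math xmlns="http://www.w3.org/1998/Math/MathML">
        <msqrt><mn>2</mn></msqrt>
      </math>.
    </p>
  </body>
</html>"""


def vocabulary_of(element):
    """Return the namespace URI from an ElementTree tag such as '{uri}name'."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


root = ET.fromstring(document)
counts = Counter(vocabulary_of(el) for el in root.iter())

print(counts[XHTML_NS])   # 5  (html, head, title, body, p)
print(counts[MATHML_NS])  # 3  (math, msqrt, mn)
```

Parsing the mixed document is the easy part, as the counts show; rendering the MathML portion is the work that still requires an application supporting both vocabularies.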
Note: Microsoft already has pioneered the use of XML data islands in Office 2000 documents, a strategy somewhat like that just described. The HTML surrounding these data islands is not XHTML, however, and Microsoft's approach is not supported by any activity at the W3C. Hopefully, future versions will support XHTML rather than data islands. (Office 2000 documents aren't a total loss, however. As you see later, there are tools for converting them into XHTML just like any other HTML document.)
Finally, XHTML is easy to use within XML documents and it provides a well-known format for document-oriented information. Any time an XML developer needs to add a space for extra description of something, XHTML is available. XHTML enables these developers to use an HTML vocabulary, thereby taking advantage of the many tools available for processing, creating, and displaying HTML (including components that can be built into non-browser applications). XML developers don't need to reinvent the wheel every time they must include a few paragraphs or some images because they can build on what came before without fear of "contaminating" their XML with old-style HTML. While the first steps toward XHTML – shown in the first two figures – may seem like they require a lot of extra work without much payback, you have to view them as a beginning, not an end in themselves. XHTML is an enabling technology, something that lets developers build new technologies on old; it is not a technology that produces instant transformations of existing work.
The rest of this book details the many steps involved in transitioning from HTML to XHTML and eventually XHTML+XML, giving you both the established HTML vocabulary and the potential to extend that vocabulary. The first few chapters show you HTML from an XML perspective, and provide you with the XML foundation you need to understand XHTML. (It's a remarkably small foundation, fortunately.) After the basic tour, you start applying XHTML to existing sites and exploring ways to clean up your coding processes for new documents, as well as converting your older documents to the new standard. The hardest part is cleaning up your applications, but we look at some tools that may help you avoid the painful process of reading and testing all of your code by hand. In some cases, you should be able to upgrade to XHTML by changing a line or two of code – it all depends on the application's architecture. In addition to cleaning up existing applications, you explore the newer tools XHTML makes available for building applications including the transformative power of Extensible Stylesheet Language (XSL) and the Document Object Model (DOM). By shifting your work to some of these tools, you may be able to move between HTML and XHTML by changing a single setting while simultaneously getting powerful tools that reduce the amount of code you need to write.
Along the way, you explore the costs and benefits of XHTML. Unfortunately, XHTML makes more demands than any previous upgrade of HTML. At the same time, however, it offers capabilities that Web application developers dreamed of for years but had to work without. XHTML promises these developers much more powerful tools for tasks such as exchanging information in client-server applications, as well as more sophisticated methods for providing appropriate content to many different kinds of clients. While not everyone will jump on the XHTML bandwagon initially, those who do will have access to a much larger set of possibilities than those who only see the costs.
