Reading the XHTML DTDs: A Guide to XML Declarations
Although the W3C has long had document type definitions (DTDs) for HTML, few developers actually
use those DTDs as a foundation for learning HTML. XHTML 1.0 simplifies those DTDs with the slightly
friendlier XML syntax – they previously used SGML's more complex syntax – and the increased
emphasis on validation may lead developers to explore them more closely. Making good use of XHTML
1.1 requires some level of understanding of DTDs, so getting started now is a good idea. Fortunately,
XHTML doesn't use every tool XML provides; figuring out XHTML is easier than learning all about XML.
Note
The W3C is moving slowly toward its new XML Schemas standard for describing
document structures. You'll want to learn XML Schemas when they're ready, but
the DTDs described in this article provide a solid foundation for figuring them
out.
You can work with XHTML 1.0 without any comprehension of the DTD because the rules for element
and attribute usage are the same as those for HTML 4.0. However, if you plan on using validating
parsers with XHTML 1.0, you should know about DTDs to figure out some of the error messages you
may encounter. In addition, understanding DTDs can help you out considerably with XHTML 1.1 and its
modular approach.
Note
Because you don't necessarily need to understand DTD syntax to use XHTML,
you're welcome to bail out of this article if you prefer, and come back to it if and
when you need it.
The W3C wrote the XHTML DTDs for its own convenience, making them more manageable (and at an
abstract level, more readable) – but at the cost of requiring some cross-referencing to figure out exactly
what's included in a particular element or attribute. As a result, the XHTML DTDs aren't recommended
reading for developers without an XML or SGML background. The following sections introduce the
different kinds of declarations used within the XHTML DTDs in their simpler forms, building up to the
more complex rules used to assemble the XHTML 1.0 DTDs.
Tip If you want a guide to creating and reading XML DTDs in all their glory, try XML: A
Primer, 2nd Edition by Simon St. Laurent (IDG Books, 1999). For even more
details on XML technicalities, see XML Elements of Style (McGraw-Hill, 1999),
also by Simon St. Laurent.
Element Type Declarations
Every valid document needs one or more element type declarations, which describe element names
used within a document and content that appears within a given element. If an element name appears
in a document and there is no corresponding element type declaration, validating parsers report an
error. (Some parsers also halt processing, although that isn't required.) Similarly, if an element appears
in a context where it's not supposed to appear, validating parsers report errors.
The syntax for element type declarations is simple:
<!ELEMENT elementName contentModel>
Element names must begin with letters, underscores, or colons, and they may contain letters,
underscores, colons, digits, hyphens, and periods. Element names beginning with xml (or any case
variation on that, such as XmL or XML) are reserved for the use of the W3C. The use of colons is
discouraged except for use with namespaces, which Article 4 describes.
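As a rough illustration of these naming rules, the declarations below use made-up element names (section, _draft, and line-item.2 are assumptions for this sketch, not names from the XHTML DTDs). Names to avoid are listed in a comment so that the fragment itself stays legal.

```xml
<!-- Legal: each name starts with a letter or an underscore. -->
<!ELEMENT section EMPTY>
<!ELEMENT _draft EMPTY>
<!ELEMENT line-item.2 EMPTY>

<!-- Names to avoid:
     2column   starts with a digit, so it is not a legal name
     .hidden   starts with a period, so it is not a legal name
     xmlNote   begins with the reserved letters xml -->
```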
Content models can be a lot more complicated, enabling designers to specify intricate combinations of
elements and text. There are four basic types of content models available: EMPTY, ANY, structured
content models, and mixed content models.
Note
Element type declarations don't provide any background on what the element is
for, what contexts it may be used in, or what its appearance in a given context
might mean. You have to provide that information separately, typically in
documentation. Element type declarations only describe a small, but important,
set of element properties: name and allowed contents.
The EMPTY content model
The EMPTY content model is the simplest model available. EMPTY elements may use either an
empty-element tag or a pair of start and end tags with nothing whatsoever (not even whitespace)
between them. However, they may (and usually do) store information in attributes, which are declared
separately.
The img and br elements are both examples of elements with EMPTY content models, and their
declarations are very similar:
<!ELEMENT img EMPTY>
<!ELEMENT br EMPTY>
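In a document, the two permitted forms look like the following sketch. The file name logo.png and the alt text are placeholders, and the attributes are assumed to be declared separately.

```xml
<!-- Empty-element tag -->
<br/>
<!-- Start and end tags with nothing between them; equally valid -->
<br></br>
<!-- EMPTY elements can still carry attributes -->
<img src="logo.png" alt="Company logo"/>
<!-- Invalid against an EMPTY declaration: <br> </br>
     (even a single space counts as content) -->
```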
The ANY content model
The ANY content model is nearly as simple as the EMPTY model. Elements declared as ANY can contain
any mix of text and (declared) elements. The ANY content model is never used within XHTML 1.0, but it
sometimes appears in XML documents that contain XHTML content (perhaps followed by a comment):
<!ELEMENT documentation ANY>
<!--Please note: XHTML is the preferred content for the
documentation element, but other models may be used.-->
XML developers frown on the widespread use of ANY, seeing it as introducing serious weaknesses, but
you may use it appropriately in your own DTDs at the beginning of a project or to preserve spaces for
future extensions.
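A small self-contained document gives a sense of what ANY permits. The note element is an assumption added for this sketch; only documentation comes from the declaration above.

```xml
<?xml version="1.0"?>
<!DOCTYPE documentation [
  <!ELEMENT documentation ANY>
  <!ELEMENT note (#PCDATA)>
]>
<documentation>
  Loose text is allowed here,
  <note>and so is any declared element,</note>
  in any order and any number of times.
  <documentation>Even the element itself may be nested.</documentation>
</documentation>
```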
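Structured content models
Structured content models describe which child elements may appear, in what order, and how often. A
comma separates items that must appear in sequence, a vertical bar (|) separates alternatives, and a
suffix sets the count: ? means zero or one, * means zero or more, + means one or more, and no suffix
means exactly one. Using this key, you can translate the content model of the table element type's
declaration and its pieces into English. The outside parentheses just enclose the entire content
model – a requirement for structured content model declarations.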
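For reference, the pieces discussed in this walkthrough fit together into a declaration along the following lines. It is reassembled here from the fragments quoted in the text, so check it against the DTD you are actually using; the comment summarizing the indicators is an addition.

```xml
<!-- ?  zero or one      *  zero or more      +  one or more
     ,  items in sequence      |  one alternative or the other -->
<!ELEMENT table
  (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
```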
The first item inside the parentheses, caption?, indicates that a caption element may appear once
as the first element inside the table element (but it is optional). The comma following caption?
indicates that the other items following it must appear in sequence. The next chunk provides some
options:
(col*|colgroup*)
This grouping means that either col or colgroup elements may appear after the caption and before
the thead (if they appear), but that col and colgroup elements may not be mixed within a given table
element. This chunk of markup says that either zero or more col elements or zero or more colgroup
elements may appear at this point. If the developers of the XHTML standard had wanted to allow col
and colgroup elements to be mixed, they could have written:
(col|colgroup)* <!--this is not the route XHTML chose-->
This says that zero or more instances of the col or colgroup elements may appear, without
prohibiting both from appearing in a single sequence.
A comma follows the (col*|colgroup*) grouping, followed by thead?. Like caption?, this allows
the thead element to appear zero or one times. The comma following then permits tfoot? to indicate
the possible appearance of a tfoot element zero or one times.
The last portion of the content model is similar to the (col*|colgroup*) grouping, but with a slight
change:
(tbody+|tr+)
Again, either tbody or tr elements may appear in this location within the content model. However, at
least one instance of one of these elements is required for a valid document. This is the only required
content within a table element. No instance of the table element may appear without containing at
least a tbody or a tr element.
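Read against that content model, the fragments below sketch what a validating parser should accept and reject. The cell contents are placeholders, and the child elements (tr, td, th, and so on) are assumed to be declared elsewhere in the DTD.

```xml
<!-- Valid: only the required part is present. -->
<table>
  <tr><td>1</td></tr>
</table>

<!-- Valid: every optional part appears, in the required order. -->
<table>
  <caption>Quarterly totals</caption>
  <col/><col/>
  <thead><tr><th>Q1</th><th>Q2</th></tr></thead>
  <tfoot><tr><td>10</td><td>12</td></tr></tfoot>
  <tbody><tr><td>4</td><td>5</td></tr></tbody>
</table>

<!-- Invalid: <table><col/><colgroup/><tr><td>1</td></tr></table>
     mixes col and colgroup.
     Invalid: <table><caption>Empty</caption></table>
     has neither a tbody nor a tr. -->
```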
Mixed content models
Most of HTML's elements use mixed content models, which enable document authors to mix text
and elements together to create Web pages. Mixed content models in XML come in two varieties. The
simpler variety enables you to create elements that may contain only text. The title element, for
example, is declared this way:
<!ELEMENT title (#PCDATA)>
PCDATA stands for parsed character data, the only one of SGML's textual types that XML supports.
You can write the same declaration like this:
<!ELEMENT title (#PCDATA)*>
The asterisk is optional when a text-only element is declared, but the asterisk makes it more consistent
with other mixed content models.
Mixed content models that describe the mixture of text and elements are more complicated. They look
like structured content models, using the | and * indicators, but you are very limited in how you can use
them. The general syntax for an element type declaration using mixed content of this kind looks like this:
<!ELEMENT elementName (#PCDATA | child1 | child2 | ...)*>
Mixed content models only enable you to list a set of elements that may appear mixed with text, but you
cannot specify their sequence or the number of times they may appear. For example, if a very simple
paragraph element only contains text mixed with bold and italic elements, the declarations might look
like this:
<!ELEMENT bold (#PCDATA)>
<!ELEMENT italic (#PCDATA)>
<!ELEMENT paragraph (#PCDATA | bold | italic)*>
Based on those declarations, all of the paragraphs shown here are legal:
<paragraph>There's just text in this one!</paragraph>
<paragraph><bold>This one's bold!</bold></paragraph>
<paragraph><italic>This one's italic!</italic></paragraph>
<paragraph><bold>This one's part bold</bold> and <italic>part italic!</italic></paragraph>
<paragraph><italic>This one's part italic</italic> and <bold>part bold -
</bold> and then <bold>bold again!</bold></paragraph>
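To try these declarations with a validating parser, they can be placed in an internal DTD subset. The doc wrapper element is an assumption added so that the document has a single root; it is not part of the declarations above.

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ELEMENT doc (paragraph+)>
  <!ELEMENT bold (#PCDATA)>
  <!ELEMENT italic (#PCDATA)>
  <!ELEMENT paragraph (#PCDATA | bold | italic)*>
]>
<doc>
  <paragraph>Plain, <bold>bold</bold>, <italic>italic</italic>, <bold>bold again</bold>.</paragraph>
  <!-- Would be invalid: <bold>nested <italic>markup</italic></bold>,
       because bold is declared to hold text only. -->
  <paragraph><italic>Order and repetition are unrestricted.</italic></paragraph>
</doc>
```

A command-line validator such as xmllint (for example, xmllint --valid --noout with the file name) should accept this document as written.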
Mixed declarations are used throughout the XHTML 1.0 DTDs; to understand their usage there, you
need to know about parameter entities (which I cover later in this article).
Attribute List Declarations
Attribute list declarations enable you to specify attributes that you can use on particular element types.
Every element in XHTML has at least one core set of attributes so attribute list declarations (sometimes
abbreviated ATTLIST declarations) are an important part of the XHTML 1.0 DTDs. You have more
options for attribute list declarations than element type declarations in XML, but fortunately the XHTML
1.0 specification stays away from the most complicated types of attributes.
The basic syntax for an attribute list declaration looks like this:
<!ATTLIST elementName
attName attType default
attName attType default
... >
Multiple attribute list declarations may appear for a single element type. If a particular attribute is
defined more than once for a given element, the first definition is the one that gets used. Any number of
attributes may be defined for a particular element in a given attribute list declaration, even none:
<!ATTLIST myElement>
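The first-definition rule can be sketched as follows. The attribute names note and lang, and the default value, are hypothetical and chosen only for illustration.

```xml
<!ELEMENT myElement EMPTY>

<!-- First declaration of note: this is the one that binds. -->
<!ATTLIST myElement
  note CDATA #IMPLIED>

<!-- Later declaration: the repeated definition of note is ignored,
     but lang has not been seen before, so it is added. -->
<!ATTLIST myElement
  note CDATA "ignored default"
  lang CDATA #IMPLIED>
```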
Attribute names are subject to the same rules as element names: they must begin with letters,
underscores, or colons, and may contain letters, underscores, colons, digits, hyphens, and periods.
Attribute names beginning with xml (or any case variation on that, such as xMl or XML) are reserved for
the use of the W3C. Furthermore, the use of colons is discouraged except for use with namespaces.
The simplest type of attribute is the CDATA type, an abbreviation for Character DATA. The simplest
default is the keyword #IMPLIED, which doesn't supply any default value for the attribute. A very simple
attribute declaration might look like this:
<!ATTLIST myElement
note CDATA #IMPLIED>
The following sections discuss the attribute types and default options in more detail.
Types of attributes
Let's take a look at how these attributes are used by exploring subsets of the declarations employed in
the XHTML DTD. The DTD uses parameter entities, covered later in this article, and smaller examples
are easier to work with, so we'll create examples that are easy to read but aren't exact quotations from
the XHTML DTD. Also, as you'll see, the W3C uses parameter entities to specify expectations for
attribute content that can't be expressed using the basic types.
Attributes of type CDATA appear throughout the XHTML DTDs. CDATA is the loosest model,
accommodating all kinds of needs while setting very few expectations. CDATA attribute types can hold
URLs, numeric information, style information – basically anything that can be expressed as text. A
subset of the attribute list declaration for the img element, for example, might look like this:
<!--These are compatible with the XHTML DTDs but do not represent the complete declarations from
the XHTML DTD-->
<!ATTLIST img
src CDATA #REQUIRED
alt CDATA #REQUIRED
height CDATA #IMPLIED
width CDATA #IMPLIED
>
The src attribute, which takes a URL, is represented as CDATA. The alt attribute, which contains text
to display if the image isn't loaded, also is represented as CDATA despite the differences between its
content and that of the src attribute. The height and width attributes, which accept lengths, also use
CDATA. CDATA can handle all of these different types because it places so few restrictions on its
content.
The XHTML 1.0 Recommendation names all of its attributes of type ID as id and makes them available
to nearly every element in the DTD. To add the id attribute to the img element, you just use this:
<!ATTLIST img
id ID #IMPLIED
>
Or add this to the preceding list:
<!--These are compatible with the XHTML DTDs but do
not represent the complete declarations from the XHTML DTD-->
<!ATTLIST img
src CDATA #REQUIRED
alt CDATA #REQUIRED
height CDATA #IMPLIED
width CDATA #IMPLIED
id ID #IMPLIED
>
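The practical effect of the ID type is a uniqueness check across the whole document. In this sketch, the file names and id values are placeholders.

```xml
<!-- Valid: each id value is unique within the document. -->
<img src="a.png" alt="First figure" id="fig1"/>
<img src="b.png" alt="Second figure" id="fig2"/>

<!-- Invalid: a second id="fig1" anywhere in the same document
     would be reported by a validating parser, as would id="1fig",
     because ID values must follow the rules for XML names. -->
```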
The IDREF and IDREFS attribute types are used more sparingly. The label element, which enables the
creation of labels for all elements in a document, has a for attribute that should contain an ID value
describing the content being labeled:
<!--This is compatible with the XHTML DTDs but does
not represent the complete declarations from the XHTML DTD-->
<!ATTLIST label
for IDREF #IMPLIED
>
This mechanism allows the label to refer to one and only one element in a document – the one that has
an id attribute value matching that of the label's for attribute.
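In a document, the pairing might look like the sketch below. The value email is a placeholder, and the other attributes of input are assumed to be declared elsewhere.

```xml
<!-- The for value must match the id of exactly one element. -->
<label for="email">E-mail address</label>
<input type="text" name="email" id="email"/>

<!-- Invalid: for="phone" when no element in the document has id="phone". -->
```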
IDREFS are used similarly, although they permit a single attribute to refer to multiple ID values. XHTML
1.0 uses IDREFS to allow table cells to point to the header labels that describe them:
<!--This is compatible with the XHTML DTDs but
does not represent the complete declarations from the XHTML DTD-->
<!ATTLIST td
headers IDREFS #IMPLIED
>
<!ATTLIST th
headers IDREFS #IMPLIED
>
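As an illustration of these declarations in use, a cell can list several header ids separated by spaces. The id values and cell contents here are placeholders.

```xml
<table>
  <tr>
    <th id="region">Region</th>
    <th id="q1">Q1</th>
  </tr>
  <tr>
    <!-- One reference: this row header belongs under "region". -->
    <th id="north" headers="region">North</th>
    <!-- Two references: this cell is described by both headers. -->
    <td headers="north q1">42</td>
  </tr>
</table>
```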
Complex tables sometimes sprout multiple levels of headers; the headers attribute can help manage
table reorganization or analysis.
XHTML uses the NMTOKEN attribute type to restrict content to a single word. In
the a, map, and object elements, NMTOKEN is used to restrict the value of name attributes to the same
rules that apply to id attributes:
<!--This is compatible with the XHTML DTDs but does not represent the complete declarations from the
XHTML DTD-->
<!ATTLIST a
name NMTOKEN #IMPLIED
id ID #IMPLIED
>
<!ATTLIST map
name NMTOKEN #IMPLIED
id ID #IMPLIED
>
<!ATTLIST object
name NMTOKEN #IMPLIED
id ID #IMPLIED
>
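A brief sketch of what the NMTOKEN restriction means in a document follows; the value chapter2 is a placeholder.

```xml
<!-- Valid: name holds a single token, and id carries the same value. -->
<a name="chapter2" id="chapter2">Chapter 2</a>

<!-- Invalid: name="chapter 2", because the space splits the value
     into two tokens and NMTOKEN allows only one. -->
```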
XHTML uses enumerated attributes to restrict the values for an attribute to a small set of permitted
choices, presented as a list. Enumerated attributes appear throughout the DTDs. Restricting values
with an enumerated attribute type is especially useful for input elements, in which the type
attribute defines the "real" meaning of the element:
<!--This is compatible with the XHTML DTDs but does not
represent the complete declarations from the XHTML DTD-->
<!ATTLIST input
type (text | password | checkbox | radio | submit | reset | file | hidden | image | button) "text"
>
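Against that declaration, the following fragments sketch the default and the restriction at work. The name attribute and its values are placeholders and are assumed to be declared elsewhere.

```xml
<!-- No type attribute: a validating parser supplies the default, text. -->
<input name="query"/>

<!-- An explicit choice from the enumerated list. -->
<input type="checkbox" name="subscribe"/>

<!-- Invalid: type="slider" is not one of the enumerated values. -->
```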
Enumerated types also are used for certain attributes (such as the ismap attribute for img elements,
which can have only one value if enumerated types are used):
<!ATTLIST img
ismap (ismap) #IMPLIED>
If the img element should be treated as an image map, the document creator should use the ismap
attribute shown here:
<img src='whatever.png' ismap='ismap' />
If the image isn't a map, the img element shouldn't have an ismap attribute at all, as shown here:
<img src='whatever.png' />
XHTML 1.0 doesn't use the NMTOKENS, NOTATION, ENTITY, or ENTITIES attribute types at all.
However, their use is not prohibited in XML DTDs that are designed to include or be included by
XHTML. If you encounter these types in a DTD you use with XHTML, consult the documentation for that
DTD regarding their proper use.