René Nyffenegger's collection of things on the web

XML Document

<?xml version="1.0"?>
<email>
  <sent timezone="UTC">12:54</sent>
  <from>
    <name>James Bond</name>
    <email-address>james.bond@secret-service.uk</email-address>
  </from>
  <to>
    <email-address>moneypenny@secret-service.uk</email-address>
  </to>
  <body>
    I'd like to take you out for a coffee.
  </body>
</email>
A document consists of the prolog and the root element.

The document prolog

XML Declaration

The XML declaration states some general properties about the document.
<?xml version="1.0"?>
The XML declaration can set three properties:
  • version
  • encoding
    for example encoding="UTF-8"
  • standalone
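    Possible values: yes and no. yes: no external markup declarations affect the document's content; everything the parser needs is declared in the internal subset. no: an external subset (or an external parameter entity) may affect the content. Caution: yes only makes a statement about declarations; the document may still reference external entities, as long as they are declared internally.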
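
The three properties can be read back with a parser. The following is a minimal sketch using Python's xml.parsers.expat module; the helper name read_declaration and the sample document are made up for illustration.

```python
import xml.parsers.expat


def read_declaration(xml_bytes):
    """Return the properties found in the XML declaration of xml_bytes."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # expat reports standalone as 1 (yes), 0 (no) or -1 (omitted).
        found["version"] = version
        found["encoding"] = encoding
        found["standalone"] = {1: "yes", 0: "no"}.get(standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(xml_bytes, True)
    return found


decl = read_declaration(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><email/>'
)
print(decl)
```

If the standalone clause is omitted, the sketch stores None for it.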

Document type declaration

Document type declaration is different from document type definition. The document type definition is embedded in the document type declaration.
<!DOCTYPE root-element
  SYSTEM "uri-of-dtd"
  [
    internal-subset
  ]>
The DOCTYPE 'statement' links an XML Document to a DTD.
The document type declaration names the root element type.
Additionally, it can point to a document type definition (DTD) that governs the markup structure.
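
As an illustration, Python's xml.dom.minidom exposes the document type declaration as the doctype property of the parsed document. This is only a sketch: the file name email.dtd and the entity signature are assumptions, and minidom does not fetch the external DTD.

```python
from xml.dom import minidom

# Hypothetical document: "email.dtd" is not loaded, only recorded.
SOURCE = """<?xml version="1.0"?>
<!DOCTYPE email SYSTEM "email.dtd" [
  <!ENTITY signature "James Bond">
]>
<email/>"""

doc = minidom.parseString(SOURCE)
doctype = doc.doctype

print(doctype.name)                 # root element type named by the declaration
print(doctype.systemId)             # where the external DTD is said to live
print(doc.documentElement.tagName)  # the actual root element
```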

Document type definition

Document type definitions (or DTDs) define the structure of an XML Application.

Set of declarations

Some terminology

Root element

The all-enclosing element (here: email) is called the root element.
It is also called the document element.
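
A short sketch, assuming Python's xml.etree.ElementTree, shows how the root element of the email document from the top of this page can be accessed:

```python
import xml.etree.ElementTree as ET

EMAIL = """<?xml version="1.0"?>
<email>
  <sent timezone="UTC">12:54</sent>
  <from>
    <name>James Bond</name>
    <email-address>james.bond@secret-service.uk</email-address>
  </from>
  <to>
    <email-address>moneypenny@secret-service.uk</email-address>
  </to>
  <body>
    I'd like to take you out for a coffee.
  </body>
</email>"""

# fromstring() returns the root element directly.
root = ET.fromstring(EMAIL)
root_tag = root.tag
child_tags = [child.tag for child in root]
sender = root.findtext("from/name")

print(root_tag, child_tags, sender)
```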

Internal subset

Document root

The document root is the very first node of the XML tree; it is essentially virtual, since it does not appear as markup in the document. This is the DOM's representation. The document root is the parent of the document element, which is the root element of the XML document, that is, its outermost element.
The root node can have other children besides the document element, e.g. comments and processing instructions.
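
The difference between document root and document element can be made visible with a DOM parser. The sketch below assumes Python's xml.dom.minidom; the comment text and the processing instruction named app are invented for the example.

```python
from xml.dom import minidom, Node

SOURCE = '<?xml version="1.0"?><!-- sent by 007 --><?app keep-secret?><email/>'

doc = minidom.parseString(SOURCE)

# The document root has three children; only one of them is an element.
top_level = [node.nodeType for node in doc.childNodes]
element = doc.documentElement

print(top_level)
print(element.tagName, element.parentNode is doc)
```

Note that the XML declaration itself does not show up as a child node.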

Tags

Tags are delimited by < and > (for example, <from> and <email-address> are tags).

Start Tags

End Tags

Empty Element Tag

<tag-name/>

Attributes

A tag might contain attributes. Each attribute name must be unique within its tag.
<tag-name attribute-one="value-one" further-attribute="further-value">
Attributes are used to
  • Uniquely name a tag (for further linking to the tag) (id and idref)
  • State properties about the tag.
The values of attributes can be restricted to a value domain if needed (for example: January, February, ..., so that Easter would be invalid).
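
The uniqueness rule is a well-formedness constraint, so a parser rejects a tag that repeats an attribute name. A minimal sketch, assuming Python's xml.etree.ElementTree (the helper is_well_formed is made up for this example):

```python
import xml.etree.ElementTree as ET

# Reading an attribute from the <sent> element of the email document.
sent = ET.fromstring('<sent timezone="UTC">12:54</sent>')
timezone = sent.get("timezone")


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The same attribute name twice in one tag is an error.
duplicate_ok = is_well_formed('<sent timezone="UTC" timezone="CET">12:54</sent>')

print(timezone, duplicate_ok)
```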

Reserved attributes:

  • xml:lang

    Possible values: ISO 639, RFC 1766, a user defined language code (for example X-klingon) or ISO 3166 and some others.
  • xml:space
    Possible values: preserve and default
    Compare <pre> in HTML.
  • xml:link
  • xml:attributes

Types of Attributes:

  • ID
  • IDREF

Special attributes: actuate

Attribute values are subject to attribute-value normalization.
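
Normalization means, among other things, that a literal line break or tab inside an attribute value reaches the application as a space, while a character reference such as &#10; is kept. The following sketch, assuming Python's xml.etree.ElementTree and an invented element named note, illustrates this:

```python
import xml.etree.ElementTree as ET

# "text" contains a literal newline and a literal tab;
# "raw" contains a newline written as a character reference.
element = ET.fromstring('<note text="line one\nline two\tend" raw="a&#10;b"/>')

normalized = element.get("text")
referenced = element.get("raw")

print(repr(normalized))
print(repr(referenced))
```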

See also Elements vs Attributes, Using Elements and Attributes, When is an Attribute an Attribute and When to use attributes as opposed to elements.

Elements

Elements are enclosed between an opening and a closing tag. That is, tags divide the document's content into elements. Elements might contain other elements.

There are four types of elements:

  • Element content
    Must not contain anything except other Elements
  • Mixed content
    May contain Elements and characters.
  • Mixed content without other Elements
    Does not contain other Elements, but characters.
  • Empty Element
    Does not contain anything.
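
Without a DTD, the four types can only be guessed from what an element actually contains. The sketch below is such a heuristic, assuming Python's xml.etree.ElementTree; the function content_kind and the sample document are made up, and whitespace-only text is treated as no text.

```python
import xml.etree.ElementTree as ET


def content_kind(element):
    """Guess the content type of element from its actual content."""
    has_children = len(element) > 0
    # Text can sit before the first child (text) or after any child (tail).
    texts = [element.text] + [child.tail for child in element]
    has_text = any(t and t.strip() for t in texts)
    if has_children and has_text:
        return "mixed content"
    if has_children:
        return "element content"
    if has_text:
        return "character data only"
    return "empty"


doc = ET.fromstring(
    "<email><from><name>James Bond</name></from>"
    "<body>See <b>you</b> soon</body><flag/></email>"
)
kinds = {child.tag: content_kind(child) for child in doc}
print(kinds)
```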

Entities

An entity has a name and a value. The entity's name is a placeholder for content (determined by said value).

General entities

<!DOCTYPE some_root_elem SYSTEM "/home/ego/some.dtd"
[
  <!ENTITY entity-name "Some value to be inserted at the entity">
]>
&entity-name;
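
A parser replaces the entity reference with the declared value. The following sketch assumes Python's xml.etree.ElementTree; the entity name signature is invented, and the external DTD is left out so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

SOURCE = """<!DOCTYPE email [
  <!ENTITY signature "James Bond">
]>
<email><body>Yours, &signature;</body></email>"""

root = ET.fromstring(SOURCE)

# The reference &signature; has been expanded by the parser.
body_text = root.findtext("body")
print(body_text)
```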

Parameter entities

%entity-name;

Character entities

The following five character entities are predefined.
&amp; &lt; &gt; &apos; &quot;
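
The sketch below, assuming Python's xml.etree.ElementTree and xml.sax.saxutils, shows both directions: a parser turns the predefined entities back into characters, and escape() produces them when text is written into a document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing: the five predefined entities become plain characters.
element = ET.fromstring("<t>&lt;from&gt; &amp; &quot;to&quot; &apos;body&apos;</t>")
decoded = element.text

# Writing: escape() replaces &, < and > so the text is safe as content.
encoded = escape("1 < 2 & 3 > 2")

print(decoded)
print(encoded)
```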

Numbered character

Numbered characters (numeric character references) need no declaration.
&#198; &#169; &#xad;
These stand for: Æ, © and the soft hyphen (which is normally invisible).

The hexadecimal form (&#x) is useful as well, because Unicode code points are usually given in hexadecimal.
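
As a small sketch, assuming Python's xml.etree.ElementTree, the three references above can be decoded by a parser, and a hexadecimal reference can be built from a character's code point (the helper char_ref is made up for this example):

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references decode to single characters.
element = ET.fromstring("<t>&#198;&#169;&#xad;</t>")
characters = element.text


def char_ref(character):
    """Return the hexadecimal character reference for one character."""
    return f"&#x{ord(character):x};"


print([hex(ord(c)) for c in characters])
print(char_ref("\u00c6"), char_ref("\u00a9"), char_ref("\u00ad"))
```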

CDATA Sections

CDATA is character data, that is, markup-free data.
<![CDATA[ some text, possibly having < and >  here ]]>
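
Inside a CDATA section, characters such as < and & need no escaping; the parser hands them over as plain text. A minimal sketch, assuming Python's xml.etree.ElementTree and an invented element named code:

```python
import xml.etree.ElementTree as ET

SOURCE = "<code><![CDATA[if (a < b && b > c) { run(); }]]></code>"

element = ET.fromstring(SOURCE)

# The CDATA delimiters are gone, the content is untouched.
code_text = element.text
print(code_text)
```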

Processing Instructions (PI)

<?processing-instruction-name data for the processing instruction ?>
Not coincidentally, an XML declaration looks like a processing instruction.

A use for processing instructions is to reference a stylesheet from within an XML document.
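
The stylesheet reference is conventionally written as an xml-stylesheet processing instruction. The sketch below, assuming Python's xml.dom.minidom, reads such an instruction; the file name email.xsl is hypothetical.

```python
from xml.dom import minidom, Node

SOURCE = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="email.xsl"?>'
    "<email/>"
)

doc = minidom.parseString(SOURCE)

# Collect (target, data) of every processing instruction before the root.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

The parser does not interpret the data part; what looks like attributes is just a string for the application to evaluate.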

Comments

<!--  here is some comment
      further line -->