XML Notes...

Mostly just notes from various books on XML, most notably "Beginning XML 5th Edition", J Fawcett et al, Wrox Press.

Contents

  1. URL, URI or URN?
  2. Namespaces
  3. Document Type Definitions (DTDs)
  4. XML Schemas (XSD)
  5. XPath

URL, URI or URN?

Summary...

More Detail...

There is a StackOverflow thread on this question with many really good explanations. You can also read the RFC itself (RFC 3986).

From the RFC an "identifier" is defined as follows:

An identifier embodies the information required to distinguish what is being identified from all other things within its scope of identification.

So how is a URL different from a URI? The RFC also explains that:

A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location")

So, a locator is something that will provide a means of locating the resource. A URL is therefore an identifier and a locator, whereas a URI is an identifier, but not necessarily a locator.

I.e., URIs uniquely identify things but may not tell you how to find them. URLs are the subset of URIs that tell you how to find the objects identified.

And what about URNs?

The term "Uniform Resource Name" (URN) ... refer[s] to both URIs ... which are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable, and to any other URI ...

So URNs are URIs that act as names: those under the "urn" scheme are required to stay globally unique and persistent even when the resource has ceased to exist. Kind of a permanent URI which is more heavily regulated, usually by IANA.

So, to summarise, we could say that URLs both identify objects and tell you how to find them. URIs just identify objects, and URNs are URIs that are meant to persist through time.
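
The distinction can be poked at with Python's standard urllib.parse. This is only a rough sketch: the two example identifiers are made up, and treating "has a network authority" as "is a locator" is a simplifying assumption of this example, not a rule from the RFC.

```python
from urllib.parse import urlparse


def looks_like_locator(uri):
    """Rough heuristic (an assumption): a network authority says where to look."""
    return bool(urlparse(uri).netloc)


# Hypothetical identifiers, for illustration only.
url = urlparse("https://www.example.com/books/xml")
urn = urlparse("urn:isbn:0451450523")

print(url.scheme, url.netloc, url.path)  # https www.example.com /books/xml
print(urn.scheme, repr(urn.netloc), urn.path)  # urn '' isbn:0451450523

print(looks_like_locator("https://www.example.com/books/xml"))  # True
print(looks_like_locator("urn:isbn:0451450523"))  # False
```

Both strings parse as URIs, but only the first carries an access mechanism and a host to contact.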

Namespaces (xmlns)

Default

A way of grouping elements under a common heading in order to differentiate them from similarly named items. The namespace name is usually a URL or a URN (i.e., some URI). Note that namespaces that look like URLs are not URLs: the URL string is just used as a guaranteed-unique identifier.

To put a node and all of its children in a namespace, add xmlns="URI" to the node like so:

<parent_node xmlns="URI">
   <child1> ... </child1>
   ...
   <childN> ... </childN>
</parent_node>

In the above, parent_node and, recursively, all of its children are in the namespace identified by the URI.

This is called a default namespace and does not apply to attributes. Attribute namespaces must be separately declared.
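
One way to see this scoping is to parse such a document and look at the expanded names. The sketch below uses Python's standard xml.etree.ElementTree, which reports a namespaced name as {URI}localname. The namespace URI and the id and kind attributes are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the namespace URI and the attributes are illustrative.
DOC = (
    '<parent_node xmlns="http://example.com/ns" id="1">'
    '<child1 kind="a"/>'
    '</parent_node>'
)

root = ET.fromstring(DOC)
child = root[0]

print(root.tag)      # {http://example.com/ns}parent_node
print(child.tag)     # {http://example.com/ns}child1 (inherited from the parent)
print(root.attrib)   # {'id': '1'} (the attribute name carries no namespace)
print(child.attrib)  # {'kind': 'a'}
```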

Explicit

Explicit namespace declaration needs a prefix to represent it (must not include a colon or be reserved, e.g., "xmlns", "xml"...). Declare it using xmlns:<tag-name>="URI".

Note that declaring a namespace tag does not associate it with any nodes in the document.

<parent_node xmlns:mytag="URI">
   <child1> ... </child1>
   ...
   <childN> ... </childN>
</parent_node>

In the above, a tag by which to refer to the namespace is declared. The tag is called "mytag". As said, this just declares it. No nodes have yet been associated with it. To associate nodes with it use the following:

<mytag:parent_node xmlns:mytag="URI">
   <child1> ... </child1>
   ...
   <childN> ... </childN>
</mytag:parent_node>

The above associates parent_node with the namespace identified by mytag. Unlike a default namespace, a prefix is not inherited: the unprefixed children stay outside the "mytag" namespace, so each child that belongs in it must also be written with the "mytag:" prefix. To put attributes in the namespace the same "mytag:" prefix must be put before attribute names.
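
A quick way to check which namespace each node ends up in is to parse a prefixed document and print the expanded names. This is a small sketch using Python's standard xml.etree.ElementTree; the namespace URI, the child names and the attributes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one prefixed parent, one unprefixed and one prefixed
# child, plus one prefixed and one unprefixed attribute.
DOC = (
    '<mytag:parent_node xmlns:mytag="http://example.com/ns" mytag:id="1" plain="x">'
    '<child1/>'
    '<mytag:child2/>'
    '</mytag:parent_node>'
)

root = ET.fromstring(DOC)
tags = [root.tag] + [child.tag for child in root]

print(tags[0])      # {http://example.com/ns}parent_node
print(tags[1])      # child1 (no prefix, so no namespace)
print(tags[2])      # {http://example.com/ns}child2
print(root.attrib)  # {'{http://example.com/ns}id': '1', 'plain': 'x'}
```

Only the names written with the prefix come back with the namespace URI attached.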

Document Type Definitions (DTDs)

Used to give rules for validating the vocabulary of a document and its structure. Associate one with a document internally, by including it inline in the document, or externally, by referencing a separate DTD file.

Note: XML Schemas are the more "modern" way of defining what the structure of an XML document should look like!

DTDs ... [do] not offer all of the functionality of XML Schema ... DTDs have a unique syntax held over from SGML DTDs ... [and] are often criticized because of this need to learn a new syntax ...

Declare a DTD using...

<?xml version="1.0"?>
<!DOCTYPE name [
	... internal subset declarations ...
]>

Here name is the exact name of the root element in the XML doc. The DOCTYPE declaration must come before the root element, with only the XML declaration (plus any comments or processing instructions) ahead of it. To use external subset declarations from another file use <!DOCTYPE name SYSTEM "URI" [ ]>.

Inside the DOCTYPE specification there are 3 basic parts:

Elements

Must declare each element that can appear in the document including any namespace prefix (a restriction that XML schemas overcome). Some elements can be tagged as required and others can be optional.

Declare using...

<!ELEMENT name (content-model)>
             ^     ^
             ^     Allowed child elements, text, a mixture or empty
             ^
             Name of element as it appears in XML doc including namespace prefixes

The element content-model just specifies what can be a child of this element (named "name"). It can be a sequence, or a mutually-exclusive choice, of elements, or any combination of the two.

Sequences: A sequence has defined order and looks like a tuple: a comma separated list of names: (name1, name2, ..., nameN). Any element with these children must contain all the children, no more and no less, and they must appear in the defined order.

Choices: A choice defines a mutually-exclusive set of children: (name1 | name2 | ... | nameN). An element with this content model must contain exactly one of the listed children.

Sequences and choices can be combined: e.g., (name1, (name2 | name3)) means that the node this applies to must have name1 as its first child element and then a second child element which must be either name2 or name3. It must have exactly two children in this order.
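
Since a content model reads much like a regular expression over the list of child names, the combined example can be imitated with Python's re module. This is only an analogy sketched by hand, not a DTD validator: the regex is a manual translation of (name1, (name2 | name3)), and the helper name children_match is invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated regex for the content model (name1, (name2 | name3)).
# Each child name is followed by a space so that names cannot run together.
MODEL = re.compile(r"name1 (?:name2 |name3 )")


def children_match(xml_text, model=MODEL):
    """Return True if the root's child element names fit the model exactly."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return model.fullmatch(names) is not None


print(children_match("<node><name1/><name3/></node>"))           # True
print(children_match("<node><name2/><name1/></node>"))           # False
print(children_match("<node><name1/><name2/><name3/></node>"))   # False
```

The second document fails on order and the third on having a child too many.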

Text & Mixed Content: Mixed content is any content where text is allowed. To say that text is allowed just substitute #PCDATA in where you would have put an element name in the choice specification above. It stands for Parsed Character Data. The "Parsed" part of the name implies that the parser will interpret any XML reserved characters found within. To specify mixed content you must use the choice mechanism and the #PCDATA must be the first item in the choice. Do it like this:

<!ELEMENT name (#PCDATA | el1 | ... | elN)*>

The * is used to say that the choice can repeat zero or many times, like in a normal regular expression, and it must directly follow the closing bracket. The standard regexp operators *, + and ? can be applied to sequences, choices and individual element names. For mixed content, though, only the (#PCDATA | ...)* form is allowed.

Empty: <!ELEMENT name EMPTY> specifies the element named "name" may not have any content (no children and no text).

Any: <!ELEMENT name ANY> specifies the element named "name" may contain any content.

Attributes

Instead of declaring allowable content models for elements, you declare a list of allowable attributes for each element using ATTLIST declarations:

<!ELEMENT el-name (content-model)>
<!ATTLIST el-name attr_1-name attr_1-type ...
                  attr_2-name attr_2-type ...
                  ...
                  attr_N-name attr_N-type ...>

Attribute types:

Type             Description
CDATA            Character data: any string of text.
ID               Attribute value uniquely identifies the containing element.
IDREF(S)         IDREF indicates the attribute value is a reference, by ID, to an element. IDREFS is a whitespace-separated list of the former.
ENTITY           Attribute is a reference to an external unparsed entity.
ENTITIES         Whitespace-separated list of ENTITY.
NMTOKEN(S)       Attribute is a name token: a string of name characters with no whitespace. The plural is a whitespace-separated list of the former.
Enumerated list  List of possible values the attribute may take, specified using the same choice mechanism used for element declarations.

Default Value

<!ATTLIST element-name attr-name (attr1 | attr2 | ... | attrN) "attrM">

Will declare for the element named "element-name" an allowable attribute named "attr-name" that may have any of the values represented by "attr1" through "attrN". If the attribute is absent, a validating parser will supply the default value specified in quotes at the end, i.e. "attrM".

Fixed Value

#FIXED says that an attribute's value cannot change. It operates like a default value: a validating parser will insert it if it is not found.

Required Value

#REQUIRED says that an attribute is required and must be included in the XML doc.

No Default Value (Implied)

<!ATTLIST element-name attr-name (attr1 | attr2 | ... | attrN) #IMPLIED>

#IMPLIED means attribute does not need to appear in element and has no default value. More specifically, it has no fixed value, no default and is not required.
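
To put the element and attribute declarations together, here is a small document with an internal DTD subset, wrapped in Python so it can at least be parsed. All element and attribute names (log, entry, id, level, src, ver) are hypothetical. Note the assumption being made: Python's standard xml.etree.ElementTree only checks well-formedness and does not validate against the DTD, so this shows the syntax rather than enforcement.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary, for illustration only.
DOC = """<!DOCTYPE log [
  <!ELEMENT log (entry*)>
  <!ELEMENT entry (#PCDATA)>
  <!ATTLIST entry
            id    ID                    #REQUIRED
            level (info | warn | error) "info"
            src   CDATA                 #IMPLIED
            ver   CDATA                 #FIXED "1">
]>
<log>
  <entry id="e1" level="warn">disk nearly full</entry>
  <entry id="e2">all good</entry>
</log>"""

root = ET.fromstring(DOC)  # parses the DOCTYPE but does NOT validate

ids = [entry.get("id") for entry in root]
texts = [entry.text for entry in root]
print(ids)    # ['e1', 'e2']
print(texts)  # ['disk nearly full', 'all good']

# Whether the declared default for "level" is filled in on e2 depends on the
# parser, so this sketch only reads the value that was written explicitly.
print(root[0].get("level"))  # warn
```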

XML Schemas (XSD)

Based on XML Schema Part 0: Primer Second Edition W3C Recommendation 28 October 2004.

Minimal document looks like:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:annotation>
      <xsd:documentation xml:lang="en">
         Some description of schema with copyright notice.
      </xsd:documentation>
   </xsd:annotation>
</xsd:schema>

Everything is prefixed with xsd:. After the annotation sections comes the structure that will define the grammar of any XML file based on this schema.

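Either complex or simple types can be defined:

  • Complex types allow elements in their content and may carry attributes.
  • Simple types cannot have element content and cannot carry attributes.

Also, it's important to understand the difference between definitions and declarations:

  • Definitions create new types (both simple and complex).
  • Declarations enable elements and attributes with specific names and types to appear in document instances.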

Okay, so I have visitors coming and going and I want to save this traffic so that I know who is in the building at any one time. For each visitor I want to know:

  1. Their full name and title,
  2. Their company name and address,
  3. Who they have come to see,
  4. The time they arrived,
  5. The time they left.

There are some simple types here and some complex types. The name and title could just be a string, in which case it would be a simple type, or it could be a complex type splitting up the name into entries for title, forename and surname, for example. Maybe my log should look like this per visitor:

<visitor>
   <name>Mike Jordan</name>
   <company>
      <name>MBNA Lts.</name>
      <address>
         <number_or_name>Building 123</number_or_name>
         <street1>The Business Estate</street1>
         <street2>310 Business Road</street2>
         <postcode>BE345LT</postcode>
      </address>
   </company>
   <visited>Yvette Prieto</visited>
   <time_in>10:44 01/02/2017</time_in>
   <time_out>11:25 01/02/2017</time_out>
</visitor>

The point of an XSD spec is to ensure that whenever an XML entry is made in my log, it is done correctly. Information about the visitor should not be missed out and the contents of the fields should be sensible. For example, the time fields should give a time and date in a specific format.

So I can declare a visitor node in the document. This allows this node to occur once in my document:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:annotation>
      ...
   </xsd:annotation>
   <xsd:element name="visitor" type="VisitorType"/>
</xsd:schema>

Simple XSD Built-in Types

There are many simple types that are "built-in" to the XSD spec. To use these types just prefix them with "xsd:". Some of the most common you might use are...

  • string,
  • integer,
  • (positive|negative)Integer,
  • unsigned(Short|Int|Long), signed variants by using lower case of type without the "unsigned" prefix,
  • float, double,
  • dateTime, date, time,
  • duration
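
Of the built-in types listed, dateTime is the one most likely to trip up the visitor log, because its written form is fixed as YYYY-MM-DDThh:mm:ss. The sketch below converts a value in the style of the sample log into that form. It assumes the log writes dates day-first (DD/MM/YYYY), which the notes do not actually state, and the function name to_xsd_datetime is made up.

```python
from datetime import datetime


def to_xsd_datetime(log_value):
    """Convert 'HH:MM DD/MM/YYYY' (day-first is an assumption) to xsd:dateTime form."""
    parsed = datetime.strptime(log_value, "%H:%M %d/%m/%Y")
    return parsed.isoformat()


print(to_xsd_datetime("10:44 01/02/2017"))  # 2017-02-01T10:44:00
print(to_xsd_datetime("11:25 01/02/2017"))  # 2017-02-01T11:25:00
```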

I must still define what a VisitorType is. So underneath the element declaration I also need to add the following definition:

<xsd:complexType name="VisitorType">
   <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="company" type="CompanyType"/>
      <xsd:element name="visited" type="xsd:string"/>
      <xsd:element name="time_in" type="xsd:dateTime"/>
      <xsd:element name="time_out" type="xsd:dateTime"/>
   </xsd:sequence>
</xsd:complexType>

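Now we've defined what the VisitorType looks like. Most of the nodes residing inside the VisitorType are simple types. These are the ones where the type reads "xsd:..." (see the list of built-in types above). Note that xsd:dateTime expects values of the form YYYY-MM-DDThh:mm:ss, so the time_in and time_out values in the sample log would have to be written that way to validate. But there is a complex type in there, the CompanyType. This still needs to be defined: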

<xsd:complexType name="CompanyType">
   <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="address" type="AddressType"/>
   </xsd:sequence>
</xsd:complexType>

And this type also has an embedded complex type:

<xsd:complexType name="AddressType">
   <xsd:sequence>
      <xsd:element name="number_or_name" type="xsd:string"/>
      <xsd:element name="street1" type="xsd:string"/>
      <xsd:element name="street2" type="xsd:string"/>
      <xsd:element name="postcode" type="xsd:string"/>
   </xsd:sequence>
</xsd:complexType>

As you can see, we have to keep defining until we only have built-in types.
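
The "keep defining until only built-in types remain" rule can be checked mechanically. The sketch below assembles the pieces above into one schema and lists any type that an element refers to but nothing defines. It is not a schema validator (Python's standard library has none): the helper undefined_types is invented here, and it simply assumes that anything written with the xsd: prefix is a built-in type.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:element name="visitor" type="VisitorType"/>
   <xsd:complexType name="VisitorType">
      <xsd:sequence>
         <xsd:element name="name" type="xsd:string"/>
         <xsd:element name="company" type="CompanyType"/>
         <xsd:element name="visited" type="xsd:string"/>
         <xsd:element name="time_in" type="xsd:dateTime"/>
         <xsd:element name="time_out" type="xsd:dateTime"/>
      </xsd:sequence>
   </xsd:complexType>
   <xsd:complexType name="CompanyType">
      <xsd:sequence>
         <xsd:element name="name" type="xsd:string"/>
         <xsd:element name="address" type="AddressType"/>
      </xsd:sequence>
   </xsd:complexType>
   <xsd:complexType name="AddressType">
      <xsd:sequence>
         <xsd:element name="number_or_name" type="xsd:string"/>
         <xsd:element name="street1" type="xsd:string"/>
         <xsd:element name="street2" type="xsd:string"/>
         <xsd:element name="postcode" type="xsd:string"/>
      </xsd:sequence>
   </xsd:complexType>
</xsd:schema>"""


def undefined_types(schema_text):
    """List type names that are used by an element but never defined."""
    root = ET.fromstring(schema_text)
    defined = {ct.get("name") for ct in root.iter(f"{{{XSD_NS}}}complexType")}
    used = {el.get("type") for el in root.iter(f"{{{XSD_NS}}}element")}
    # Simplification: anything written with the xsd: prefix is taken as built in.
    return sorted(t for t in used
                  if t and not t.startswith("xsd:") and t not in defined)


print(undefined_types(SCHEMA))  # []

# Renaming the AddressType definition leaves a dangling reference behind.
broken = SCHEMA.replace('<xsd:complexType name="AddressType">',
                        '<xsd:complexType name="SomethingElse">')
print(undefined_types(broken))  # ['AddressType']
```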

XPath

Used to specify/select parts of XML documents of interest...

Expression                   Selects
/                            The document root node
/nodename                    The nodes named nodename that are direct children of the root node
nodename                     The nodes named nodename that are direct children of the current node
//                           Nodes at any depth: at the start of a path it searches the whole document, while .// searches the descendants of the current node
//nodename                   All the nodes named nodename anywhere in the document (use .//nodename for descendants of the current node only)
.                            The current node
..                           The parent node
@                            Attribute nodes
nodename[predicate]          The nodes named nodename that match the predicate. The predicate can be anything like:
  [idx]                      The idx-th child named nodename of the current node. Indices start from 1.
  [last()-i]                 The nodename child i places before the last one. [last()] on its own selects the last.
  [childnodename op value]   Those that have a child node named childnodename whose value compares as op to value. Operator can be one of: | + - * div = != < > <= >= or and mod.
//nodename[@attrname=value]  Any node named nodename that has an attribute named attrname with the value value.
*                            Matches any element node
@*                           Matches any attribute
So for example you could specify a relative XPath like the one below.

testSet[1]/test[inputs/input[1]/@value='ival']

The above says to find all of the test nodes in the first testSet child of the current node, where the first input to the test has an attribute called "value" with the value "ival".
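
ElementTree in Python's standard library cannot evaluate that expression directly, because it does not support a path nested inside a predicate. The sketch below spells the same selection out step by step instead, using only the simple paths ElementTree does support. The suite document, the name attributes and the function matching_tests are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical test-suite document.
DOC = """<suite>
  <testSet>
    <test name="t1"><inputs><input value="ival"/><input value="x"/></inputs></test>
    <test name="t2"><inputs><input value="x"/><input value="ival"/></inputs></test>
  </testSet>
  <testSet>
    <test name="t3"><inputs><input value="ival"/></inputs></test>
  </testSet>
</suite>"""

current = ET.fromstring(DOC)  # plays the role of the current node


def matching_tests(node):
    """Emulate testSet[1]/test[inputs/input[1]/@value='ival'] by hand."""
    first_set = node.find("testSet[1]")
    if first_set is None:
        return []
    return [test for test in first_set.findall("test")
            if any(inp.get("value") == "ival"
                   for inp in test.findall("inputs/input[1]"))]


names = [test.get("name") for test in matching_tests(current)]
print(names)  # ['t1']
```

Here t2 is rejected because "ival" is on its second input, and t3 because it sits in the second testSet.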

To select an attribute, rather than a node, you would do something like:

/path/to/node/@attribute-name

As another example, if you were searching for all nodes which have an attribute starting with a particular string [Ref] you might use the following:

//*[@*[starts-with(., "the-string-you're-looking-for")]]
^^^ ^^ ^^^^^^^^^^^ ^   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^ ^^ ^^^^^^^^^^^ ^  String attribute is compared with
^^^ ^^ ^^^^^^^^^^^ Pass the current attribute node(set) being considered
^^^ ^^ The starts-with function returns true if the first argument string starts
^^^ ^^   with the second argument string, and otherwise returns false.
^^^ Matches any attribute that matches the following predicate
Matches any node that matches the following predicate

Interestingly, as the SO user points out, the above is not the same as the following:

//*[starts-with(@*, "the-string-you're-looking-for")]

The reason for this can be found in the W3C docs on string functions. From the docs we see the following:

The string function converts an object to a string as follows:

  • A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.
  • A number is converted to a string as follows:
    • NaN is converted to the string NaN
    • positive [and negative] zero is converted to the string 0
    • positive infinity is converted to the string Infinity
    • negative infinity is converted to the string -Infinity
    • if the number is an integer, the number is represented in decimal form as a Number with no decimal point and no leading zeros, preceded by a minus sign (-) if the number is negative
    • otherwise, the number is represented in decimal form as a Number including a decimal point with at least one digit before the decimal point and at least one digit after the decimal point, preceded by a minus sign (-) if the number is negative; there must be no leading zeros before the decimal point apart possibly from the one required digit immediately before the decimal point; beyond the one required digit after the decimal point there must be as many, but only as many, more digits as are needed to uniquely distinguish the number from all other IEEE 754 numeric values.
  • The boolean false value is converted to the string false. The boolean true value is converted to the string true.
  • An object of a type other than the four basic types is converted to a string in a way that is dependent on that type
  • If the argument is omitted, it defaults to a node-set with the context node as its only member.

Thus by passing the node-set @* to starts-with() the function will only consider the first node, because its first argument is of type string, and so @* is converted to a string by returning the string-value of the node that is first in document order in the node-set: GOTCHA :)
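
The difference between the two expressions can be imitated in plain Python. This is an emulation, not real XPath evaluation (ElementTree has no starts-with), and it assumes the "first" attribute is the first one written in the source, whereas a real XPath engine is free to order attribute nodes however it likes. The document and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <a> has the match on its second attribute only.
DOC = ('<root>'
       '<a x="other" y="needle-1"/>'
       '<b x="needle-2" y="other"/>'
       '<c x="other"/>'
       '</root>')


def any_attr_starts_with(root, prefix):
    """Emulates //*[@*[starts-with(., prefix)]]: every attribute is tested."""
    return [el.tag for el in root.iter()
            if any(value.startswith(prefix) for value in el.attrib.values())]


def first_attr_starts_with(root, prefix):
    """Emulates //*[starts-with(@*, prefix)]: only the first attribute counts."""
    result = []
    for el in root.iter():
        values = list(el.attrib.values())
        first = values[0] if values else ""  # empty node-set -> empty string
        if first.startswith(prefix):
            result.append(el.tag)
    return result


tree = ET.fromstring(DOC)
print(any_attr_starts_with(tree, "needle"))    # ['a', 'b']
print(first_attr_starts_with(tree, "needle"))  # ['b']
```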