XML Validation—Keeping Your Options Open

Axel Kramer, Draft Version 0.1 - 1999-05-03


Contents

  1. Benefits and challenges in validating XML documents
  2. What are the XML validation options?
  3. Where is XML validation technology going?
  4. What should one do now?
    1. Turn enumeration values into attributes
    2. Standardize name of sole attribute in empty element
    3. Turn values of primitive types into element content
    4. Split element if more than one primitive value
    5. Use ISO date and date-time format
    6. Create fixed type attribute for primitive types in DTD


1 Benefits and challenges in validating XML documents

One benefit of XML technology, which is often quoted by XML proponents, is that documents written in an XML-based language can be validated. The content of a valid XML document adheres to certain structural and data constraints. This differentiates a valid XML document from a well-formed XML document, which is only required to adhere to the XML syntax.

Why strive for validation of XML documents? An XML compute server can reject faulty and inconsistent documents, creators of documents can make sure their documents conform to the standard of their domain, and user interfaces can use the validation description to offer input choices to the user. Aspects of this have been implemented before, without XML and without XML validation. The additional quality lies in moving the validation out of the particular applications that handle the data. Independent of any particular application, one can say: yes, the content of this document is valid with respect to the standard published in this domain. Factoring validation out in this way makes applications slimmer and the validation code itself easier to maintain and extend.

The challenge is that there are various ways to express "validity", and none of them captures all the desirable semantic constraints on the data. There is a "progression" in what one can check with the two existing technologies, DTDs and schemas, but the choice of technology drives how one can represent the data. For some time to come there will be a split between validation rules that can be expressed in one of the standardized XML validation technologies and rules that have to be integrated into an application.

The purpose of this paper is to explore the options and limitations, and to give rules of thumb for designing XML-based languages with the technologies available today so that one maximizes one's options going forward. This paper does not discuss the extensibility of validation, nor the versioning of validation over time; both issues might create yet another set of requirements for designers of XML-based languages.


2 What are the XML validation options?

The following tables describe the options available today and present their advantages and disadvantages. There are two aspects: what one could validate, and when one could validate.

The first table is sorted by the amount of validation possible with a given approach.

 

      Type                   Technology     Current Limitations
  1.  No Structure           none
  2.  Structure, No Values   DTD, Schema    No local element types, thus no clean instance variable vs. object.
  3.  Only Enumerations      DTD, Schema    Only for attribute-values.
  4.  Primitive Datatypes    Schema         Only for element text content.
  5.  Inheritance            Schema
  6.  Semantic Validation    Programmatic

  Table 1. What to Validate

Orthogonal to these approaches is when validation happens.

 

      Type                Technology               Current Limitations
  1.  Never               none
  2.  Parse Integrated    Validating Parser        Schema validation not standardized yet.
  3.  External & Signed   Programmatic & DOMHash   Modularized validation and element signatures not standardized yet.

  Table 2. When to Validate

3 Where is XML validation technology going?

The Microsoft XML parser for IE 5.0 and the corresponding 100% Java implementation from Datachannel implement a version of schema which was derived from the XML-Data and DCD proposals. The advantage of using it is that it actually works today, and that schemata defined with it can later be converted via XSL style-sheets into whichever schema standard is finally recommended by W3C.

The W3C has a working group for schema technology. A requirements document was recently published, and a number of schema proposals have been submitted over the last year. I don't think that features available in the current MS schema implementation will vanish, although their syntax might change. I assume that, beyond the inclusion of inheritance, there will be a drive to make data typing more consistent between elements and attributes. It is hard to say how much more "data-centric" the new schema proposal will be; it sometimes feels as if it is still stuck in document-land and DTD compatibility.

IBM's XML parser makes the validation module for the Java parser pluggable. This might enable an easier implementation of real semantic validation. They also made their DOMHash proposal available as an implementation to sign XML elements. This might be an interesting test-bed to combine semantic validation with signing of elements.


4 What should one do now?

One path is to concentrate on semantic validation and signed elements, since that would cover the greatest space in a domain. That is a strategic development, which one could implement locally on a per-domain basis, as well as contribute to the W3C so that the schema technology discussed there supports such an approach smoothly.

In the short term, one wants to make sure that the XML-based languages being developed enable the maximum use of the validation available in DTDs and MS schemata. The following guidelines serve that goal. They enable one to use a DTD for describing structure and enumerations. In addition, one can define a schema to check primitive types. Furthermore, the use of a defaulted and fixed type attribute allows for dynamic checking of types in user interfaces.

  1. Turn enumeration values into attributes
  2. Standardize name of sole attribute in empty element
  3. Turn values of primitive types into element content
  4. Split element if more than one primitive value
  5. Use ISO date and date-time format
  6. Create fixed type attribute for primitive types in DTD

4.1 Turn enumeration values into attributes

For values that are members of an enumeration, use an attribute rather than the element text content.

For example, instead of using:

	<Option>short</Option>

use:

	<Option value="short"/>
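
The reason behind this guideline is that a DTD can restrict only attribute values to an enumerated list (see Table 1). A minimal sketch of such a declaration follows; the second enumeration member, "long", is an assumption made for illustration.

```xml
<!-- Option has no content; its single attribute is restricted to an enumeration. -->
<!-- The member "long" is hypothetical and only illustrates a second choice. -->
<!ELEMENT Option EMPTY>
<!ATTLIST Option value (short | long) #REQUIRED>
```

With this declaration a validating parser would reject <Option value="medium"/>, whereas a DTD has no way to check the text content of <Option>medium</Option>.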

4.2 Standardize name of sole attribute in empty element

If an element's only value is carried in an attribute, consistently name that attribute "value". If the element carries other values as well, give the attribute a semantically meaningful name.

For example, instead of using:

	<Option attr="short"/>
	<Money value="USD">10000</Money>

use:

	<Option value="short"/>
	<Money ccy="USD">10000</Money>

4.3 Turn values of primitive types into element content

If the value of an element is of a primitive type, that is, an int, float, date, or string, turn it into element content.

For example, instead of using:

	<Money ccy="USD" amount="100000"/>

use:

	<Money ccy="USD">100000</Money>
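
Moving the amount into element content matters because, as Table 1 notes, primitive datatypes can currently be checked only on element text content. The following is a sketch of how a schema might express this in the XML-Data Reduced syntax used by the Microsoft parser; the float datatype and the list of currency codes are assumptions chosen for illustration.

```xml
<Schema xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <!-- Enumerated attribute for the currency code (the values are illustrative). -->
  <AttributeType name="ccy" dt:type="enumeration" dt:values="USD EUR" required="yes"/>
  <!-- Text-only element whose content is checked against a primitive datatype. -->
  <ElementType name="Money" content="textOnly" dt:type="float">
    <attribute type="ccy"/>
  </ElementType>
</Schema>
```

Under these assumptions, <Money ccy="USD">100000</Money> would pass the type check, while non-numeric content such as <Money ccy="USD">lots</Money> would not.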

4.4 Split element if more than one primitive value

If an element holds more than one value of primitive type, think about turning the values into sub-elements.
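
As an illustration of this guideline, consider a period described by two dates. The element names Period, start, and end are hypothetical and serve only as an example.

```xml
<!-- Instead of packing two primitive values into one element: -->
<Period>1999-01-05 1999-07-05</Period>

<!-- use one sub-element per value, so that each can be typed and checked on its own: -->
<Period>
  <start>1999-01-05</start>
  <end>1999-07-05</end>
</Period>
```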

4.5 Use ISO date and date-time format

Do not base the date representation on current legacy formats. Make sure to use the ISO standard for date representation (ISO 8601).

For example, instead of using:

	<start>5 Jan 99</start>

use:

	<start>1999-01-05</start>
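
The same applies to values that carry a time of day. Assuming the extended ISO 8601 notation, in which a "T" separates date and time and a trailing "Z" marks UTC, a date-time value could be written as follows (the time shown is made up for illustration):

```xml
<!-- Date and time in one value: YYYY-MM-DDThh:mm:ss, with Z denoting UTC. -->
<start>1999-01-05T09:30:00Z</start>
```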

4.6 Create fixed type attribute for primitive types in DTD

For each element that has a primitive type as its text content, define a fixed, defaulted attribute named "type" that contains the type name.
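
A minimal sketch of such a declaration for the Money element used earlier is shown below. The type name "float" and the enumeration of currency codes are assumptions for illustration.

```xml
<!-- Money has text content of a primitive type; "type" records the type name. -->
<!ELEMENT Money (#PCDATA)>
<!ATTLIST Money
          ccy  (USD | EUR) #REQUIRED
          type CDATA       #FIXED "float">
```

Because the attribute is fixed, document authors never have to write it. A parser that reads the DTD reports an instance such as <Money ccy="USD">100000</Money> as if it carried type="float", and a user interface could use that value to check input dynamically.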




Copyright (c) 1999-2006 Patricia Hallstein & Axel Kramer