Friday 17 October 2008

DSDL: initial thoughts

Prompted by some questions from Rick Jelliffe on the schematron list, I've been looking at DSDL lately. This is a multipart ISO/IEC standard defining various aspects of XML validation. The parts are at various stages, from fully ratified international standard to (I think) no existing draft at all. Below is a summary of collected thoughts on the various parts.

1: Overview

No great surprises, just an index into the following parts...

2: Regular-grammar-based validation - RELAX NG

For anyone working with document-oriented XML, Relax NG is the gold standard of schema languages. It is more expressive than either XSD or DTD, more naturally namespace aware and (especially in its compact syntax form) particularly easy to write.

3: Rule-based validation - Schematron

I think this is probably the oldest of the languages that were pulled together to form DSDL. It's stood the test of time. I've never really had occasion to use it much, but I have from time to time made experiments with schematron, including the first implementation of the HTML-based "schematron-report" version, and somewhat more recently some experiments on this blog.

4: Namespace-based validation dispatching language - NVDL

NVDL, or something like it, should perhaps form the basis of solving the perennial problem of forming "Compound Documents", such as including MathML in a host document type such as XHTML, or DocBook, or TEI, etc. I have nagging doubts, though, as it sometimes seems rather heavyweight just for validation. For any particular case (such as MathML in XHTML) it's usually possible to construct a compound schema directly. And as we found while looking at W3C CDF, in many contexts defining a compound schema is the easy part of defining a compound document format. The real problems lurk elsewhere, in defining the behaviour and inheritance of properties across the interface between the languages: such questions as whether the current font, or font size, should be inherited from the host document type by the embedded fragment.

However, problems of property inheritance and event propagation are rightly out of scope for a validation language, so this really is just musing on a perennial problem that we have with MathML in (anything), which NVDL doesn't really address.

5: Data Type Library Language - DTLL

I don't know; this is more or less the right thing, although I sent some minor comments on the use of XPath 2 (which the current draft avoids in favour of XPath 1).

But perhaps it's just too late. XSD datatypes are rather horrible and rather inflexible, but they more or less do the job most of the time, and even Relax NG users are by now in the habit of using xs:boolean and friends, as of course are XPath 2 users. Perhaps it is too late for this to ever gain traction, but perhaps not...

6: Path-based integrity constraints

This part appears to be on hold, with no public draft.

7: Character Repertoire Description Language - CRDL

CRDL, or CREPDL as it appears to be known now, is what sparked my current interest in DSDL. Specifically, Rick Jelliffe asked on the schematron list for code to convert a CREPDL specification into a schematron schema (which is effectively the same as converting it into one or more XPath expressions).

I sketched out a rough implementation in that thread, but actually I think that this is perhaps harder than it need be, as CREPDL is using too powerful a technology to express character ranges. Regular expressions are a highly efficient mechanism for specifying substrings, but CREPDL, as currently specified, really just specifies single characters. A character repertoire is just a partition of the Unicode code range into three (characters that are definitely in, definitely out, or maybe in the repertoire). However, if regexps were not used, a different syntax would have to be invented for character ranges, and I don't have any good suggestions here, so perhaps using regexps is OK, perhaps...
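
To make the "partition into three" view concrete, here is a minimal Python sketch. It is not CREPDL syntax and not the implementation from the schematron list thread: the range tables, the names Membership, classify and characters_outside, and the chosen code points are all made up for illustration. It only shows that, viewed this way, checking a repertoire is a per-character lookup rather than a substring match.

```python
from enum import Enum


class Membership(Enum):
    IN = "definitely in"
    MAYBE = "maybe in"
    OUT = "definitely out"


# Hypothetical repertoire, given as inclusive code point ranges.
# These values are illustrative only; they come from no CREPDL draft.
DEFINITELY_IN = [(0x20, 0x7E), (0xA0, 0xFF)]
MAYBE_IN = [(0x100, 0x17F)]


def in_ranges(code_point, ranges):
    """True if code_point lies in any of the inclusive (low, high) ranges."""
    return any(low <= code_point <= high for low, high in ranges)


def classify(char):
    """Place a single character in one of the three parts of the partition."""
    code_point = ord(char)
    if in_ranges(code_point, DEFINITELY_IN):
        return Membership.IN
    if in_ranges(code_point, MAYBE_IN):
        return Membership.MAYBE
    # Anything not listed is treated as definitely out in this sketch.
    return Membership.OUT


def characters_outside(text):
    """Return the characters of text that are definitely out of the repertoire."""
    return [c for c in text if classify(c) is Membership.OUT]
```

Calling characters_outside on the text of an element would then give the list of offending characters, which is roughly the information a generated schematron assertion would need to report.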

8: Document Schema Renaming Language - DSRL

It's difficult to know what to say about this section. WG1 recently published a Defect Report detailing some of the comments I'd raised on the public comment list. But really that list just scratches the surface. The specification as it stands is completely contradictory and unimplementable.

9: Datatype- and Namespace-aware DTDs

This seems to be a technically sound but doomed attempt to give a veneer of namespace respectability to DTD. Perhaps in 1998 this might have had a chance of taking off, but now, post Relax NG, I can't see the point. DTDs are not going to go away any time soon, despite predictions in some quarters (at NAG, for example, we use DTDs extensively), but if I want a namespace-aware grammar language I'd use Relax NG every time rather than a DTD syntax with a collection of processing instructions giving namespace bindings.

10: Validation Management

This part appears to be on hold, with no public draft.
