Wednesday 15 October 2008

XML 1.0 Fifth Edition

The W3C is about to publish XML 1.0 5th edition, that is assuming that my (and others') objections are overruled.

This sets a really terrible precedent, and sadly puts XML into a similar state to HTML, where the specification will be widely ignored (as it will be inconsistent) and people will have no choice other than to test against a collection of major implementations and do whatever they do. The position for HTML is so bad that the editor for HTML 5 is on record as saying that the HTML 4 specification is essentially irrelevant to HTML5, and HTML5 is instead based on a formalisation of existing implementation behaviour. XML was intended to move markup languages away from such "tag soup" and base everything on a well specified foundation.

XML 1.1 changed the rules for XML names (in a good way), allowing a very much more open set of characters to be used in XML names. However, XML 1.1 has not had wide take-up, and so the XML Core WG has decided to resort to the subterfuge of changing the XML Recommendation in place by introducing a fake erratum that changes the Name production.
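
To make the change concrete, here is a minimal Python sketch of a Name check using the character ranges that the 5th edition text gives for NameStartChar and NameChar (the same ranges as XML 1.1). The function and constant names are illustrative assumptions, not part of any specification or parser API, and the sketch checks only the Name production, nothing else about well-formedness.

```python
# Code point ranges for NameStartChar, as listed in XML 1.0 Fifth Edition.
NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Extra ranges that NameChar allows after the first character.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),        # "-" and "."
    (0x30, 0x39),        # 0-9
    (0xB7, 0xB7),        # middle dot
    (0x300, 0x36F),      # combining diacritical marks
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_name_5e(s):
    """Hypothetical helper: does s match Name under the 5th edition ranges?"""
    if not s:
        return False
    if not _in_ranges(s[0], NAME_START_RANGES):
        return False
    return all(
        _in_ranges(c, NAME_START_RANGES) or _in_ranges(c, NAME_EXTRA_RANGES)
        for c in s[1:]
    )
```

Under these ranges a name beginning with, say, U+1200 (an Ethiopic syllable) is accepted. Whether a given parser written against an earlier edition accepts it is exactly what would have to be tested.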

There is an attempt to trivialise my and similar objections as process objections. Clearly it is a gross abuse of process, but that is not the main point of the objection. It is a technical issue. The 5th edition places every specification that refers to XML into a completely unspecified status. Do the features of the language use the original XML 1.0 production or the incompatible one in the 5th edition? I asked for a simple yes/no answer to the question of whether it would be conformant to use the new characters in XPath. It is clear from the reply that even members of the XML Core (and W3C TAG) groups cannot say definitively whether a single XPath step using such a character is conformant or not. If Henry Thompson can't answer this, how can anyone expect a normal developer to know the answer? The issue is not restricted to XPath; the same lack of clarity surrounds simple questions as to whether IDs using such a character are valid in SVG, or DocBook, or any other language you care to name.

No doubt the development community will recover and make things work, but as I said above, users will have to go by what the implementors do; they will no longer be able to go by the specifications. That is a shame, and it might yet have bad consequences.

At the very least the TAG ought to update its finding on versioning strategies to explain how, if a user community shows some resistance to using a new version, a useful approach is to remove choice by making incompatible changes in place, but without changing either the major or minor version number.


Colin said...

One result of this terrible decision is that it will be impossible to claim conformance to any W3C recommendation anymore. I have been considering how to do this for Gestalt 1.1 (when it's ready).

The only formulation I could come up with is something like:

"conforms to the 127th edition of XSLT 2.0, as published on 23rd November 2203".

That way no one could prove me wrong until long after I'm dead. Of course, such a declaration is useless for users - they don't know where they stand.

So a practical consequence is that I will make no further attempts to conform to W3C recommendations, and do whatever I feel like.

Of course, XML 1.0 fifth edition is illegal (it's a straight lie to call it XML 1.0). That is, I believe a determined attempt to challenge it in the courts could yield money.

Alex Brown said...

David hi

Well said. This is potentially a very bad development for XML users.

Do you think it is possible to work around this (and other) problems with formal profiling: ?

And would you be interested in editing such a standard? :-)

- Alex.

Anonymous said...

I feel a bit like I'm contributing to one of the US election blogs---the ratio of heat to light is a bit off where it should be. Here's what David asked, and I answered, wrt XPath:

> Could you answer a simple yes/no
> question, as to whether on the day
> the 5th edition is published whether
> an xpath statement that uses xpath
> with a "new" Name is conformant to
> the W3C specifications or not?

> XPath2 references

> World Wide Web Consortium. Extensible
> Markup Language (XML) 1.0. (Third
> Edition) W3C Recommendation. See

Well, that's a problem, isn't it :-) ?
It _names_ a specific version, but
_points to_ the undated version. If
you take the name as definitive, then
it remains conformant. If you take the
pointer as definitive, it could be seen
as becoming non-conformant.

I say 'could be', because (and my
colleague Michael Sperberg-McQueen is
the expert on this, I'll try to channel
him on this), specs which (name and)
refer to undated versions should do so
in a way which allows implementations
to track the resulting changes in an
implementation-defined way, that is,
all they have to do to be conformant is
to identify which version they are
(currently) supporting. That makes
sense to me.

So, the reason I was equivocal was _not_ because the situation wrt XML 1.0 is unclear, or broken, but because there's a bug in the XPath 2 spec. I should have added that wrt XPath 1, the situation is perfectly clear: The normative definition of paths references the definition of names in the _undated_ version of XML Namespaces, which in turn references the definition of name characters in the _undated_ version of XML 1.0, so yes, if XML 1.0 5th edition is published as a REC, on that day new names become conformant in XPath 1.0 paths, and implementations may choose to upgrade.

David Carlisle said...

Henry, I think your answer is deeply troubling. You imply that for specifications that happened to have referenced a dated spec, it is clear that the original production applies.

So the answer to the question "can I use a new character in an ID" does, by your interpretation, depend on the fine print of the references section of the language being used.

Implementations are not going to do that. If you are using id() in XPath or similar functionality in other APIs, or simply DTD or XSD validation, software is going to use the same XML parser for all XML languages. It is not going to switch in an XML 4th edition parser just because it has to validate IDs in a fooML document and fooML references the 4th (or 1st) edition.

So in practice all users will have to ignore the specifications and just test on an application by application basis to see what has been implemented. This is bad. It goes completely against the point of having a standard at all.

You say

on that day new names become conformant in XPath 1.0 paths, and implementations may choose to upgrade.

But that is not really an accurate description of the situation (as you interpret it). What your interpretation says is:

on that day new names become conformant and any XPath 1 implementation that raises an error on such names is non-conformant. There is no "choice" involved here. You have deliberately removed choice by making the change in place.

David Carlisle said...

. I should have added that wrt XPath 1, the situation is perfectly clear: The normative definition of paths references the definition of names in the _undated_ version of XML Namespaces, which in turn references the definition of name characters in the _undated_ version of XML 1.0,

Even here, things are not so clear. The reference to Name in the text of the namespace spec is linked to the undated XML rec, however the only normative reference to XML from XML namespaces is to the dated 4th edition.

I suppose you could say that again referencing the dated and undated version of xml 1.0 is a bug in the namespace spec, a bug shared with svg, xml canonicalisation, xpath 2, xslt, xhtml 1, ...

Would you agree that it is clear that any document using the new characters can not be used with XSD 1.0 as both parts 1 and 2 make normative references to the dated 2nd edition of xml 1.0?

Anonymous said...

I'm a total outsider to this discussion (came here from a Java programming blog), but the proposed change seems disturbing to me.

If I were working on an end user application using a XML 1.0 5th edition parser, I'd want to put a big bold warning up front in my user documentation: any XML 1.0 data this app produces may or may not work with applications targeted to XML 1.0 4th edition or earlier. Not fun. Vendors will have to make this qualification to their users and quite possibly put warning annotations in their data -- which is supposed to be the purpose of a version number in the first place.

I totally agree that XML should allow Unicode data in element and attribute names, but this is a strange way to do it. The reasoning I've read is basically "people aren't adopting XML 1.1, so we're going to make a change to XML 1.0". It seems disingenuous to me. I can't imagine such an incompatible parser change being retrofitted to, say, C99. If we need the features of XML 1.1, we should demand applications to use XML 1.1.

This doesn't affect me (yet), and other people have expressed my reservations in a better way, but I have to get it out there.

Liam Quin said...

Tom, note that none of this is about program-generated XML: your program knows what it's generating, and has no need for silly warnings: don't generate XML that uses non-ASCII characters in names or ID values and you'll be fine -- and if you are a native speaker of English and writing a program with XML as an interchange format, this is both the usual approach and very sensible.
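
As a rough illustration of that approach, a generating program could check each element name, attribute name and ID value against a conservative ASCII-only pattern before writing it out. This is only a sketch: the function name and the exact pattern are assumptions chosen for the example, and the pattern is deliberately narrower than any edition's Name production (it leaves out the colon, which is reserved for namespace prefixes).

```python
import re

# Conservative ASCII-only subset of XML names: a letter or underscore first,
# then letters, digits, underscore, hyphen or full stop.
_ASCII_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_ascii_safe_name(name: str) -> bool:
    """Hypothetical helper: True if name uses only the ASCII subset above."""
    return _ASCII_NAME.fullmatch(name) is not None
```

Names that pass this check use only characters that have been name characters in every edition of XML 1.0, so the edition of the consuming parser should not matter for them.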

As to whether ISO has made changes to specifications, of course they have; there may or may not have been retrospective changes to C99, I don't know -- sometimes such changes are published as "normative errata" and on at least one occasion for an ISO spec as a "technical corrigendum".

I personally hope we don't start releasing larger changes in this way; the change to the version number rules introduced in XML 5e may mean we could do an XML 1.2 at some point in the future and have it actually fly. But we aren't there today.

The actual impact on other specifications, including MathML, XSLT, XPath, XQuery, SVG and others, remains to be seen. Specs that have a fixed vocabulary, such as SVG, might be affected by ID values, although it's unlikely: for the most part, the people needing to use the new characters will at first know what they are doing, and by the time others catch up we hope for widespread deployment. Specs, like XQuery, that rely on XML names as part of their syntax, mostly also allowed the use of XML 1.1. And it turns out that some XSLT 1 processors at least already accepted the characters from XML 1.1. So pragmatically it may fly, even though everyone knows metal cylinders can't take off the ground :-)