Thursday, 29 November 2007

XML Entity definitions for Characters

In many contexts people find it convenient to enter characters that are not on the keyboard as entity references, such as &rightarrow; to get an arrow, rather than remembering which keyboard shortcut or numeric reference (&#x2192;) would produce it. In many cases, life would be simpler if people did not do this. Having entity references means that you need a <!DOCTYPE declaration referencing a DTD that defines the entities, and you need your XML parser to read that DTD. It also makes processing fragments of XML much harder. Either the fragments do not have a <!DOCTYPE (in which case they are not, themselves, well formed) and the fragment pasting operation needs to ensure that a suitable DTD reference is placed on the target document, or the fragments do have a <!DOCTYPE, and the fragment pasting needs to strip it off and still ensure that the target document has a compatible DTD.
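The difference can be seen with a small experiment. This is only a sketch using Python's standard-library xml.etree.ElementTree parser; the helper name parses and the sample <p> fragments are made up for illustration. A fragment using the named entity with no <!DOCTYPE is rejected, the numeric reference is accepted anywhere, and the named entity is accepted once a declaration for it is in scope.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative fragments (not taken from any real document).
named = "<p>a &rightarrow; b</p>"    # named entity, no DOCTYPE
numeric = "<p>a &#x2192; b</p>"      # numeric character reference
declared = (
    '<!DOCTYPE p [<!ENTITY rightarrow "&#x2192;">]>'
    "<p>a &rightarrow; b</p>"        # same entity, declared in the document
)

print(parses(named))     # the undeclared entity is an error
print(parses(numeric))   # numeric references need no declaration
print(parses(declared))  # works once the entity is declared
```

The numeric form survives being pasted into any document, which is exactly what the named form cannot promise.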

If fragments are being moved from one place to another this can be difficult. Consider moving a fragment of MathML from an XHTML document to Docbook, for example. XHTML and Docbook define entities that in several cases have the same name but different definitions. (The original ISO definitions of the entity names did not give definitions in terms of Unicode characters, and older versions of Unicode did not have sufficient technical symbols to give sensible definitions for most of these names.)

All of which preamble is just leading up to saying that Unicode 5.1 (Beta) does have suitable characters for all the ISO and MathML entities.

The Entity draft at http://www.w3.org/2003/entities has thus been updated to a new "2007" version, which we (the W3C Math WG) hope to submit to W3C as a new Recommendation track document shortly, but you can view my Editor's Draft here.

MathML3 will hopefully use these by reference. If (X)HTML (and possibly other non-W3C systems such as Docbook) could do the same, then hopefully we would finally have a set of entity names with widespread consistent use across multiple languages. Hopefully.

Over the years I've been maintaining these sets we've kept fairly regular contact with the STIX group, and the tables of characters in the above document include characters typeset with the STIX Fonts (if you have them installed, and they work in your browser). The plane 1 characters still fail for me in all browsers on Windows.

Comments are welcome, either in this blog, or better on www-math@w3.org.

1 comment:

John Cowan said...

However, if an entity is declared in the internal subset, an XML parser must understand and process it. One could provide IDE support that would notice which entities you actually use and insert their definitions into a mini-DTD of the document. Everybody wins, except for troglodytes like me who don't use IDEs, but I can cope.
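
The mini-DTD idea in this comment might be sketched roughly as follows. This is an illustration only, not an existing tool: the function add_internal_subset, the two-entry ENTITY_TABLE, and the sample fragment are all hypothetical, and a real implementation would load the full entity set rather than this stand-in. The scan is a plain regular expression, so it would also pick up entity-like text inside comments and CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a full entity table (name -> replacement text).
ENTITY_TABLE = {
    "rightarrow": "&#x2192;",
    "alpha": "&#x3B1;",
}

# The five entities every XML parser already knows.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Named references only; numeric ones (&#...;) do not match.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")


def add_internal_subset(fragment, root_name):
    """Prefix fragment with a DOCTYPE declaring only the entities it uses."""
    used = sorted(set(ENTITY_REF.findall(fragment)) - PREDEFINED)
    unknown = [name for name in used if name not in ENTITY_TABLE]
    if unknown:
        raise ValueError("no definition for: " + ", ".join(unknown))
    if not used:
        return fragment
    decls = "".join(
        f'<!ENTITY {name} "{ENTITY_TABLE[name]}">' for name in used
    )
    return f"<!DOCTYPE {root_name} [{decls}]>{fragment}"


doc = add_internal_subset("<p>&alpha; &rightarrow; &amp;</p>", "p")
print(doc)
print(ET.fromstring(doc).text)
```

Because the declarations travel inside the document, a non-validating parser that never fetches an external DTD can still resolve the names.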