Sunday 29 April 2007

XHTML and MathML from OpenOffice.org 2.2

In the comments on Murray's blog I noted that, given a fixed stylesheet that does the conversion, the workflow to extract MathML from an XML file that contains MathML is pretty much the same as the workflow for one that contains some XML transformable to MathML.

This was in reply to a comment that Science (and it turns out, Nature) don't accept Office 2007 native format, partly at least due to a perceived lack of MathML support.

So I thought I'd repeat my recent experiment using OpenOffice.org rather than MS Office 2007.

I took the test1.docx file used as the test file in the earlier post, saved it (using Office 2007) as .doc (which warned me that the mathematics would get turned into pictures) and then imported it into OpenOffice.org, regenerated the mathematics using the formula editor provided, deleting the Word generated images, and saved as .odt. Now the aim of the game is to get an XHTML+MathML document, starting from here...

Step 1
Save as html. This produces a passable approximation to HTML (not valid, but not quite so fanciful as the output from MS Word). The mathematics is all images and, unlike Word's output, there is no comment or other markup (not even an alt attribute) to give an alternative format. However, there is a name attribute which gives the original object name, so we can retrieve the MathML from the .odt file.
Step 2
Unzip the odt file.
Step 3
Add an (empty) file math.dtd to several subdirectories extracted from the zip file, to make the content.xml files containing MathML well formed. (A catalogue would be a more sensible alternative to supplying multiple copies of the dtd).
Step 4
Run a small stylesheet that imports htmlparse.xsl to convert the HTML to XHTML, replaces the math images by their matching MathML fragments, and does a bit more cleanup on the xhtml to make it valid.
Step 5
There is no step 5:-)
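
Steps 2 and 3 are easy to script. The following is only a minimal sketch, written in Python rather than XSLT so that it can be run without an XSLT 2.0 processor. The function name unpack_odt is made up for illustration, and it assumes that each formula sits in a subdirectory called "Object <n>", which is the layout the stylesheet's doc() call relies on.

```python
import zipfile
from pathlib import Path


def unpack_odt(odt_path, out_dir="odt-unzip"):
    """Unzip an .odt file and add an empty math.dtd beside each embedded formula.

    Returns the list of math.dtd files, in sorted order.
    """
    out = Path(out_dir)
    with zipfile.ZipFile(odt_path) as odt:
        odt.extractall(out)  # step 2: an .odt file is an ordinary zip archive

    created = []
    # Assumption: formulas live in subdirectories named "Object <n>".
    for content in sorted(out.glob("Object*/content.xml")):
        dtd = content.parent / "math.dtd"
        dtd.touch()  # step 3: an empty DTD, so the content.xml file can be parsed
        created.append(dtd)
    return created
```

Calling unpack_odt("test1.odt") would leave the tree in odt-unzip, ready for the stylesheet in step 4. A catalogue would still be the tidier solution.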

The stylesheet in step 4 looks like this:

<xsl:stylesheet 
    version="2.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:h="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="data:,dpc"> 
  <xsl:import href="../htmlparse/htmlparse.xsl"/>
  <xsl:template name="main">
    <xsl:processing-instruction name="xml-stylesheet"
     >type="text/xsl" href="pmathml.xsl"</xsl:processing-instruction>
    <xsl:apply-templates select="d:htmlparse(
     unparsed-text('test1oo.html','windows-1252'))"/>

  </xsl:template>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@align|@valign|@dir|@frame|@rules">
    <xsl:attribute name="{name()}" select="lower-case(.)"/>
  </xsl:template>

  <xsl:template match="h:img[starts-with(@name,'Object')]">
    <xsl:copy-of select="doc(concat('odt-unzip/Object ',
                             substring-after(@name,'Object'),
                             '/content.xml'))"/>
  </xsl:template>
</xsl:stylesheet>

The result of the transformation looks like this.
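
The lookup done by the last template in the stylesheet is just string manipulation on the name attribute of each image. As a rough illustration (Python here, not the XSLT actually used; formula_path is a hypothetical name, and the sample name "Object1" is an assumption about what OpenOffice.org writes into the HTML):

```python
def formula_path(img_name, root="odt-unzip"):
    """Map an image name such as 'Object1' to the matching content.xml path.

    Returns None for images that are not formula objects.
    """
    if not img_name.startswith("Object"):  # mirrors starts-with(@name, 'Object')
        return None
    suffix = img_name.split("Object", 1)[1]  # mirrors substring-after(@name, 'Object')
    return f"{root}/Object {suffix}/content.xml"


print(formula_path("Object1"))  # odt-unzip/Object 1/content.xml
```

The only subtlety is the space: the directory inside the zip file is "Object 1" while the sketch assumes the name attribute has no space.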

Comparing the stylesheet here with the one in the earlier post you'll see that the complexity in either case is about the same. Word writes all the information into a single file (as comments) which can be a bit more convenient, but OpenOffice.org places the MathML directly in the zip file which may also have advantages. Word's HTML output takes a lot more cleaning up to be valid HTML (or XHTML) but there are tools around that do that, more or less.

Rob Weir has a recent blog entry picking up on the MathML support as a reason to go with ODF. I don't buy the argument, to be honest. It's not surprising that publications are not set up to accept the zipped xml formats from any of these systems. Publishers' in-house document processes are usually somewhat complicated and fine tuned. Tooling up to accept any new format will take them time. There are plenty of reasons to argue over which format is better in any given situation, but a general statement that storing MathML is good, and storing something transformable to MathML is bad, doesn't really convince me of anything, sorry!

Friday 27 April 2007

MathML 3.0 and MathML for CSS

Two new working drafts were issued today: MathML 3.0 and A MathML for CSS profile. These drafts are the first public working drafts by the current incarnation of the Math Working Group. No doubt I will be posting more about those later; comments are of course welcome either here or (better) on the www-math list.

Monday 23 April 2007

htmlparse, updated

There are two well known open source tools for cleaning up and fixing HTML to produce XML suitable for further processing, tidy and TagSoup.

Back in 2004 I posted some XSLT code, htmlparse.xsl, to do a similar job and used it in a few examples on the xsl-list.

When I needed such a tool for the MathML from Office post, my first thought was to use the htmlparse stylesheet. Unfortunately the output from Word was too much for the stylesheet, which couldn't cope with the Microsoft conditional declarations.

However it turned out not to be hard to extend the stylesheet, so I have updated htmlparse.xsl to fix a couple of bugs in attribute handling, and to extend it to cope with <![if !vml]> and friends. This now gives a pure XSLT route of getting from Word output to XHTML+MathML. Using the same test file as the earlier post, the stylesheet:

<xsl:stylesheet 
   version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:d="data:,dpc"> 
  <xsl:import href="../htmlparse/htmlparse.xsl"/>
  <xsl:import href="xhtml-mathml.xsl"/>
  <xsl:template name="main">
    <xsl:apply-templates select="d:htmlparse(
              replace(
                  unparsed-text('test1.htm','windows-1252'),
                  'http://www.w3.org/TR/REC-html40',
                  'http://www.w3.org/1999/xhtml'))"/>
  </xsl:template>
</xsl:stylesheet>

produces an XHTML+MathML document as shown.

Although it's used here to convert an entire document, TagSoup, which interfaces to XSLT as a SAX parser, is probably preferable in this context. htmlparse (originally written just as an exercise in XSLT2 regular expressions) is probably more useful for parsing small fragments of html, as often found in query strings or embedded in CDATA in XML files. It has the advantage of being pure XSLT, not requiring any extension functions or non-standard parser usage.

2009-12-18 updated location of htmlparse to google code

Wednesday 18 April 2007

Testing the emacs interface

Posting from emacs, editing existing post from emacs

MathML in HTML

As mentioned in an earlier post, I'm experimenting with Peter Jipsen's script for using MathML in posts served as text/html. The original script assumed a MathML-enabled browser; I've just added some conditional code so that if Gecko or MathPlayer are not detected, some CSS, mainly due to George Chavchanidze, is inserted. It doesn't do all of presentation MathML yet, but it renders fractions, msqrt and superscripts well enough to make the above post more or less legible in Opera. Things need a bit more tuning yet (I seem to have introduced a noticeable delay before the mathematics is rendered, in all browsers), but progress, I think...

Saturday 14 April 2007

Schematron Experiments

Rick Jelliffe, acting as editor of ISO Schematron, has a call for corrections and improvements to schematron, and is also working on updating the XSLT skeleton implementation; see this thread on the schematron-love list. This got me looking at schematron again, after a gap of some years...

The one suggestion that I made for a specification change is to allow schematron contexts to match text nodes. The current situation is very strange: schematron contexts (in XSLT1 or XSLT2 bindings) have (exactly) the syntax of XSLT patterns, but with an extra semantic restriction (not made explicitly in the specification) that any text nodes that are matched are ignored. The reference schematron implementation (current and previous versions, as far as I have seen) has implemented this inconsistently: it applies schematron rules to text nodes (because the default template for each pattern selects on node()), but not if the parent of the text node is the context of a schematron rule, as then the recursion is via *|comment()|processing-instruction(). Rick has put this forward as an official change request to the working group, but seems unconvinced by the change; we'll see!

Meanwhile, on to the code. I've not looked at the schematron implementation since I made the schematron-report (which google tells me was back in 1999, doesn't time fly :-). I think it's changed quite a bit since then, especially to accommodate the "skeleton" paradigm, but the basics are the same.

A trial implementation

There's been quite a bit of discussion on the schematron-love list about implementation changes, and posted code, but a mailing list isn't always the best place to discuss code samples, so I thought I'd try posting here.

Schematron allows you to specify several patterns, each containing rules; each rule has a context, at which several assertions can be made using XPath expressions. As noted above, a context is more or less identical to an XSLT pattern, and each rule in a pattern is implemented as an XSLT template with a unique priority and a mode corresponding to the pattern. For each pattern in the schematron schema, a full recursive tree walk is made via templates which process a node and then recursively process attributes and children. There are optimisation flags to try to speed this up, in particular by omitting attribute processing, and the specification that text nodes should not be visited is partly due to a concern over this processing speed.

I argued on the list that visiting text nodes was not a serious time concern compared to other issues, and certainly not enough to distort the language. But for once I decided to heed Mike Kay's oft cited advice on xsl-list that performance questions should be answered by measuring timings.

I have made a modified schematron skeleton implementation that takes a number of parameters that control the way that the schematron patterns are implemented. Also available is a test document and test schema used in the test below.

The parameters controlling the generation of the XSLT are described below.

visit-text
This parameter defaults to 'false', at which setting text nodes are not visited when establishing the contexts for schematron patterns. This is the behaviour currently mandated by the specification. If it is set to 'true', text nodes are visited, which is the behaviour put forward as a suggestion for a future schematron update. In some cases omitting text nodes is implemented by selecting *|comment()|processing-instruction() rather than node() (this does appear to save some time), but in other cases it is necessary to add a filter [not(self::text())] to explicitly skip text nodes, in which case not visiting text nodes may take extra time due to the extra check. Compare the times for runs 2 and 3, and 6 and 7, below. In all cases, though, on this test file the difference is only around 50ms in 1000, so not really significant enough to affect the language design.
attributes
Normally when recursing down the tree both the attribute and child nodes are selected; setting this to 'false' will cause the attributes to be omitted (the existing iso_schematron_skeleton has a similar parameter). One difference is that this defaults to 'true' if it is safe to do so (if no context in the schema contains '@' or 'attribute').
only-child-elements
Similar to the attributes parameter above, this causes a test of node() to be changed to * so that just elements are visited on the child axis. Again this defaults to 'true' if it is safe to do so (no context contains a '(').
select-contexts
Perhaps the most radical option. Schematron has traditionally been implemented by a walk over the tree where each step is accomplished by a template which processes a node, and then recursively applies templates on child nodes and attributes. However this recursion is tail recursive and so is effectively an iteration, so an alternative implementation strategy is to first select all the nodes that need to be processed, and then process them with a non-recursive template that processes a single node and does not call apply-templates.
This defaults to '', which enables the classic recursive behaviour.
If set to '//' then an XPath is constructed from all the rule contexts in a pattern and used in a single select, so if a pattern has rules with contexts a, b and c, a select="//(a|b|c)" is generated. This method (on this test) leads to the quickest processor, but the difference over the default behaviour is not that great, and unfortunately it requires the XSLT2 binding to allow the general step using (). To generate an XPath 1 expression would require more parsing of the context than is really convenient in XSLT.
If set to 'key' the behaviour is similar to the behaviour with '//', but rather than use an XPath using // an XSLT key is generated. This has the advantage that it is much easier to generate legal XSLT1, and my intuition was that this would be the fastest so long as memory for the key did not get out of hand. It seems that my intuition here was wildly, spectacularly wrong. As can be seen from the timings below, runs 4 and 5, using this key setting, take around 20 times longer.
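
To make the difference between the classic walk and the select-contexts='//' strategy concrete, here is a small sketch in Python. It is not the skeleton code. It assumes that every rule context is a plain element name (real contexts are full XSLT patterns), and because ElementTree has no XPath 2.0 support, a filter over iter() stands in for the generated select. All three function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def select_expression(contexts):
    """Build the single XPath 2.0 select generated when select-contexts is '//'."""
    return "//(" + "|".join(contexts) + ")"


def walk(node, contexts, visited):
    """Classic strategy: visit every element, recursing into its children."""
    if node.tag in contexts:
        visited.append(node.tag)  # stands in for firing the matching rule
    for child in node:
        walk(child, contexts, visited)


def select_then_process(root, contexts):
    """Alternative strategy: gather the matching nodes first, then process each once."""
    return [el.tag for el in root.iter() if el.tag in contexts]


doc = ET.fromstring("<book><a><b/></a><x><c/><a/></x></book>")
seen = []
walk(doc, {"a", "b", "c"}, seen)
print(select_expression(["a", "b", "c"]))  # //(a|b|c)
print(seen == select_then_process(doc, {"a", "b", "c"}))  # True
```

Both strategies reach the same nodes in document order; the only question is which one a given XSLT processor executes faster, which is what the timings are meant to show.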

One unrelated and largely cosmetic change: the schematron implementation needs to generate unique priorities for each template that it generates. The exact numbers do not matter, just the relative ordering. The existing ISO implementation counts down from 4000. The 4000 always bothered me (I don't know why, really) so I have changed it here to count up from 1000 (keeping the relative ordering). Strictly speaking, counting down from 4000 isn't safe, as if you have 4000 templates the priorities specified would be less than the priorities specified on default templates in the same mode; however, the real reason is cosmetic. (I was tempted to count up from 1 and lower the default templates more, but I started from 1000, just to give them all 4 digit numbers.)
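
One possible numbering scheme of this kind is sketched below in Python. The exact formula used in the modified skeleton is not reproduced here, so rule_priorities and its base parameter are purely illustrative. The sketch assumes that earlier rules in a pattern must take precedence over later ones, so they receive the higher numbers.

```python
def rule_priorities(rule_count, base=1000):
    """Return one template priority per rule, in schema order.

    Assumption: the last rule gets `base` and each earlier rule gets a
    higher number, so the lowest priority never drops below `base`.
    """
    return [base + rule_count - 1 - i for i in range(rule_count)]


print(rule_priorities(3))  # [1002, 1001, 1000]
```

However many rules there are, the smallest priority stays at the base value, which is the property that counting down from a fixed number cannot guarantee.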

Results for one test document

The following script was used. It's a bash script, but it can just be viewed as a sequence of command line instructions, and equivalent code could be used in any operating system.

Run 1 uses the current schematron beta skeleton as a base comparison; runs 2 to 7 use the modified implementation with various combinations of the parameters described above.

The schematron output (log1.txt ... log7.txt) is checked with the unix diff command: no output here is good, as it means that the output is the same in each case. The -3 flag to saxon causes it to run the code three times and report the average time (thus avoiding measuring JVM startup and stylesheet compilation times).

#!/bin/bash

# Generate one validator stylesheet per parameter combination.
rm -f temp?.xsl
saxon8 -novw -o temp1.xsl dpc.sch \
   iso_schematron_skeleton.xsl
saxon8 -novw -o temp2.xsl dpc.sch \
   dpc_schematron_skeleton.xsl
saxon8 -novw -o temp3.xsl dpc.sch \
   dpc_schematron_skeleton.xsl visit-text=true
saxon8 -novw -o temp4.xsl dpc.sch \
   dpc_schematron_skeleton.xsl select-contexts=key
saxon8 -novw -o temp5.xsl dpc.sch \
   dpc_schematron_skeleton.xsl visit-text=true \
                               select-contexts=key
saxon8 -novw -o temp6.xsl dpc.sch \
   dpc_schematron_skeleton.xsl select-contexts=//
saxon8 -novw -o temp7.xsl dpc.sch \
   dpc_schematron_skeleton.xsl visit-text=true \
                               select-contexts=//

# Validate the test document with each one; timings go to stderr.
rm -f log?.txt logg?.txt
saxon8 -3 -novw -o log1.txt book.xml temp1.xsl 2> logg1.txt
saxon8 -3 -novw -o log2.txt book.xml temp2.xsl 2> logg2.txt
saxon8 -3 -novw -o log3.txt book.xml temp3.xsl 2> logg3.txt
saxon8 -3 -novw -o log4.txt book.xml temp4.xsl 2> logg4.txt
saxon8 -3 -novw -o log5.txt book.xml temp5.xsl 2> logg5.txt
saxon8 -3 -novw -o log6.txt book.xml temp6.xsl 2> logg6.txt
saxon8 -3 -novw -o log7.txt book.xml temp7.xsl 2> logg7.txt

echo diff
diff log1.txt log2.txt
diff log1.txt log3.txt
diff log1.txt log4.txt
diff log1.txt log5.txt
diff log1.txt log6.txt
diff log1.txt log7.txt

echo grep
grep Average logg*txt

And the results are:

diff
grep
logg1.txt:*** Average execution time over 3 runs: 1078ms
logg2.txt:*** Average execution time over 3 runs: 1052ms
logg3.txt:*** Average execution time over 3 runs: 1062ms
logg4.txt:*** Average execution time over 3 runs: 21843ms
logg5.txt:*** Average execution time over 3 runs: 21885ms
logg6.txt:*** Average execution time over 3 runs: 1047ms
logg7.txt:*** Average execution time over 3 runs: 995ms

The default behaviour (2) of this script is a little faster than (1), probably because it visits fewer text nodes to more fully implement the current standard. When all text nodes are visited in run (3) it's slightly slower. As noted above, (4) and (5) are unusable; unless this turns out to be a bug in either my code or the XSLT processor, this indicates that the 'key' option should be removed from any final release. (6) and (7), using // and respectively not visiting or visiting text nodes, turn out to be the fastest here, but probably not sufficiently different to make it worth having this as an option in a final release, unless perhaps other people with real documents who have found the classic recursive method slow report this as faster. (I suspect some documents will be faster, some slower, depending greatly on the documents and the underlying XSLT processor.)
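
For anyone repeating the test, the comparison can be read straight off the grep output. The following Python sketch uses the figures reported above; the helper name parse_times is made up, and the regular expression simply assumes the log line format shown in the results.

```python
import re

LOG = """\
logg1.txt:*** Average execution time over 3 runs: 1078ms
logg2.txt:*** Average execution time over 3 runs: 1052ms
logg3.txt:*** Average execution time over 3 runs: 1062ms
logg4.txt:*** Average execution time over 3 runs: 21843ms
logg5.txt:*** Average execution time over 3 runs: 21885ms
logg6.txt:*** Average execution time over 3 runs: 1047ms
logg7.txt:*** Average execution time over 3 runs: 995ms
"""


def parse_times(text):
    """Map run number to average time in milliseconds."""
    pattern = re.compile(r"logg(\d+)\.txt:.*?: (\d+)ms")
    return {int(run): int(ms) for run, ms in pattern.findall(text)}


times = parse_times(LOG)
for run, ms in sorted(times.items()):
    # Ratio against run 1, the unmodified skeleton.
    print(run, ms, round(ms / times[1], 2))
```

On these figures the 'key' runs come out at roughly 20 times the baseline, while visiting text nodes changes the time by 10ms (runs 2 and 3) and 52ms (runs 6 and 7).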

Licence

Most of the code in the modified skeleton is Copyright Rick Jelliffe, and the code is distributed under the same licence as his implementation of which this is just a minor modification. The comments at the top of the file say:

<!-- DPC
Modified schematron skeleton code by David Carlisle.
http://dpcarlisle.blogspot.com/search/label/schematron

All modifications are marked with XML comments starting
with " DPC".

The majority of the code is copyright
Rick Jelliffe and Academia Sinica Computing Center
distributed under the license below, see "LEGAL NOTICE".
The modified sections of code are the work of David Carlisle and are
distributed under the same licence. Rick Jelliffe or others are free to
incorporate David Carlisle's code back in to other schematron implementations
which use this licence, for any such incorporation, specific attribution is
not required, although the attribution style used for earlier contribution
is appreciated.

This is a test implementation for discussion on the schematron mailing list
suggesting some possible changes and enhancements for a schematron implementation.
People are welcome to try the code and comment, but for production use you are
strongly advised to use the "unofficial reference" implementation from schematron.org
This is not intended to be a long term "competing" implementation.
-->

Tuesday 10 April 2007

XHTML and MathML from Office 2007

All code and examples are now available from google code repository 2013/09/26

xhtml-mathml stylesheet updated 2007/05/09

I commented on Murray's blog that it ought to be possible to get XHTML/MathML documents out of Word. Having speculated that it ought to be possible, conscience dictated that I try to do it. The results are described below.

The first problem was that I'd never used Word (or any other WYSIWYG editor). It seems strange to me, but then I've been using emacs so long I'm probably corrupted.

Word 2007 has MathML input/output (via an XSL stylesheet installed with the system), and has HTML input/output (via its save as web page file menu), so the plan of action is: save the document as html, clean it up to xhtml, using the stylesheet to convert the mathematics to MathML at the same time.

  1. Write your document in Word 2007, save as web page file.htm.
  2. Use tagsoup to get some usable XML from this output.
     java -jar tagsoup-1.1.jar --lexical --output-encoding=iso-8859-1 file.htm > temp.xml
  3. Use the supplied xhtml-mathml stylesheet to do some further cleanup and apply the Microsoft supplied omml2mml.xsl stylesheet to the math fragments.
     java -jar saxon8.jar -o file.xml temp.xml xhtml-mathml.xsl
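
If you have more than one document to convert, steps 2 and 3 can be wrapped in a few lines of script. This is only a sketch: the two command lines are the ones given in the list, and it assumes that java is on the path and that the jar files and stylesheet are in the current directory. The function names are hypothetical.

```python
import subprocess


def conversion_commands(stem):
    """Return the TagSoup and Saxon command lines for <stem>.htm, in order."""
    tagsoup = ["java", "-jar", "tagsoup-1.1.jar", "--lexical",
               "--output-encoding=iso-8859-1", f"{stem}.htm"]
    saxon = ["java", "-jar", "saxon8.jar", "-o", f"{stem}.xml",
             "temp.xml", "xhtml-mathml.xsl"]
    return tagsoup, saxon


def convert(stem):
    """Run both steps; TagSoup writes to stdout, so capture that in temp.xml."""
    tagsoup, saxon = conversion_commands(stem)
    with open("temp.xml", "wb") as out:
        subprocess.run(tagsoup, stdout=out, check=True)
    subprocess.run(saxon, check=True)
```

With check=True a failure in TagSoup stops the run before Saxon is given a half-written temp.xml.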

Example:

  • Word document (docx)
  • "HTML" generated by Word
  • XHTML + MathML

Comments welcome!

Monday 9 April 2007

testing the interface

Just testing...

I probably won't be posting too much MathML here, but I wanted to check it was possible. Unfortunately I think blogger always serves as text/html, but Peter Jipsen's script seems to work well in this context. Currently it requires a Gecko browser such as Firefox, or IE with MathPlayer installed. Hopefully I can extend the script to allow some basic CSS rendering in other browsers later.

I also needed to check I could write posts in emacs of course!

Finally some examples using the MathML in HTML script. 12

$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

[update, continuous Relax NG validation of the mathml in xhtml in atom markup as I type into this emacs buffer. Isn't emacs wonderful. Will post schemas used shortly.]