Sunday, 29 April 2007

XHTML and MathML from OpenOffice.org 2.2

In the comments on Murray's blog I noted that (given a fixed stylesheet that does the conversion) the workflow for extracting MathML from an XML file that contains MathML is pretty much the same as for one that contains some XML transformable to MathML.

This was in reply to a comment that Science (and, it turns out, Nature) don't accept the Office 2007 native format, at least partly due to a perceived lack of MathML support.

So I thought I'd repeat my recent experiment using OpenOffice.org rather than MS Office 2007.

I took test1.docx, the test file from the earlier post, and saved it from Office 2007 as .doc (which warned me that the mathematics would be turned into pictures). I then imported it into OpenOffice.org, regenerated the mathematics using the formula editor provided, deleted the Word-generated images, and saved the result as .odt. Now the aim of the game is to get an XHTML+MathML document, starting from here...

Step 1
Save as HTML. This produces a passable approximation to HTML (not valid, but not quite so fanciful as the output from MS Word). The mathematics is all images, and unlike Word's output there is no comment or other markup (not even an alt attribute) to give an alternative format. However, each image has a name attribute which gives the original object name, so we can retrieve the MathML from the .odt file.
Step 2
Unzip the odt file.
Step 3
Add an (empty) file math.dtd to several subdirectories extracted from the zip file, so that the content.xml files containing MathML, which refer to that DTD, can be parsed. (A catalogue would be a more sensible alternative to supplying multiple copies of the DTD.)
Step 4
Run a small stylesheet that imports htmlparse.xsl to convert the HTML to XHTML, replaces the math images with their matching MathML fragments, and does a bit more cleanup on the XHTML to make it valid.
Step 5
There is no step 5:-)
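
Steps 2 and 3 are simple enough to script. What follows is only a minimal sketch in Python, not what was actually used for this post: the function name unpack_odt is made up for illustration, the output directory odt-unzip is taken from the stylesheet below, and the rule "add math.dtd to every subdirectory that holds a content.xml" is an assumption about which subdirectories need it.

```python
import zipfile
from pathlib import Path


def unpack_odt(odt_path, out_dir="odt-unzip"):
    """Unzip an .odt file and drop an empty math.dtd next to each
    embedded object's content.xml.

    Hypothetical helper: returns the list of math.dtd files it created.
    """
    out = Path(out_dir)

    # Step 2: an .odt file is an ordinary zip archive.
    with zipfile.ZipFile(odt_path) as odt:
        odt.extractall(out)

    created = []
    # Step 3 (assumption): embedded objects sit one level down, in
    # subdirectories such as "Object 1/", each with its own content.xml.
    for content in sorted(out.glob("*/content.xml")):
        dtd = content.parent / "math.dtd"
        if not dtd.exists():
            dtd.write_text("")  # an empty file is all that is needed
            created.append(dtd)
    return created
```

Calling unpack_odt("test1.odt") (the file name is a guess) would leave the tree that the stylesheet in step 4 reads from. A catalogue, as mentioned in step 3, would avoid creating the extra files at all.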

The stylesheet in step 4 looks like this:

<xsl:stylesheet 
    version="2.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:h="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="data:,dpc"> 
  <xsl:import href="../htmlparse/htmlparse.xsl"/>
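  <!-- Entry point: parse the saved HTML and apply the templates below to the result -->
  <xsl:template name="main">
    <!-- Emit an xml-stylesheet processing instruction pointing at pmathml.xsl -->
    <xsl:processing-instruction name="xml-stylesheet"
     >type="text/xsl" href="pmathml.xsl"</xsl:processing-instruction>
    <xsl:apply-templates select="d:htmlparse(
     unparsed-text('test1oo.html','windows-1252'))"/>

  </xsl:template>
  <!-- Identity template: copy everything that is not handled more specifically -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Cleanup: lowercase the values of these attributes -->
  <xsl:template match="@align|@valign|@dir|@frame|@rules">
    <xsl:attribute name="{name()}" select="lower-case(.)"/>
  </xsl:template>

  <!-- Replace each formula image by the content.xml of the matching object in the unzipped .odt -->
  <xsl:template match="h:img[starts-with(@name,'Object')]">
    <xsl:copy-of select="doc(concat('odt-unzip/Object ',
                             substring-after(@name,'Object'),
                             '/content.xml'))"/>
  </xsl:template>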
</xsl:stylesheet>
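
The only OpenOffice.org-specific part of the stylesheet is the last template, which turns an image name into the path of a file in the unzipped .odt. As a rough illustration, the same lookup can be written in Python. The function name object_path is hypothetical, and the sample image name "Object1" is an assumption inferred from the XPath rather than copied from the generated HTML.

```python
def object_path(img_name, unzip_dir="odt-unzip"):
    """Mirror the concat()/substring-after() expression in the stylesheet.

    Returns None for names the h:img template would not match.
    """
    # The template only matches images whose name starts with 'Object'.
    if not img_name.startswith("Object"):
        return None
    # substring-after(@name, 'Object'): everything after the first 'Object'.
    suffix = img_name.split("Object", 1)[1]
    # concat('odt-unzip/Object ', suffix, '/content.xml')
    return f"{unzip_dir}/Object {suffix}/content.xml"


print(object_path("Object1"))  # odt-unzip/Object 1/content.xml
```

Images with any other name fall through to the identity template and are copied unchanged, which is why the function returns None for them.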

The result of the transformation looks like this.

Comparing the stylesheet here with the one in the earlier post, you'll see that the complexity in either case is about the same. Word writes all the information into a single file (as comments), which can be a bit more convenient, but OpenOffice.org places the MathML directly in the zip file, which may also have advantages. Word's HTML output takes a lot more cleaning up to be valid HTML (or XHTML), but there are tools around that do that, more or less.

Rob Weir has a recent blog entry picking up on the MathML support as a reason to go with ODF. I don't buy the argument, to be honest. It's not surprising that publications are not set up to accept the zipped XML formats from any of these systems. Publishers' in-house document processes are usually somewhat complicated and finely tuned. Tooling up to accept any new format will take them time. There are plenty of reasons to argue over which format is better in any given situation, but a general statement that storing MathML is good, and storing something transformable to MathML is bad, doesn't really convince me of anything, sorry!

4 comments:

Anonymous said...

Looking at your example with a browser that supports MathML we see:

Error loading stylesheet: An XSLT stylesheet does not have an XML mimetype:http://www.dcarlisle.demon.co.uk/omml2mml/pmathml.xsl

Aravind said...

Hello everyone

I have created an app which can convert an existing Word 2007 document into XHTML + MathML (all equations are converted to corresponding MathML). I am making the project open source, so any contributions are welcome.

http://www.codeplex.com/word2mathml

Aravind

Anonymous said...

What XSLT processor are you using? I'm using Saxon and htmlparse.xsl gives this error:

Error at xsl:variable on line 117 of file:/home/igor/htmlparse/htmlparse.xsl:

Error in expression '(\i\c*)\s*(=\s*("[^"]*"|''[^'']*''|\c+))?\s*': Unexpected token <literal> beyond end of expression

Transformation failed: Run-time errors were reported

David said...

That error message comes from Saxon 6 (or at least I get that error if I use Saxon 6).

Saxon 6 is an XSLT 1 processor and this is an XSLT 2 stylesheet, so you need Saxon 9.