Back in 2004 I posted some XSLT code, htmlparse.xsl, to do a similar job and used it in a few examples on the xsl-list.
When I needed such a tool for the MathML from Office post, my first thought was to use the htmlparse stylesheet. Unfortunately, the output from Word was too much for the stylesheet, which couldn't cope with the Microsoft conditional declarations.
However, it turned out not to be hard to extend the stylesheet, so I have updated htmlparse.xsl to fix a couple of bugs in attribute handling and to cope with <![if !vml]> and friends. This now gives a pure XSLT route from Word output to XHTML+MathML. Using the same test file as the earlier post, the stylesheet:
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:d="data:,dpc">
  <xsl:import href="../htmlparse/htmlparse.xsl"/>
  <xsl:import href="xhtml-mathml.xsl"/>
  <!-- Read the Word output as text, swap the HTML 4 namespace URI for the
       XHTML one, parse the result with htmlparse, then apply the imported
       templates to the parsed tree. -->
  <xsl:template name="main">
    <xsl:apply-templates select="d:htmlparse(
        replace(
          unparsed-text('test1.htm','windows-1252'),
          'http://www.w3.org/TR/REC-html40',
          'http://www.w3.org/1999/xhtml'))"/>
  </xsl:template>
</xsl:stylesheet>
produces an XHTML+MathML document as shown.
Although it's used here to convert an entire document, TagSoup, which interfaces to XSLT as a SAX parser, is probably preferable in this context. htmlparse (originally written just as an exercise in XSLT2 regular expressions) is probably more useful for parsing small fragments of HTML, as often found in query strings or embedded in CDATA in XML files. It has the advantage of being pure XSLT, not requiring any extension functions or non-standard parser usage.
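As a rough sketch of the fragment case, the stylesheet below copies an XML document through unchanged, except that the text of each description element is run through d:htmlparse and replaced by the parsed nodes. The element name description and the overall document shape are made up for illustration, and the import path is simply the one used in the example above. Only the one-argument call to d:htmlparse shown earlier is assumed.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:d="data:,dpc"
                exclude-result-prefixes="d">

  <!-- Same relative path as in the example above; adjust to your layout. -->
  <xsl:import href="../htmlparse/htmlparse.xsl"/>

  <!-- Start at the document node and process its children. -->
  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical element whose content is HTML held as text,
       for example inside a CDATA section. -->
  <xsl:template match="description">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- string(.) is the unparsed markup; d:htmlparse returns nodes. -->
      <xsl:copy-of select="d:htmlparse(string(.))"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because a CDATA section is just text to the XML parser, string(.) hands the raw markup to d:htmlparse, so the same approach should work equally well for content escaped with &amp;lt; and &amp;gt;.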
2009-12-18: updated location of htmlparse to Google Code.