Tuesday 10 April 2007

XHTML and MathML from Office 2007

All code and examples are now available from the Google Code repository (2013/09/26)

xhtml-mathml stylesheet updated 2007/05/09

I commented on Murray's blog that it ought to be possible to get XHTML/MathML documents out of Word. Having speculated that it ought to be possible, conscience dictated that I try to do it. The results are described below.

The first problem was that I'd never used Word (or any other WYSIWYG editor). It seems strange to me, but then I've been using emacs so long I'm probably corrupted.

Word 2007 has MathML input/output (via an XSL stylesheet installed with the system), and has HTML input/output (via its save as web page file menu), so the plan of action is: save the document as html, clean it up to xhtml, using the stylesheet to convert the mathematics to MathML at the same time.

  1. Write your document in Word 2007 and save it as a web page, file.htm.
  2. Use tagsoup to get some usable XML from this output:
       java -jar tagsoup-1.1.jar --lexical --output-encoding=iso-8859-1 file.htm > temp.xml
  3. Use the supplied xhtml-mathml stylesheet to do some further cleanup and apply the Microsoft-supplied omml2mml.xsl stylesheet to the math fragments:
       java -jar saxon8.jar -o file.xml temp.xml xhtml-mathml.xsl
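
For anyone who wants to script steps 2 and 3, here is a minimal Python sketch that wraps the two commands. The function names (build_commands, convert) are hypothetical, and it assumes java is on the PATH and that the jar files and the stylesheet sit in the current directory under the names used above.

```python
import subprocess
from pathlib import Path


def build_commands(htm, tagsoup_jar="tagsoup-1.1.jar",
                   saxon_jar="saxon8.jar", stylesheet="xhtml-mathml.xsl"):
    """Return the tagsoup command, the saxon command and the temp file path."""
    htm = Path(htm)
    temp_xml = htm.with_name("temp.xml")   # intermediate XML from tagsoup
    out_xml = htm.with_suffix(".xml")      # final XHTML + MathML file
    tagsoup = ["java", "-jar", tagsoup_jar, "--lexical",
               "--output-encoding=iso-8859-1", str(htm)]
    saxon = ["java", "-jar", saxon_jar, "-o", str(out_xml),
             str(temp_xml), stylesheet]
    return tagsoup, saxon, temp_xml


def convert(htm, **kwargs):
    """Run step 2 (tagsoup) and step 3 (saxon) for one Word-generated .htm file."""
    tagsoup, saxon, temp_xml = build_commands(htm, **kwargs)
    # tagsoup writes to stdout, so redirect it into temp.xml as the shell '>' does
    with open(temp_xml, "wb") as out:
        subprocess.run(tagsoup, stdout=out, check=True)
    subprocess.run(saxon, check=True)

# Usage (needs Java and the jars): convert("file.htm")
```

Keeping the command construction separate from the execution makes it easy to print the commands and check them before running anything.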

Example:

  • Word document (docx)
  • "HTML" generated by Word
  • XHTML + MathML

Comments welcome!

40 comments:

Anonymous said...

It's pretty awesome work :)
But I've found at least one flaw with your tool: it does not properly convert double struck letters like the letter for the real number ℝ.

David Carlisle said...

If you make a file available I may (or may not:-) have a look (you'll find a working email for me easily enough from Google).

However, most likely the problem is not in my code. If you cut and paste the equation out of Word, does the resulting MathML have the correct formatting? If not then the bug is in the omml2mml stylesheet supplied by Microsoft. My xhtml-mathml stylesheet does include a couple of bug fixes for omml2mml but for obvious reasons I am not offering long term support for free to Microsoft. If their omml2mml stylesheet produces good MathML then the technique that I showed should produce a good XHTML+MathML document, but if it produces bad MathML then ideally you need to send a bug report to them. But sending it to me may be quicker/more effective, so long as you don't mind me warning in advance that I don't promise support at all...

Anonymous said...

Actually I don't know how cutting and pasting out of Word can result in MathML. I just get a linear representation as I'm using regular text editors or some free maths editors. With the classic "ℝ⊂ℂ⊂ℍ" formula, I just get "R⊂C⊂H".

After investigating a little: in OMML, double-struck letters are represented by a regular letter with a property set on it. In MathML they are represented by specific Unicode characters.
So the omml2mml stylesheet doesn't handle them as it should.

Well, I'm better off using some other tools than Word 2007 and waiting for Microsoft to release a real MathML module for Office.

Anonymous said...

Oops, I made a mistake. Actually, in MathML, ℝ can be input as <mi mathvariant="double-struck">R</mi>, which is equivalent to what I saw in the generated OMML. I suppose you knew that ;)

Still, the property mathvariant isn't set by the stylesheet as one would expect.
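
To make the equivalence concrete, here is a small Python sketch (not part of either stylesheet) that maps a plain letter or digit to the Unicode double-struck character that mathvariant="double-struck" stands for. The function name double_struck is made up for illustration. Seven capitals (C, H, N, P, Q, R, Z) live in the older Letterlike Symbols block; the rest are in the Mathematical Alphanumeric Symbols block.

```python
# Capitals whose double-struck forms are in the Letterlike Symbols block
_LETTERLIKE = {
    "C": "\u2102", "H": "\u210D", "N": "\u2115", "P": "\u2119",
    "Q": "\u211A", "R": "\u211D", "Z": "\u2124",
}


def double_struck(ch):
    """Return the double-struck form of one ASCII letter or digit, else ch unchanged."""
    if ch in _LETTERLIKE:
        return _LETTERLIKE[ch]
    if "A" <= ch <= "Z":
        return chr(0x1D538 + ord(ch) - ord("A"))  # U+1D538 is double-struck A
    if "a" <= ch <= "z":
        return chr(0x1D552 + ord(ch) - ord("a"))  # U+1D552 is double-struck a
    if "0" <= ch <= "9":
        return chr(0x1D7D8 + ord(ch) - ord("0"))  # U+1D7D8 is double-struck 0
    return ch


print("".join(double_struck(c) for c in "R⊂C⊂H"))
```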

David Carlisle said...

I don't know how you entered the expression. Not knowing the Word interface, I just cut and pasted some MathML into Word, which internalised it to OMML, and saved it as docx:

http://www.dcarlisle.demon.co.uk/omml2mml/R.docx

If you cut this to the clipboard it is saved as

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:math>

which is fine, so (unfortunately for me) it means I can't pass this off as a bug in MS's omml2mml stylesheet. Once again it's the fact that the "omml" in the html comments isn't quite like the omml they actually save in the docx file. (I have no idea why they do that, but coping with that is what my stylesheet is supposed to do, I'll look into it)

> Actually I don't know how cutting and pasting out of Word can result in MathML.

You can both cut MathML out of Word and paste it in, but Microsoft cunningly made this a user option that they don't appear to document and hide 50 levels deep in the menus...

Navigate the ribbon thing to "equation tools"; there is a "Tools" section with a little diagonal arrow, and clicking that pops up an "Equation Options" menu. Four or so items down there is a checkbox "copy MathML to the clipboard as plain text", which you want checked.

Oh by the way, it's not a requirement of the site, but having a conversation with anonymous feels odd, do you have a name?:-)

David Carlisle said...

My stylesheet's correction for the "HTML comment" version of OMML wasn't quite good enough. A 3 or 4 line change fixed things and also fixed a bug in the test1.xml given in the original post, where sin and cos were coming out italic instead of roman.

The R.docx in the previous comment now generates

http://www.dcarlisle.demon.co.uk/omml2mml/R.xml

I've updated xhtml-mathml stylesheet and the test1.xml output file at the URI given in the main post.

Thanks for the bug report!

Xiny said...

You're welcome.

I didn't see I could use my google account to post comment, here is my name ;)

Wow! Finally, it works! I can use the intuitive interface of Word to input my equations and get them out of Word as MathML, now with 2 convenient ways to do it. You made my day!

As a sidenote: to enter the double-struck R in Word, you have to navigate a little through the ribbons :)
In the "equation tools" ribbon, click on the 2nd down arrow of the "symbol" section to expand it. Then click on the arrow next to the "Basic Math" title and choose "Scripts". There you go, just click on this famous ℝ. I don't know if there's some input shortcut as in Mathematica ("dsR"). But Mathematica is too big (and too complex) for my little needs.

sidenote 2: it's a shame that Firefox doesn't render ℝ properly with mathvariant set to "double-struck". Well, I can file another bug with the Firefox dev team.

David Carlisle said...

> sidenote 2

They know about that. I meant to add some CSS so it works again; basically you just need:

[mathvariant="double-struck"] {font-family: ... a font that has the glyphs and is likely to be installed...}

I expect filling in the font names is what stopped them putting this in the mathml.css that ships with Firefox. I could however get my stylesheet to add a style element with some likely font families into the generated XHTML page head....

see also

http://groups.google.com/group/netscape.public.mozilla.mathml/browse_thread/thread/dc9f27cc7cc54120/d58ed2efd147cc54

David Carlisle said...

Sorry, the blogger comment seems to have truncated the long URL. Just google for "mathvariant css" and several suggested CSS rules will show up...

Xiny said...

I didn't think of using css to patch misrendering. Thx for the tip. With some work, I'll finally be able to make browsers output some fancy maths equation without using pictures :)

Anonymous said...

According to Firefox 2, an XSLT stylesheet shouldn't provide an XML MIME type; namely, it refuses to display the page http://www.dcarlisle.demon.co.uk/omml2mml/test1.xml unless I remove the type="text/xsl" on the first line. Bug in your code or in Firefox?

David Carlisle said...

Sorry about that. The code was correct, but demon (the ISP hosting that example) changed the MIME type of .xsl files without warning, and have so far refused my requests that they change it back. I'm not pleased with them...

It's a hosting service bundled with the internet connection and doesn't allow end users to change the MIME types. If I can't get them to fix the MIME type of .xsl I'll have to switch the examples to another host.

Anonymous said...

Thanks for sharing the method! I've used it to make some math texts accessible for screen readers and it worked like a charm, in conjunction with MathPlayer.

David Carlisle said...

Glad you got it working!

Anonymous said...

Is it possible to get the MathML of an equation in Word 2007 using Visual Studio Tools for Office?

David Carlisle said...

> Is it possible to get the MathML of an equation in Word 2007 using Visual Studio Tools for Office?

I'd hope so, although I haven't used VS Tools for Office myself. There's some code here to get the XML out of an Office 2007 package, and once you've got the XML, it should be easy enough to access the standard .NET XSLT engine to apply the transformation to MathML.

http://msdn2.microsoft.com/en-us/library/bb497448.aspx

Anonymous said...

Hi David,

Thanks. I wouldn't have even known where to look without this help.

I used the WordOpenXML property, and that works pretty much like you said. It fetches all the XML of the document though.

There does not seem to be an XMLPart corresponding to the equations in a Word document. I hope I am not missing anything here. But I should still be able to work with what I have.

David Carlisle said...

The equations are not in a separate file, they are part of the main document run.

In the docx zip there's a directory "word" containing an XML file, document.xml. The equations are to be found in m:oMathPara and m:oMath elements, which you can extract easily enough with some XPath/XSLT code once you have access to the document.xml.
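
As a rough illustration of that extraction, here is a Python sketch using only the standard library instead of XPath/XSLT. The function name extract_omml is hypothetical; the namespace URI is the OMML one shown in the clipboard output earlier in this thread. It collects every m:oMath element, which also covers those wrapped in m:oMathPara.

```python
import zipfile
import xml.etree.ElementTree as ET

# OMML namespace (the one bound to the m: prefix)
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def extract_omml(docx):
    """Return each m:oMath element in word/document.xml as an XML string.

    docx may be a path or a file-like object holding the .docx zip.
    """
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    # iter() walks the whole tree, so equations inside m:oMathPara are found too
    return [ET.tostring(eq, encoding="unicode")
            for eq in root.iter("{%s}oMath" % M_NS)]

# Usage: for eq in extract_omml("R.docx"): print(eq)
```

Each returned fragment could then be passed to omml2mml.xsl with whatever XSLT processor is at hand.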

Aravind said...

Hello everyone

I have created an app which can convert an existing Word 2007 document into XHTML + MathML (all equations are converted to corresponding MathML). I am making the project open source, so any contributions are welcome.

http://www.codeplex.com/word2mathml

Aravind

Tom Holden said...

Any chance you could update this for the new version of the omml2mml.xsl ( http://blogs.msdn.com/murrays/archive/2008/07/28/improved-mathml-support-in-word-2007.aspx ) and ideally for tagsoup 1.2 and saxon 9?

At the moment I'm getting xml errors inside your file, I'm not sure what's causing it exactly.

David Carlisle said...

Yes, I'll see what I can do...

Tom Holden said...

Great thanks.

LightRook said...

What are you using for XSLT?
With Xalan, I'm getting:

XSLT warning: xsl:output has an unknown method '{null}'. (Occurred in entity '/home/atondwal/xhtml-mathml.xsl', at line 26, column 26.)
XPath error: Use 'self::node()[predicate]' instead of '.[predicate]'.
expression = '.[namespace-uri(.)=('')]' Remaining tokens are: ( '[' 'namespace-uri' '(' '.' ')' '=' '(' '''' ')' ']') (Occurred in entity '/home/atondwal/xhtml-mathml.xsl', at line 55, column 55.)

David Carlisle said...

You need XSLT 2. As I note in the article, I use saxon (saxon 8 back then, 9 now). It's a _long_ while since I have used XSLT 1 except in a browser.

Unknown said...

Looks great in Safari, except for a few minor problems with the display of parentheses, but when I use the latest version of Firefox, which supports MathML, to load http://www.dcarlisle.demon.co.uk/omml2mml/test1.xml, I get the following:
Error loading stylesheet: An XSLT stylesheet does not have an XML mimetype:http://www.dcarlisle.demon.co.uk/omml2mml/pmathml.xsl

Unknown said...

.NET supports XSLT 1.0 only. Can your stylesheet be modified so that it doesn't require XSLT 2.0? I am getting .NET errors while using your stylesheet because of lack of XSLT 2.0 support by .NET.
Thank you..Saf

David Carlisle said...

It would be possible (but I don't plan to do it:-) XSLT 2 has been out for many years (and XSLT 3 is nearly finalised). Saxon provides a very conformant XSLT 2 processor for .NET.

David Carlisle said...

Susan, yes, sorry: that ISP changed the MIME types of the files. I should move them really. The test XML is there and should work if you save it locally.

Mahesh said...

Hello David,

I'm getting "Error loading stylesheet: An XSLT stylesheet does not have an XML mimetype:http://www.dcarlisle.demon.co.uk/omml2mml/pmathml.xsl" while viewing the XHTML+MathML page.

I'm also getting an XSLT compilation exception on your "xhtml-mathml.xsl".

Please help me.

Thanks,
Mahesh

David Carlisle said...

The files have all been moved to googlecode, see updated posting.

Anonymous said...

Hi David, whenever I save a Word document containing a math equation, the equation is stored as an image. I don't get the content below in my .htm file, which is the input file for the Java program you suggested. Please help.
I had to remove all the tags, as the blog would not allow me to post them in a comment.

"p class=MsoNormal !--[if gte msEquation 12] m:oMathPara m:oMath i
style='mso-bidi-font-style:normal' span style='font-family:"Cambria Math","serif"'>m:r f /m:r /span /i m:d m:dPr span
"
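
One quick way to check whether a saved .htm file contains such equation markup is to search it for the conditional comments. The Python sketch below is only an illustration: the function name equation_fragments is made up, and the exact comment delimiters (an opening <!--[if gte msEquation 12]> and a closing <![endif]-->) are an assumption based on the fragment quoted above and on the usual form of Word's conditional comments.

```python
import re

# Assumed shape of Word's equation comments: <!--[if gte msEquation 12]> ... <![endif]-->
EQUATION_COMMENT = re.compile(
    r"<!--\[if gte msEquation 12\]>(.*?)<!\[endif\]-->", re.DOTALL)


def equation_fragments(html):
    """Return the markup found inside each msEquation conditional comment."""
    return [match.group(1) for match in EQUATION_COMMENT.finditer(html)]

# Usage: read file.htm into a string and pass it in. An empty list would
# suggest the file holds only image fallbacks for its equations.
```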

mjz said...

Dear David,

I saw that you used the omml2mml.xsl stylesheet in your project. I want to use the omml2mml.xsl and mml2omml.xsl files in an open source project, and I am looking for their licence. I wrote to some people at Microsoft but nobody answered.

Could you please share with me any information if you have about their licence?

Thanks in advance,
Mahdi