Friday 14 December 2007

Three new W3C Working Drafts

Today the W3C published three new working drafts:

MathML3: Several new features were worked on this time: more Content MathML improvements, more information on layouts for elementary mathematics (long division, etc.), and the first draft of a Relax NG schema. We have also reinstated the XHTML+MathML version of the spec.

A MathML for CSS profile: I'm down as co-editor of this, but the main credit should go to George Chavchanidze of Opera Software, who is continuing his long-standing work on rendering mathematics using pure CSS declarations.

XML Entity definitions for Characters: The latest iteration of the definitions of characters. This was formerly part of the MathML spec (Chapter 6) but has been separated out and extended to include all the ISO entity sets and the HTML entity sets. I still hope that eventually this can be a joint ISO/W3C publication, updating ISO/IEC TR 9573-13. We'll see...

Sunday 2 December 2007

Mathematics in PowerPoint 2007

I've never used PowerPoint, but I've recently been investigating the mathematical, and in particular MathML, capabilities of the Office 2007 suite.

It's been noted in several places that PowerPoint doesn't support the new Word 2007 math zones, and that if you cut and paste a math expression from Word to PowerPoint the result is an image. That means you can't edit it or search on it in that form, and it looks horrible on screen, especially if you have a background colour or textures applied to your slides.

This note is just to mention a mechanism for getting correctly rendered, editable mathematical text into PowerPoint, in a form which has the full OMML XML markup in the pptx file, so you can extract that and convert it to MathML if needed using the Microsoft-supplied stylesheet. I suspect that this mechanism is (or ought to be) well known to anyone (not me!) who's used PowerPoint, but a Google search didn't show up anyone else mentioning it in this context, so I thought I'd post...

To get a math expression inserted, don't cut and paste a math zone from Word. Instead, in PowerPoint choose
Insert / Object / Microsoft Word Document
then make a 'document' consisting of the equation you need using the embedded copy of Word. The resulting equation will be saved as an embedded object, and if you unzip the pptx PowerPoint file you will find a docx version of the embedded object in the embeddings directory, which you can further unzip to locate the OMML math XML.
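
The double unzip can also be scripted. The following is only a sketch in Python, not part of the original experiment: it assumes the embedded documents sit under ppt/embeddings/ with a .docx extension and keep their body in word/document.xml, and the function name extract_math is made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the Office math markup (the m: prefix in document.xml).
OMML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def extract_math(pptx_path):
    """Return the serialized m:oMath elements found in embedded Word documents.

    Assumption: embedded .docx files live under ppt/embeddings/ and each
    keeps its body in word/document.xml.
    """
    found = []
    with zipfile.ZipFile(pptx_path) as pptx:
        for name in pptx.namelist():
            if not (name.startswith("ppt/embeddings/") and name.endswith(".docx")):
                continue
            # The embedded object is itself a zip file, so unzip it in memory.
            with zipfile.ZipFile(io.BytesIO(pptx.read(name))) as docx:
                root = ET.fromstring(docx.read("word/document.xml"))
            for math in root.iter("{%s}oMath" % OMML_NS):
                found.append(ET.tostring(math, encoding="unicode"))
    return found
```

Calling extract_math("slides.pptx") (a hypothetical file name) would return one XML string per equation, each of which could then be fed to the conversion stylesheet.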

The resulting equation renders as text rather than an image and may be edited at any time later: just click on it and you get thrown into a copy of Word.

Screen shot with one of Word's example equations rendered twice in a PowerPoint slide, once as an embedded object and once as an image.

Thursday 29 November 2007

XML Entity definitions for Characters

In many contexts people find it convenient to enter characters that are not on the keyboard as entity references, such as &rightarrow; to get an arrow, rather than remembering what keyboard shortcut or numeric reference (&#x2192;) would produce it. In many cases, life would be simpler if people did not do this. Having entity references means that you need a <!DOCTYPE declaration to reference a DTD that defines the entities, and you need your XML parser to read the DTD. It also makes processing fragments of XML much harder. Either the fragments do not have a <!DOCTYPE (in which case they are not, themselves, well formed) and the fragment pasting operation needs to ensure that a suitable DTD reference is placed on the target document, or the fragments do have a doctype, and the fragment pasting needs to strip this off and still ensure that the target document has a compatible DTD.
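
The difference is easy to see with any XML parser. The sketch below uses Python's standard library purely as an illustration; the html module's entity table is the later HTML5 one, so its final line shows the name-to-character mapping rather than anything available when this was written.

```python
import html
import xml.etree.ElementTree as ET

# A numeric character reference needs no DTD: any XML parser resolves it.
numeric = ET.fromstring("<p>&#x2192;</p>")
print(repr(numeric.text))

# The named reference is only defined by a DTD. Without a DOCTYPE the
# fragment is not well formed and the parser rejects it.
try:
    ET.fromstring("<p>&rightarrow;</p>")
    named_ok = True
except ET.ParseError as err:
    named_ok = False
    print("parse error:", err)

# A decoder with a built-in entity table (here the HTML5 one) maps the
# name to the same character as the numeric reference.
print(repr(html.unescape("&rightarrow;")))
```

The fragment with the numeric reference can be pasted anywhere; the one with the named reference only parses in a document whose DTD defines rightarrow.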

If fragments are being moved from one place to another this can be difficult. Consider moving a fragment of MathML from an XHTML document to DocBook, for example. XHTML and DocBook define entities that in several cases have the same name but different definitions: the original ISO definitions of the entity names did not give definitions in terms of Unicode characters, and older versions of Unicode did not have sufficient technical symbols to give sensible definitions for most of these names.

All of which preamble is just leading up to saying that Unicode 5.1 (Beta) does have suitable characters for all the ISO and MathML entities.

The Entity draft at http://www.w3.org/2003/entities has thus been updated to a new "2007" version, which we (the W3C Math WG) hope to submit to W3C as a new Recommendation track document shortly, but you can view my Editor's Draft here.

MathML3 will hopefully use these by reference. If (X)HTML (and possibly other non-W3C systems such as DocBook) could do the same, then hopefully we would finally have a set of entity names with widespread, consistent use across multiple languages. Hopefully.

Over the years I've been maintaining these sets we've kept in fairly regular contact with the STIX group, and the tables of characters in the above document include characters typeset with the STIX Fonts (if you have them installed, and they work in your browser). The plane 1 characters still fail for me in all browsers on Windows.

Comments are welcome, either in this blog, or better on www-math@w3.org.

Tuesday 27 November 2007

More STIX Experiments

Having posted a small test file to the STIX comment page, I noticed that the STIX site has a larger test file: choose "STIX Font Glyph Tables" from the menu options on the STIX site, which will take you to
http://www.stixfonts.org/allGlyphs.html

This differs from my test (apart from being rather larger) in that each cell is individually assigned an appropriate font with separate CSS classes rather than having a single CSS font list and relying on the font choice to fall through fonts that do not have the appropriate glyph. The results are different but still a bit disappointing.

Three images of my Windows XP setup: one of the Font directory, just to show the fonts are there; one of the top of the allGlyphs file, shown side by side in Firefox 2.0.0.9, Opera 9.21 and IE 7.0.5730.11 (note that only Firefox actually shows the bold and italic variants); and then the same three browsers showing the bottom of the file. Here Firefox is all white, Opera is white for some characters and shows two missing glyph markers for others, and IE is mainly white but, strangely enough, actually renders the last few entries, which are monospace digits.

The images link to larger screen dumps which show (or rather do not show) the glyphs more clearly.

Monday 19 November 2007

Opera joins the party

http://my.opera.com/desktopteam/blog/2007/11/16/even-more-work

In this build, we added MathML support out of the box

Wednesday 7 November 2007

Stix fonts: Initial comments

After a wait of many years the STIX fonts have finally been released as a public beta! These offer the promise of much better, more portable support for scientific documents on the web and elsewhere. The STIX fonts are a uniform set of fonts that provide glyphs for almost all the mathematical characters that have been added to Unicode in recent years.

Initial testing suggests there may still be some problems with plane 1 characters. I just submitted the following comment to the STIX beta test comment form:

I tested browser support (Windows XP: Firefox, IE 7, Opera) with similar results in all three browsers. The quadruple integral displayed in all three browsers. The plane 1 alphabetic character did not display at all: Firefox displays 2 missing glyph boxes, IE shows 1 missing glyph box, and Opera just shows white space.

I am not able to tell at this stage whether this is user error (have I not specified a sufficiently large set of the STIX fonts in CSS?), or due to a lack of plane 1 support in the browsers, or a problem with the Unicode tables in the STIX fonts, so I supply the small test file below, which I expect to render as

QUADRUPLE INTEGRAL OPERATOR

MATHEMATICAL BOLD CAPITAL A

<html>
<head>
<title>stix</title>
<style>
p {
font-family: STIXGeneral, STIXGeneral-Italic, STIXGeneral-Bold, STIXGeneral-BoldItalic;
}
</style>
</head>
<body>
<p>x2a0c &#x2a0c;</p>
<p>x1d400 &#x1d400;</p>
</body>
</html>
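
For reference, the sketch below (Python, standard library only) shows what distinguishes the two test characters: the second lies outside the Basic Multilingual Plane, so software that works in UTF-16 sees it as a surrogate pair. It does not diagnose the browser behaviour described above; it only makes the plane 1 distinction concrete.

```python
import unicodedata

quad = "\u2a0c"        # BMP character: fits in one UTF-16 code unit
bold_a = "\U0001d400"  # plane 1 character: beyond U+FFFF

for ch in (quad, bold_a):
    units = len(ch.encode("utf-16-be")) // 2
    print("U+%04X %s: plane %d, %d UTF-16 code unit(s)"
          % (ord(ch), unicodedata.name(ch), ord(ch) >> 16, units))

# The surrogate pair that UTF-16 based software sees for U+1D400.
pair = bold_a.encode("utf-16-be").hex()
print(pair)
```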

Friday 5 October 2007

New MathML Draft

The W3C just published the latest draft of MathML 3. As ever, comments are welcome on the www-math@w3.org mailing list.

Tuesday 12 June 2007

XML position at NAG

NAG is looking for someone to work in the XML Technologies Group, that is, "my" group. So if you are interested, please contact the address specified in the above posting.

Tuesday 5 June 2007

The Big Switch

There was a thread a while ago on xsl-list discussing when was a good time to switch from XSLT1 to XSLT2. The consensus seemed to be, "it depends...."

For NAG, the answer is NOW!

After a lot of regression testing, staring at diff files and crossing of fingers, we just switched the processor for our main stylesheets over from saxon 6 to saxon 8.

This is quite a large set of stylesheets (around 26K lines of XSLT), so it is quite a substantial test of XSLT2's compatibility with XSLT1. The results were pretty good, although that's not exactly surprising: an earlier version of these stylesheets was used to request improvements to backward compatibility mode as defined in an earlier draft, and all the main changes requested in that report were made.

We are taking a multi-stage process to switching to version 2:

  1. Import an XSLT2 stylesheet that defines a few extension functions in the saxon 6 extension namespace using xsl:function and standard function calls. saxon:node-set defined as identity function, saxon:tokenize defined using XPath2 tokenize, saxon:distinct defined using XPath2 distinct-values etc.
  2. Process the stylesheets with saxon 8 over the NAG Library documentation for Fortran and C and check that the results are the same.
  3. The only significant difference was due not so much to a change of language but to the change of processor. Saxon has changed its default ordering for xsl:sort. As documented, adding lang="en" reverted the behaviour.
  4. Start using XSLT2 features in the stylesheet, and start removing the existing calls to saxon6 extension functions. (This is where we are now.) This will eventually remove the need for the functions defined in step 1.
  5. (Some time soon.) Change the specified version on the stylesheets from 1.0 to 2.0. This will turn off BC mode and will require another round of regression testing. As noted in the old email above, the stylesheets often pass parameters to named templates that do not define those parameters, which will become an error. These will be easy to find, as the error is fatal, so you just fix them. Harder will be finding all the places where XSLT1's implicit "first node in document order" semantics have been used. Fortunately, in our case we have a large but relatively stable document collection to process, so it's feasible to make this change and then run automated comparisons over the results of processing the entire collection in the two modes.
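
The automated comparison in step 5 could be as simple as the following sketch. It is written in Python rather than being the tooling actually used at NAG, and it assumes the two runs wrote their output into two parallel directory trees; the function name changed_files is invented for the example.

```python
import filecmp
import os


def changed_files(dir_a, dir_b):
    """List relative paths under dir_a whose bytes differ from, or are missing in, dir_b."""
    changed = []
    for folder, _subdirs, files in os.walk(dir_a):
        for name in files:
            path_a = os.path.join(folder, name)
            rel = os.path.relpath(path_a, dir_a)
            path_b = os.path.join(dir_b, rel)
            # shallow=False forces a byte-by-byte comparison.
            if not os.path.exists(path_b) or not filecmp.cmp(path_a, path_b, shallow=False):
                changed.append(rel)
    return sorted(changed)
```

An empty list from changed_files("out-bc-mode", "out-2.0") (hypothetical directory names) would mean that turning off backward compatibility mode changed nothing in the generated documentation.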

This is far from being the first use of XSLT2 at NAG, but it's the first time we've switched a stylesheet collection from version 1 to version 2. So far things have gone pretty smoothly...

Tuesday 29 May 2007

The EXSLT node-set function

This follows on from a thread on xsl-list which turned to the question of support for the useful xx:node-set() extension function in the XSLT engines used by popular browsers.

Using XPath extension functions in a cross-browser environment is greatly simplified if all the XSLT engines use the same extension namespace. This was one of the motivating reasons behind the community initiative to standardise on the EXSLT extension namespaces.

Opera's XSLT engine supports exslt:node-set. Mozilla/Firefox's XSLT engine doesn't in the current release, but does in the "Gran Paradiso" alpha tests for Firefox 3. Internet Explorer (6 and 7) doesn't support EXSLT, but does support the functionally identical msxsl:node-set function in the usual msxsl extension namespace.

The usual way to handle exslt:node-set and msxsl:node-set in the same document is to use xsl:choose blocks, with tests on function-available('exslt:node-set'), but that is often inconvenient if you want to use xx:node-set in the middle of an XPath.

In the above XSL-List thread I casually suggested that an alternative would be to just always use exslt:node-set in the body of the stylesheet and use the msxsl:script extension to define exslt:node-set for IE. That turned out not to be as easy as I thought, as node-set isn't a valid function name in either of the supported extension languages in msxsl (JScript or VBScript). However, Julian Reschke came up with the construct needed: use associative array syntax, so you can use ['node-set'] to define the function. A complete stylesheet using this technique is shown below.

<xsl:stylesheet
  version="1.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exslt="http://exslt.org/common"
  xmlns:msxsl="urn:schemas-microsoft-com:xslt"
  exclude-result-prefixes="exslt msxsl">
  

<msxsl:script language="JScript" implements-prefix="exslt">
 this['node-set'] =  function (x) {
  return x;
  }
</msxsl:script>


<xsl:variable name="x">
  <y/>
</xsl:variable>

<xsl:template match="x">
  <html>
    <head><title>test exslt node set</title></head>
    <body>
      <xsl:apply-templates select="exslt:node-set($x)/*"/>
    </body>
  </html>
</xsl:template>

<xsl:template match="y">
  <p>node set!</p>
</xsl:template>

</xsl:stylesheet>

The same stylesheet is online together with a test file which just consists of a stylesheet reference and an empty element <x/>.

If your browser supports EXSLT (either natively or using the technique above), viewing the test file link should show "node set!". This appears to work in Firefox Gran Paradiso, Internet Explorer 7 and Opera 9.21. It doesn't work in Firefox 2 (you get an error message about an unsupported extension function). Users of other browsers that support XSLT (Safari?) feel free to report any results in the comments!

Updated 2008-08-06 to move test file to a different server.

Updated 2009-12-17 to move to Google Code server. The file now reports that Safari 3.2 supports node-set.

Wednesday 9 May 2007

schematron updates

Ken Holman posted a problem with the skeleton which also applied to my version described in an earlier post, so I updated the code, at the same URI. There have also been some updates in the last few days removing saxon dependencies, as described in the comments; thanks to Colin Adams for picking those up. [updated again 2007/05/09] A couple of errors had crept into the schematron-get-full-path mode used to display the XPath of the current node.

Sunday 29 April 2007

XHTML and MathML from OpenOffice.org 2.2

In the comments on Murray's blog I noted that (given a fixed stylesheet that does the conversion) the workflow to extract MathML from an XML file that contains MathML is pretty much the same as the workflow for one that contains some XML transformable to MathML.

This was in reply to a comment that Science (and it turns out, Nature) don't accept Office 2007 native format, partly at least due to a perceived lack of MathML support.

So, I thought I'd repeat my recent experiment using OpenOffice.org rather than MS Office 2007.

I took the test1.docx file used as the test file in the earlier post, saved it (using Office 2007) as .doc (which warned me that the mathematics would get turned into pictures) and then imported it into OpenOffice.org, regenerated the mathematics using the formula editor provided, deleting the Word generated images, and saved as .odt. Now the aim of the game is to get an XHTML+MathML document, starting from here...

Step 1
Save as html. This produces a passable approximation to HTML (not valid but not quite so fanciful as the output from MS Word). The Mathematics is all images, and unlike Word there is no comment or other markup (not even an alt attribute) to give an alternative format. However there is a name attribute which gives the original object name, so we can retrieve the MathML from the .odt file.
Step 2
Unzip the odt file.
Step 3
Add an (empty) file math.dtd to several subdirectories extracted from the zip file, to make the content.xml files containing MathML well formed. (A catalogue would be a more sensible alternative to supplying multiple copies of the dtd).
Step 4
Run a small stylesheet that imports htmlparse.xsl to convert the HTML to XHTML, replaces the math images by their matching MathML fragments, and does a bit more cleanup on the xhtml to make it valid.
Step 5
There is no step 5:-)
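
Steps 2 and 3 are mechanical enough to script. The sketch below does them in Python rather than by hand; it assumes, as the stylesheet that follows does, that each formula is stored as "Object N/content.xml" inside the zip file, and the function name unpack_odt is invented for the example.

```python
import os
import zipfile


def unpack_odt(odt_path, out_dir):
    """Unzip an .odt file and drop an empty math.dtd beside each embedded content.xml.

    Returns the list of math.dtd files created. Assumption: formula objects
    live in subdirectories such as "Object 1/content.xml".
    """
    with zipfile.ZipFile(odt_path) as odt:
        odt.extractall(out_dir)
    created = []
    top = os.path.abspath(out_dir)
    for folder, _subdirs, files in os.walk(out_dir):
        # Skip the top-level content.xml, which is the text document itself.
        if "content.xml" in files and os.path.abspath(folder) != top:
            dtd = os.path.join(folder, "math.dtd")
            open(dtd, "w").close()  # per step 3, an empty file is all that is added
            created.append(dtd)
    return sorted(created)
```

Running unpack_odt("test1oo.odt", "odt-unzip") (the .odt name is a guess based on the HTML file name used below) would leave the tree in the shape the stylesheet's doc() calls expect.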

The stylesheet in step 4 looks like this:

<xsl:stylesheet 
    version="2.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:h="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="data:,dpc"> 
  <xsl:import href="../htmlparse/htmlparse.xsl"/>
  <xsl:template name="main">
    <xsl:processing-instruction name="xml-stylesheet"
     >type="text/xsl" href="pmathml.xsl"</xsl:processing-instruction>
    <xsl:apply-templates select="d:htmlparse(
     unparsed-text('test1oo.html','windows-1252'))"/>

  </xsl:template>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@align|@valign|@dir|@frame|@rules">
    <xsl:attribute name="{name()}" select="lower-case(.)"/>
  </xsl:template>

  <xsl:template match="h:img[starts-with(@name,'Object')]">
    <xsl:copy-of select="doc(concat('odt-unzip/Object ',
                             substring-after(@name,'Object'),
                             '/content.xml'))"/>
  </xsl:template>
</xsl:stylesheet>

The result of the transformation looks like this.

Comparing the stylesheet here with the one in the earlier post you'll see that the complexity in either case is about the same. Word writes all the information into a single file (as comments) which can be a bit more convenient, but OpenOffice.org places the MathML directly in the zip file which may also have advantages. Word's HTML output takes a lot more cleaning up to be valid HTML (or XHTML) but there are tools around that do that, more or less.

Rob Weir has a recent blog entry picking up on the MathML support as a reason to go with ODF. I don't buy the argument, to be honest. It's not surprising that publications are not set up to accept the zipped XML formats from any of these systems. Publishers' in-house document processes are usually somewhat complicated and fine-tuned. Tooling up to accept any new format will take them time. There are plenty of reasons to argue over which format is better in any given situation, but a general statement that storing MathML is good, and storing something transformable to MathML is bad, doesn't really convince me of anything, sorry!

Friday 27 April 2007

MathML 3.0 and MathML for CSS

Two new working drafts issued today: MathML 3.0 and A MathML for CSS profile. These drafts are the first public working drafts by the current incarnation of the Math Working Group. No doubt I will be posting more about those later; comments are of course welcome either here or (better) on the www-math list.

Monday 23 April 2007

htmlparse, updated

There are two well known open source tools for cleaning up and fixing HTML to produce XML suitable for further processing, tidy and TagSoup.

Back in 2004 I posted some XSLT code, htmlparse.xsl, to do a similar job and used it in a few examples on the xsl-list.

When I needed such a tool for the MathML from Office post, my first thought was to use the htmlparse stylesheet. Unfortunately the output from Word was too much for the stylesheet, which couldn't cope with the Microsoft conditional declarations.

However it turned out not to be hard to extend the stylesheet, so I have updated htmlparse.xsl to fix a couple of bugs in attribute handling, and to extend it to cope with <![if !vml]> and friends. This now gives a pure XSLT route of getting from Word output to XHTML+MathML. Using the same test file as the earlier post, the stylesheet:

<xsl:stylesheet 
   version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:d="data:,dpc"> 
  <xsl:import href="../htmlparse/htmlparse.xsl"/>
  <xsl:import href="xhtml-mathml.xsl"/>
  <xsl:template name="main">
    <xsl:apply-templates select="d:htmlparse(
              replace(
                  unparsed-text('test1.htm','windows-1252'),
                  'http://www.w3.org/TR/REC-html40',
                  'http://www.w3.org/1999/xhtml'))"/>
  </xsl:template>
</xsl:stylesheet>

Produces an XHTML+MathML document as shown.

Although it's used here to convert an entire document, TagSoup, which interfaces to XSLT as a SAX parser, is probably preferable in this context. htmlparse (originally written just as an exercise in XSLT2 regular expressions) is probably more useful for parsing small fragments of HTML, as often found in query strings or embedded in CDATA in XML files. It has the advantage of being pure XSLT, not requiring any extension functions or non-standard parser usage.

2009-12-18 updated location of htmlparse to google code

Wednesday 18 April 2007

Testing the emacs interface

Posting from emacs, editing existing post from emacs

MathML in HTML

As mentioned in an earlier post, I'm experimenting with Peter Jipsen's script for using MathML in posts served as text/html. The original script assumed a MathML-enabled browser; I've just added some conditional code so that if Gecko or MathPlayer are not detected, some CSS, mainly due to George Chavchanidze, is inserted. It doesn't do all of presentation MathML yet, but it renders fractions, msqrt and superscripts well enough to make the above post more or less legible in Opera. Things need a bit more tuning yet (I seem to have introduced a noticeable delay before the mathematics is rendered, in all browsers), but progress, I think...

Saturday 14 April 2007

Schematron Experiments

Rick Jelliffe, acting as editor of ISO Schematron, has a call for corrections and improvements to schematron, and is also working on updating the XSLT skeleton implementation; see this thread on the schematron-love list. This got me looking at schematron again, after a gap of some years...

The one suggestion that I made for a specification change is to allow schematron contexts to match text nodes. The current situation is very strange: schematron contexts (in XSLT1 or XSLT2 bindings) have (exactly) the syntax of XSLT patterns, but with an extra semantic restriction (not made explicitly in the specification) that any text nodes that are matched are ignored. The reference schematron implementation (current and previous versions, as far as I have seen) has implemented this inconsistently, applying schematron rules to text nodes (due to the fact that the default template for each pattern selects on node()), but not if the parent of the text node is the context of a schematron rule, as then the recursion is via *|comment()|processing-instruction(). Rick has put this forward as an official change request to the working group, but seems unconvinced by the change. We'll see!

Meanwhile, looking at the code: I've not looked at the schematron implementation since I made the schematron-report (which Google tells me was back in 1999, doesn't time fly:-). I think it's changed quite a bit since then, especially to accommodate the "skeleton" paradigm, but the basics are the same.

A trial implementation

There's been quite a bit of discussion on the schematron-love list about implementation changes, and posted code, but a mailing list isn't always the best place to discuss code samples, so I thought I'd try posting here.

Schematron allows you to specify several patterns, and each rule has a context, at which point several assertions can be made about XPath expressions. As noted above, a context is more or less identical to an XSLT pattern, and the implementation of each rule in a pattern is an XSLT template with a unique priority and a mode corresponding to the pattern to be used. For each pattern in the schematron schema, a full recursive tree walk is made via templates which process a node and then recursively process attributes and children. There are optimisation flags to try to speed this up, in particular by omitting attribute processing, and the specification that text nodes should not be visited is partly due to a concern over this processing speed.

I argued on the list that visiting text nodes was not a serious time concern compared to other issues, and certainly not enough to distort the language. But for once I decided to heed Mike Kay's oft-cited advice on xsl-list that performance questions should be answered by measuring timings.

I have made a modified schematron skeleton implementation that takes a number of parameters that control the way that the schematron patterns are implemented. Also available is a test document and test schema used in the test below.

The parameters controlling the generation of the XSLT are described below.

visit-text
This parameter defaults to 'false', at which setting text nodes are not visited when establishing the contexts for schematron patterns. This is the behaviour currently mandated by the specification. If it is set to 'true', text nodes are visited, which is the behaviour put forward as a suggestion for a future schematron update. In some cases omitting text nodes is implemented by selecting *|comment()|processing-instruction() rather than node() (this does appear to save some time), but in other cases it is necessary to add a filter [not(self::text())] to explicitly skip text nodes, in which case not visiting text nodes may take extra time due to the extra check. Compare the times for 2 and 3, and for 6 and 7, below. In all cases, though, on this test file the difference is only around 50ms in 1000, so not really significant enough to affect the language design.
attributes
Normally, when recursing down the tree, both the attribute and child nodes are selected; setting this to 'false' will cause the attributes to be omitted (the existing iso_schematron_skeleton has a similar parameter). One difference is that this defaults to 'true' if it is safe to do so (if no context in the schema contains '@' or 'attribute').
only-child-elements
Similar to the attributes parameter above, this causes a test of node() to be changed to * so that just elements are visited on the child axis. Again this defaults to 'true' if it is safe to do so (no context contains a '(').
select-contexts
Perhaps the most radical option. Traditionally schematron has been implemented by a walk over the tree where each step is accomplished by a template which processes a node, and then recursively applies templates on child nodes and attributes. However, this recursion is tail recursive and so is effectively an iteration, so an alternative implementation strategy is to first select all the nodes that need to be processed, and then process them with a non-recursive template that processes a single node and does not call apply-templates.
This defaults to '', which enables the classic recursive behaviour.
If set to '//' then an XPath is constructed from all the rule contexts in a pattern and used in a single select, so if a pattern has rules with contexts a, b and c, a select="//(a|b|c)" is generated. This method (on this test) leads to the quickest processing, but the difference over the default behaviour is not that great, and unfortunately it requires the XSLT2 binding to allow the general step using (). Generating an XPath 1 expression would require more parsing of the context than is really convenient in XSLT.
If set to 'key' the behaviour is similar to the behaviour with '//' but rather than use an XPath using // an XSLT key is generated. This has the advantage that it is much easier to generate legal XSLT1, and my intuition was that this would be the fastest so long as memory for the key did not get out of hand. It seems that my intuition here was wildly, spectacularly wrong. As can be seen from the timings below, runs 4 and 5, using this key setting take around 20 times longer.
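
To make the two strategies concrete, here is a small sketch in Python rather than XSLT. It is not the skeleton code: matching on element names stands in for real XSLT pattern matching, and the function names are invented. It builds the select expression used by the '//' option, then checks that "walk recursively" and "select first, then loop" visit the same nodes in the same order.

```python
import xml.etree.ElementTree as ET


def build_select(contexts):
    """Join rule contexts into the single XPath 2 select used by the '//' option."""
    # A parenthesised union used as a path step is XPath 2 syntax.
    return "//(" + "|".join(contexts) + ")"


def walk(node, names, visit):
    """Classic strategy: visit a node, then recurse into its children."""
    if node.tag in names:
        visit(node)
    for child in node:
        walk(child, names, visit)


def select_then_process(root, names, visit):
    """Alternative strategy: select every candidate first, then loop without recursion."""
    for node in [n for n in root.iter() if n.tag in names]:
        visit(node)


doc = ET.fromstring("<book><a><b/><c/></a><d><a/></d></book>")
recursive, flat = [], []
walk(doc, {"a", "b", "c"}, lambda n: recursive.append(n.tag))
select_then_process(doc, {"a", "b", "c"}, lambda n: flat.append(n.tag))
print(build_select(["a", "b", "c"]))
print(recursive == flat)
```

Both strategies do the same work per matched node; the difference being measured in the timings is only in how the processor gets to those nodes.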

One unrelated and largely cosmetic change: the schematron implementation needs to generate unique priorities for each template that it generates. The exact numbers do not matter, just the relative ordering. The existing ISO implementation counts down from 4000. The 4000 always bothered me (I don't know why, really), so I have changed it here to count up from 1000 (keeping the relative ordering). Strictly speaking, counting down from 4000 isn't safe, as if you have 4000 templates the priorities specified would be less than the priorities specified on default templates in the same mode; however, the real reason is cosmetic. (I was tempted to count up from 1 and lower the default templates more, but I started from 1000, just to give them all 4-digit numbers.)

Results for one test document

The following script was used. It's a bash script but can just be viewed as a sequence of command line instructions, and equivalent code could be used in any operating system.

Run 1 uses the current schematron beta skeleton as a base comparison; runs 2 to 7 use the modified implementation with various combinations of the parameters described above.

The schematron output (log1.txt ... log7.txt) is checked with the unix diff command: no output here is good, which means that the output is the same in each case. The -3 flag to saxon causes it to run the code three times and report the average time (thus avoiding measuring JVM startup and stylesheet compilation times).

#!/bin/bash

rm temp?.xsl
saxon8 -novw -o temp1.xsl dpc.sch \
   iso_schematron_skeleton.xsl
saxon8 -novw -o temp2.xsl dpc.sch \
   dpc_schematron_skeleton.xsl
saxon8 -novw -o temp3.xsl dpc.sch \
   dpc_schematron_skeleton.xsl visit-text=true
saxon8 -novw -o temp4.xsl dpc.sch \
   dpc_schematron_skeleton.xsl select-contexts=key
saxon8 -novw -o temp5.xsl dpc.sch \
   dpc_schematron_skeleton.xsl visit-text=true \
                               select-contexts=key
saxon8 -novw -o temp6.xsl dpc.sch \
   dpc_schematron_skeleton.xsl select-contexts=//
saxon8 -novw -o temp7.xsl dpc.sch \
   dpc_schematron_skeleton.xsl visit-text=true \
                               select-contexts=//

rm log?.txt logg?.txt
saxon8 -3 -novw -o log1.txt book.xml temp1.xsl 2> logg1.txt
saxon8 -3 -novw -o log2.txt book.xml temp2.xsl 2> logg2.txt
saxon8 -3 -novw -o log3.txt book.xml temp3.xsl 2> logg3.txt
saxon8 -3 -novw -o log4.txt book.xml temp4.xsl 2> logg4.txt
saxon8 -3 -novw -o log5.txt book.xml temp5.xsl 2> logg5.txt
saxon8 -3 -novw -o log6.txt book.xml temp6.xsl 2> logg6.txt
saxon8 -3 -novw -o log7.txt book.xml temp7.xsl 2> logg7.txt

echo diff
diff log1.txt log2.txt
diff log1.txt log3.txt
diff log1.txt log4.txt
diff log1.txt log5.txt
diff log1.txt log6.txt
diff log1.txt log7.txt

echo grep
grep Average logg*txt

And the results are:

diff
grep
logg1.txt:*** Average execution time over 3 runs: 1078ms
logg2.txt:*** Average execution time over 3 runs: 1052ms
logg3.txt:*** Average execution time over 3 runs: 1062ms
logg4.txt:*** Average execution time over 3 runs: 21843ms
logg5.txt:*** Average execution time over 3 runs: 21885ms
logg6.txt:*** Average execution time over 3 runs: 1047ms
logg7.txt:*** Average execution time over 3 runs: 995ms
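
To put the numbers side by side, the short Python sketch below parses the grep output above and expresses each run relative to run 1. The timings are copied verbatim from the listing; nothing else is measured.

```python
import re

LOG = """\
logg1.txt:*** Average execution time over 3 runs: 1078ms
logg2.txt:*** Average execution time over 3 runs: 1052ms
logg3.txt:*** Average execution time over 3 runs: 1062ms
logg4.txt:*** Average execution time over 3 runs: 21843ms
logg5.txt:*** Average execution time over 3 runs: 21885ms
logg6.txt:*** Average execution time over 3 runs: 1047ms
logg7.txt:*** Average execution time over 3 runs: 995ms
"""

# Capture the run number from the file name and the time in milliseconds.
PATTERN = re.compile(r"logg(\d+)\.txt:.*?: (\d+)ms")

times = {int(run): int(ms) for run, ms in PATTERN.findall(LOG)}
baseline = times[1]
for run in sorted(times):
    print("run %d: %5dms  (%.2f x run 1)" % (run, times[run], times[run] / baseline))
```

The ratio for runs 4 and 5 comes out at a little over 20, which is the "around 20 times longer" figure quoted for the key option.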

The default behaviour (2) of this script is a little faster than (1), probably because it visits fewer text nodes, more fully implementing the current standard. When all text nodes are visited in run (3) it's slightly slower. As noted above, (4) and (5) are unusable; unless this turns out to be a bug in either my code or the XSLT processor, this indicates that the 'key' option should be removed from any final release. Runs 6 and 7, using // and respectively not visiting or visiting text nodes, turn out to be the fastest here, but probably not sufficiently different to make it worth having this as an option in a final release, unless perhaps other people with real documents who have found the classic recursive method slow report this as faster. (I suspect some documents will be faster, some slower, depending greatly on the documents and the underlying XSLT processor.)

Licence

Most of the code in the modified skeleton is Copyright Rick Jelliffe, and the code is distributed under the same licence as his implementation of which this is just a minor modification. The comments at the top of the file say:

<!-- DPC
Modified schematron skeleton code by David Carlisle.
http://dpcarlisle.blogspot.com/search/label/schematron

All modifications are marked with XML comments starting
with " DPC".

The majority of the code is copyright
Rick Jelliffe and Academia Sinica Computing Center
distributed under the license below, see "LEGAL NOTICE".
The modified sections of code are the work of David Carlisle and are
distributed under the same licence. Rick Jelliffe or others are free to
incorporate David Carlisle's code back in to other schematron implementations
which use this licence, for any such incorporation, specific attribution is
not required, although the attribution style used for earlier contribution
is appreciated.

This is a test implementation for discussion on the schematron mailing list
suggesting some possible changes and enhancements for a schematron implementation.
People are welcome to try the code and comment, but for production use you are
strongly advised to use the "unofficial reference" implementation from schematron.org
This is not intended to be a long term "competing" implementation.
-->

Tuesday 10 April 2007

XHTML and MathML from Office 2007

All code and examples are now available from google code repository 2013/09/26

xhtml-mathml stylesheet updated 2007/05/09

I commented on Murray's blog that it ought to be possible to get XHTML/MathML documents out of Word. Having speculated that it ought to be possible, conscience dictated that I try to do it. The results are described below.

The first problem was that I'd never used Word (or any other WYSIWYG editor). It seems strange to me, but then I've been using emacs so long I'm probably corrupted.

Word 2007 has MathML input/output (via an XSL stylesheet installed with the system), and has HTML input/output (via its save as web page file menu), so the plan of action is: save the document as html, clean it up to xhtml, using the stylesheet to convert the mathematics to MathML at the same time.

  1. Write your document in Word 2007, save as web page file.htm .
  2. Use tagsoup to get some usable XML from this output. java -jar tagsoup-1.1.jar --lexical --output-encoding=iso-8859-1 file.htm > temp.xml
  3. Use the supplied xhtml-mathml stylesheet to do some further cleanup and apply the Microsoft supplied omml2mml.xsl stylesheet to the math fragments. java -jar saxon8.jar -o file.xml temp.xml xhtml-mathml.xsl
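
Steps 2 and 3 can be chained in a small driver. The Python sketch below only wraps the two command lines given above; the jar and stylesheet names are the ones from the post and are assumed to be in the current directory, and the function names are invented for the example.

```python
import subprocess


def build_commands(stem):
    """Return the TagSoup and Saxon command lines for the Word output `stem`.htm."""
    tagsoup = ["java", "-jar", "tagsoup-1.1.jar", "--lexical",
               "--output-encoding=iso-8859-1", stem + ".htm"]
    saxon = ["java", "-jar", "saxon8.jar", "-o", stem + ".xml",
             "temp.xml", "xhtml-mathml.xsl"]
    return tagsoup, saxon


def convert(stem):
    """Run TagSoup, capturing its standard output in temp.xml, then run Saxon."""
    tagsoup, saxon = build_commands(stem)
    with open("temp.xml", "wb") as out:
        subprocess.run(tagsoup, stdout=out, check=True)
    subprocess.run(saxon, check=True)
```

With Java and both jars available, convert("file") would turn file.htm into file.xml; check=True makes either step stop the run if the tool reports an error.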

Example:

  • Word document (docx)
  • "HTML" generated by Word
  • XHTML + MathML

Comments welcome!

Monday 9 April 2007

testing the interface

Just testing...

I probably won't be posting too much MathML here, but I wanted to check it was possible. Unfortunately I think blogger always serves as text/html, but Peter Jipsen's script seems to work well in this context. Currently it requires a Gecko browser such as Firefox, or IE with MathPlayer installed. Hopefully I can extend the script to allow some basic CSS rendering in other browsers later.

I also needed to check I could write posts in emacs of course!

Finally some examples using the MathML in HTML script. 12

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

[update, continuous Relax NG validation of the mathml in xhtml in atom markup as I type into this emacs buffer. Isn't emacs wonderful. Will post schemas used shortly.]

Sunday 25 March 2007

Hello World

I thought I'd start a blog.....