Saturday, 14 April 2007

Schematron Experiments

Rick Jelliffe, acting as editor of ISO Schematron has a call for corrections and improvements to schematron, and also is working on updating the XSLT skeleton implementation, see this thread on the schematron-love list. This got me looking at schematron again, after a gap of some years...

The one suggestion that I made for a specification change is to allow schematron contexts to match text nodes. The current situation is very strange, schematron contexts (in XSLT1 or XSLT2 bindings) have (exactly) the syntax of XSLT patterns, but with an extra semantic restriction (not made explicitly in the specification) that any text nodes that are matched are ignored. The reference schematron implementation (current and previous version as far as I have seen) have implemented this inconsistently, applying schematron rules to text nodes (due to the fact that the default template for each pattern uses selects on node(), but not if the parent of the text node is the context of a schematron rule as then the recursion is via *|comment()|processing-instruction(). Rick has put this forward as an official change request to the working group, but seems unconvinced by the change, we'll see!

Meanwhile, looking at the code, I've not looked at the schematron implementation since I made the schematron-report (which google tells me was back in 1999, doesn't time fly:-) I think it's changed quite a bit since then, especially to accommodate the "skeleton" paradigm, but the basics are the same.

A trial implementation

There's been quite a bit of discussion on the schematron-love list about implementation changes, and posted code, but a mailing list isn't always the best place to discuss code samples, so I thought I'd try posting here.

Schematron allows you to specify several patterns and each rule has a context at which point several assertions can be made about XPath expressions. As noted above a context is more or less identical to an XSLT pattern, and the implementation of each rule in a pattern is as an XSLT template with a unique priority and a mode corresponding to the pattern to be used. For each pattern in the schematron schema, a full recursive tree walk is made via templates which process a node and then recursively process attributes and children. There are optimisation flags to try to speed this up, in particular by omitting attribute processing, and the specification that text nodes should not be visited, is partly due to a concern over this processing speed.

I argued on the list that visiting text nodes were not a serious time concern compared to other issues, and certainly not enough to distort the language. But for once I decided to heed Mike Kay's oft cited advice on xsl-list that performance questions should be answered by measuring timings.

I have made a modified schematron skeleton implementation that takes a number of parameters that control the way that the schematron patterns are implemented. Also available is a test document and test schema used in the test below.

The parameters controlling the generation of the XSLT are described below.

visit-text
This parameter defaults to 'false' at which setting text nodes are not visited when establishing the contexts for schematron patterns. This is the behaviour currently mandated by the specification. If it is set to 'true' text nodes are visited which is the behaviour but forward as a suggestion for a future schematron update. In some cases omitting text nodes is implemented by selecting *|comment()|processing-instruction() rather than node() (this does appear to save some time, but in other cases it is necessary to add a filter [not(self::text())] to explicitly skip text nodes, in which case not visiting text nodes may take extra time due to the extra check. Compare the times for 2 and 3 and 6 and 7 below. In all cases though on this test file the difference is only around 50ms in 1000, so not really significant enough to affect the language design.
attributes
Normally if recursing down the tree both the attribute and child nodes are selected, setting this to 'false' will cause the attribute to be omitted (the existing iso_schematron_skeleton has a similar parameter) one difference is that this defaults to 'true' if it is safe to do so (if no context in the schema has contains '@' or 'attribute').
only-child-elements
Similar to the attributes parameter above, this causes a test of node() to be changed to * so that just elements are visited on the child axis. Again this defaults to 'true' if it is safe to do so (no context contains a '(').
select-contexts
Perhaps the most radical option. Traditionally schematron has been implemented by a traditional walk over the tree where each step is accomplished by a template which process a node, and then recursively applies templates on child nodes and attributes. However this recursion is tail recursive and so is effectively an iteration, so an alternative implementation strategy is to first select all the nodes that need to be processed, and then process them with a non-recursive template that process a single node and does not call apply-templates.
This defaults to '', which enables the classic recursive behaviour.
If set to '//' then a an XPath is constructed from all the rule contexts in a pattern and used in a single select, so if a pattern has rules with contexts a b and c a select="//(a|b|c)" is generated. This method (on this test) leads to the quickest processor, but the difference over the default behaviour is not that great, and unfortunately it requires XSLT2 binding to allow the general step using (). To generate an XPath 1 expression would require too much parsing of the context than is really convenient in XSLT.
If set to 'key' the behaviour is similar to the behaviour with '//' but rather than use an XPath using // an XSLT key is generated. This has the advantage that it is much easier to generate legal XSLT1, and my intuition was that this would be the fastest so long as memory for the key did not get out of hand. It seems that my intuition here was wildly, spectacularly wrong. As can be seen from the timings below, runs 4 and 5, using this key setting take around 20 times longer.

One unrelated and largely cosmetic change, the schematron implementation needs to generate unique priorities for each template that it generates. The exact numbers do not matter, just the relative ordering. The existing ISO implementation counts down from 4000. the 4000 always bothered me (I don't know why, really) so I have changed it here to count up from 1000 (keeping the relative ordering). Strictly speaking, counting down from 4000 isn't safe as if you have 4000 templates the priorities specified would be less than the priorities specified on default templates in the same mode, however the real reason is cosmetic. (I was tempted to count up from 1 and lower the default templates more, but I stared from 1000, just to give them all 4 digit numbers.

Results for one test document

The following script was used, it's a bash script but can just be viewed as a sequence of command line instructions and equivalent code could be used in any operating system

Run 1 uses the current schematron beta skeleton, as a base comparison., runs 2 to 7 uses the modified implementation using various combinations of the parameters described above.

the schematron output (log1.txt ... log7.txt) is checked with the unix diff command: no output here is good, which means that the output is the same in each case. the -3 flag to saxon causes it to run the code three times and report the average time (thus avoiding measuring JVM startup and stylesheet compilation times).

#!/bin/bash

rm temp?.xsl
saxon8 -novw -o temp1.xsl dpc.sch
   iso_schematron_skeleton.xsl
saxon8 -novw -o temp2.xsl dpc.sch
   dpc_schematron_skeleton.xsl
saxon8 -novw -o temp3.xsl dpc.sch
   dpc_schematron_skeleton.xsl visit-text=true
saxon8 -novw -o temp4.xsl dpc.sch
   dpc_schematron_skeleton.xsl select-contexts=key
saxon8 -novw -o temp5.xsl dpc.sch
   dpc_schematron_skeleton.xsl visit-text=true
                               select-contexts=key
saxon8 -novw -o temp6.xsl dpc.sch
   dpc_schematron_skeleton.xsl select-contexts=//
saxon8 -novw -o temp7.xsl dpc.sch
   dpc_schematron_skeleton.xsl visit-text=true
                               select-contexts=//

rm log?.txt logg?.txt
saxon8 -3 -novw -o log1.txt book.xml temp1.xsl 2> logg1.txt
saxon8 -3 -novw -o log2.txt book.xml temp2.xsl 2> logg2.txt
saxon8 -3 -novw -o log3.txt book.xml temp3.xsl 2> logg3.txt
saxon8 -3 -novw -o log4.txt book.xml temp4.xsl 2> logg4.txt
saxon8 -3 -novw -o log5.txt book.xml temp5.xsl 2> logg5.txt
saxon8 -3 -novw -o log6.txt book.xml temp6.xsl 2> logg6.txt
saxon8 -3 -novw -o log7.txt book.xml temp7.xsl 2> logg7.txt

echo diff
diff log1.txt log2.txt
diff log1.txt log3.txt
diff log1.txt log4.txt
diff log1.txt log5.txt
diff log1.txt log6.txt
diff log1.txt log7.txt

echo grep
grep Average logg*txt

And the results are:

diff
grep
logg1.txt:*** Average execution time over 3 runs: 1078ms
logg2.txt:*** Average execution time over 3 runs: 1052ms
logg3.txt:*** Average execution time over 3 runs: 1062ms
logg4.txt:*** Average execution time over 3 runs: 21843ms
logg5.txt:*** Average execution time over 3 runs: 21885ms
logg6.txt:*** Average execution time over 3 runs: 1047ms
logg7.txt:*** Average execution time over 3 runs: 995ms

The default behaviour (2) of this script is a little faster than (1), probably because it visits fewer text nodes to more fully implement the current standard, When all text nodes are visited in run (3) it's slightly slower. As noted above (4) and (5) are unusable, unless this turns out to be a bug in either my code or the XSLT processor, this indicates that the 'key' option should be removed from any final release. 6 and 7 using // and not visiting or visiting text nodes turns out to be the fastest here, but probably not sufficiently different to make it worth having this as an option in a final release, unless perhaps other people with real documents who have found the classic recursive method slow report this as faster. (I suspect some documents will be faster, some slower, depending greatly on the documents and the underlying XSLT processor.

Licence

Most of the code in the modified skeleton is Copyright Rick Jelliffe, and the code is distributed under the same licence as his implementation of which this is just a minor modification. The comments at the top of the file say:

<!-- DPC
Modified schematron skeleton code by David Carlisle.
http://dpcarlisle.blogspot.com/search/label/schematron

All modifications are marked with XML comments starting
with " DPC".

The majority of the code is copyright
Rick Jelliffe and Academia Sinica Computing Center
distributed under the license below, see "LEGAL NOTICE".
The modified sections of code are the work of David Carlisle and are
distributed under the same licence. Rick Jelliffe or others are free to
incorporate David Carlisle's code back in to other schematron implementations
which use this licence, for any such incorporation, specific attribution is
not required, although the attribution style used for earlier contribution
is appreciated.

This is a test implementation for discussion on the schematron mailing list
suggesting some possible changes and enhancements for a schematron implementation.
People are welcome to try the code and comment, but for production use you are
strongly advised to use the "unofficial reference" implementation from schematron.org
This is not intended to be a long term "competing" implementation.
-->

7 comments:

Anonymous said...

Interesting.

I am currently doing some refactoring to Gestalt, to overcome a serious bug, but when I have finished, I'll try running your script using it, to see if I can detect any processor biases in the options (or indeed in ISO Schematron implementation - I do run it against the 1.5 (modified - to make it valid XSLT) reference implementation.

Colin Adams

David said...

Thanks, let me know of any problems, either here or on the schematron list (or anywhere else:-)

I must try gestalt. I'd sort of left it as the documentation implied it was very much under development, but I note that the page I was looking at says

Last Updated: Tuesday, August 31st, 2004

and I gather from comments from you (and dimitre) in other forums that actually it's getting to be a pretty capable xslt2 engine these days.

Anonymous said...

Having finally finished a huge refactoring (which has fixed a number of tail-recursion problems), I tried it out.

The first error is it requires exslt:node-set.
of course, i do not implement this, as it is useless in XSLT 2.0.

I have unsubscribed from the schematron love-in, as I never received any messages other than yahoo complaints.

I still read it online sometimes. I would subscribe again if it were not on yahoo.
Colin Adams

David said...

Colin, I don't see any dependency on
node-set...

ah. It has (inherited from Rick's code) a guard of

<xsl:when test="function-available('exsl:node-set')">

in the one place it uses it, but I suppose for 2.0 that isn't enough and there should be a use-when guard as well to stop compile time errors. I will fix that.

schematron-love-in isn't a yahoo list it's a mailman list managed from eccnet.com I must say though I had a lot of trouble subscribing, neither of the usual public accounts I use were accepted, so i ended up subscribing from a gmail account. It's very low volume though in any case.

David said...

I've added a use-when test and rejigged that code slightly. I don't think it has any dependencies on extensions now.


Updated at same URI as before.

Anonymous said...

Thanks.
That exposed a problem in my implementation which was easily fixed.

Now the problem is that the generated transformation requires the namespace axis, and yet is for version 2.0.

This is a bug as the namespace axis is optional and deprecated in 2.0, and therefore I do not support it.

Either the proper XPath 2.0 functions should be used, or a 1.0 compatilibilty-mode stylesheet generated.
Of course, the first option is much better.


Colin Adams

David said...

grr I commented out the offending use.

I think though that the decision not to support the namespace axis is a mistake. It's optional as a political concession towards XQuery (I'd guess) but it's useful, hard to replicate in all circumstances with the new inscope namespace functions, and not having it in your implementation will mean that the there will always be interoperability problems, I don't blame you for that I blame the WGs, (optional features are almost always a sign of an over large committee failing to agree about something) but I hope that implementers lessen the damage by implementing the optional features...

Still as I say, I zapped the namespace in this case, because whatever I say above the spec is unambiguous, and you are perfectly correct to state that a stylesheet intended to run with a 2.0 system shouldn't rely on it.