Rick Jelliffe, acting as editor of ISO Schematron has a call for corrections and improvements to schematron, and also is working on updating the XSLT skeleton implementation, see this thread on the schematron-love list. This got me looking at schematron again, after a gap of some years...
The one suggestion that I made for a specification change is to allow schematron contexts to match text nodes. The current situation is very strange, schematron contexts (in XSLT1 or XSLT2 bindings) have (exactly) the syntax of XSLT patterns, but with an extra semantic restriction (not made explicitly in the specification) that any text nodes that are matched are ignored. The reference schematron implementation (current and previous version as far as I have seen) have implemented this inconsistently, applying schematron rules to text nodes (due to the fact that the default template for each pattern uses selects on node(), but not if the parent of the text node is the context of a schematron rule as then the recursion is via *|comment()|processing-instruction(). Rick has put this forward as an official change request to the working group, but seems unconvinced by the change, we'll see!
Meanwhile, looking at the code, I've not looked at the schematron implementation since I made the schematron-report (which google tells me was back in 1999, doesn't time fly:-) I think it's changed quite a bit since then, especially to accommodate the "skeleton" paradigm, but the basics are the same.
A trial implementation
There's been quite a bit of discussion on the schematron-love list about implementation changes, and posted code, but a mailing list isn't always the best place to discuss code samples, so I thought I'd try posting here.
Schematron allows you to specify several patterns and each rule has a context at which point several assertions can be made about XPath expressions. As noted above a context is more or less identical to an XSLT pattern, and the implementation of each rule in a pattern is as an XSLT template with a unique priority and a mode corresponding to the pattern to be used. For each pattern in the schematron schema, a full recursive tree walk is made via templates which process a node and then recursively process attributes and children. There are optimisation flags to try to speed this up, in particular by omitting attribute processing, and the specification that text nodes should not be visited, is partly due to a concern over this processing speed.
I argued on the list that visiting text nodes were not a serious time concern compared to other issues, and certainly not enough to distort the language. But for once I decided to heed Mike Kay's oft cited advice on xsl-list that performance questions should be answered by measuring timings.
I have made a modified schematron skeleton implementation that takes a number of parameters that control the way that the schematron patterns are implemented. Also available is a test document and test schema used in the test below.
The parameters controlling the generation of the XSLT are described below.
- visit-text
- This parameter defaults to 'false' at which setting text nodes are not visited when establishing the contexts for schematron patterns. This is the behaviour currently mandated by the specification. If it is set to 'true' text nodes are visited which is the behaviour but forward as a suggestion for a future schematron update. In some cases omitting text nodes is implemented by selecting *|comment()|processing-instruction() rather than node() (this does appear to save some time, but in other cases it is necessary to add a filter [not(self::text())] to explicitly skip text nodes, in which case not visiting text nodes may take extra time due to the extra check. Compare the times for 2 and 3 and 6 and 7 below. In all cases though on this test file the difference is only around 50ms in 1000, so not really significant enough to affect the language design.
- attributes
- Normally if recursing down the tree both the attribute and child nodes are selected, setting this to 'false' will cause the attribute to be omitted (the existing iso_schematron_skeleton has a similar parameter) one difference is that this defaults to 'true' if it is safe to do so (if no context in the schema has contains '@' or 'attribute').
- only-child-elements
- Similar to the attributes parameter above, this causes a test of node() to be changed to * so that just elements are visited on the child axis. Again this defaults to 'true' if it is safe to do so (no context contains a '(').
- select-contexts
- Perhaps the most radical option. Traditionally schematron has been implemented by a traditional walk over the tree where each step is accomplished by a template which process a node, and then recursively applies templates on child nodes and attributes. However this recursion is tail recursive and so is effectively an iteration, so an alternative implementation strategy is to first select all the nodes that need to be processed, and then process them with a non-recursive template that process a single node and does not call apply-templates.
- This defaults to '', which enables the classic recursive behaviour.
- If set to '//' then a an XPath is constructed from all the rule contexts in a pattern and used in a single select, so if a pattern has rules with contexts a b and c a select="//(a|b|c)" is generated. This method (on this test) leads to the quickest processor, but the difference over the default behaviour is not that great, and unfortunately it requires XSLT2 binding to allow the general step using (). To generate an XPath 1 expression would require too much parsing of the context than is really convenient in XSLT.
- If set to 'key' the behaviour is similar to the behaviour with '//' but rather than use an XPath using // an XSLT key is generated. This has the advantage that it is much easier to generate legal XSLT1, and my intuition was that this would be the fastest so long as memory for the key did not get out of hand. It seems that my intuition here was wildly, spectacularly wrong. As can be seen from the timings below, runs 4 and 5, using this key setting take around 20 times longer.
One unrelated and largely cosmetic change, the schematron implementation needs to generate unique priorities for each template that it generates. The exact numbers do not matter, just the relative ordering. The existing ISO implementation counts down from 4000. the 4000 always bothered me (I don't know why, really) so I have changed it here to count up from 1000 (keeping the relative ordering). Strictly speaking, counting down from 4000 isn't safe as if you have 4000 templates the priorities specified would be less than the priorities specified on default templates in the same mode, however the real reason is cosmetic. (I was tempted to count up from 1 and lower the default templates more, but I stared from 1000, just to give them all 4 digit numbers.
Results for one test document
The following script was used, it's a bash script but can just be viewed as a sequence of command line instructions and equivalent code could be used in any operating system
Run 1 uses the current schematron beta skeleton, as a base comparison., runs 2 to 7 uses the modified implementation using various combinations of the parameters described above.
the schematron output (log1.txt ... log7.txt) is checked with the unix diff command: no output here is good, which means that the output is the same in each case. the -3 flag to saxon causes it to run the code three times and report the average time (thus avoiding measuring JVM startup and stylesheet compilation times).
#!/bin/bash
rm temp?.xsl
saxon8 -novw -o temp1.xsl dpc.sch
iso_schematron_skeleton.xsl
saxon8 -novw -o temp2.xsl dpc.sch
dpc_schematron_skeleton.xsl
saxon8 -novw -o temp3.xsl dpc.sch
dpc_schematron_skeleton.xsl visit-text=true
saxon8 -novw -o temp4.xsl dpc.sch
dpc_schematron_skeleton.xsl select-contexts=key
saxon8 -novw -o temp5.xsl dpc.sch
dpc_schematron_skeleton.xsl visit-text=true
select-contexts=key
saxon8 -novw -o temp6.xsl dpc.sch
dpc_schematron_skeleton.xsl select-contexts=//
saxon8 -novw -o temp7.xsl dpc.sch
dpc_schematron_skeleton.xsl visit-text=true
select-contexts=//
rm log?.txt logg?.txt
saxon8 -3 -novw -o log1.txt book.xml temp1.xsl 2> logg1.txt
saxon8 -3 -novw -o log2.txt book.xml temp2.xsl 2> logg2.txt
saxon8 -3 -novw -o log3.txt book.xml temp3.xsl 2> logg3.txt
saxon8 -3 -novw -o log4.txt book.xml temp4.xsl 2> logg4.txt
saxon8 -3 -novw -o log5.txt book.xml temp5.xsl 2> logg5.txt
saxon8 -3 -novw -o log6.txt book.xml temp6.xsl 2> logg6.txt
saxon8 -3 -novw -o log7.txt book.xml temp7.xsl 2> logg7.txt
echo diff
diff log1.txt log2.txt
diff log1.txt log3.txt
diff log1.txt log4.txt
diff log1.txt log5.txt
diff log1.txt log6.txt
diff log1.txt log7.txt
echo grep
grep Average logg*txt
And the results are:
diff
grep
logg1.txt:*** Average execution time over 3 runs: 1078ms
logg2.txt:*** Average execution time over 3 runs: 1052ms
logg3.txt:*** Average execution time over 3 runs: 1062ms
logg4.txt:*** Average execution time over 3 runs: 21843ms
logg5.txt:*** Average execution time over 3 runs: 21885ms
logg6.txt:*** Average execution time over 3 runs: 1047ms
logg7.txt:*** Average execution time over 3 runs: 995ms
The default behaviour (2) of this script is a little faster than (1), probably because it visits fewer text nodes to more fully implement the current standard, When all text nodes are visited in run (3) it's slightly slower. As noted above (4) and (5) are unusable, unless this turns out to be a bug in either my code or the XSLT processor, this indicates that the 'key' option should be removed from any final release. 6 and 7 using // and not visiting or visiting text nodes turns out to be the fastest here, but probably not sufficiently different to make it worth having this as an option in a final release, unless perhaps other people with real documents who have found the classic recursive method slow report this as faster. (I suspect some documents will be faster, some slower, depending greatly on the documents and the underlying XSLT processor.
Licence
Most of the code in the modified skeleton is Copyright Rick Jelliffe, and the code is distributed under the same licence as his implementation of which this is just a minor modification. The comments at the top of the file say:
<!-- DPC
Modified schematron skeleton code by David Carlisle.
http://dpcarlisle.blogspot.com/search/label/schematron
All modifications are marked with XML comments starting
with " DPC".
The majority of the code is copyright
Rick Jelliffe and Academia Sinica Computing Center
distributed under the license below, see "LEGAL NOTICE".
The modified sections of code are the work of David Carlisle and are
distributed under the same licence. Rick Jelliffe or others are free to
incorporate David Carlisle's code back in to other schematron implementations
which use this licence, for any such incorporation, specific attribution is
not required, although the attribution style used for earlier contribution
is appreciated.
This is a test implementation for discussion on the schematron mailing list
suggesting some possible changes and enhancements for a schematron implementation.
People are welcome to try the code and comment, but for production use you are
strongly advised to use the "unofficial reference" implementation from schematron.org
This is not intended to be a long term "competing" implementation.
-->