Monday, 23 April 2007

htmlparse, updated

There are two well-known open source tools for cleaning up and fixing HTML to produce XML suitable for further processing: tidy and TagSoup.

Back in 2004 I posted some XSLT code, htmlparse.xsl, to do a similar job and used it in a few examples on the xsl-list.

When I needed such a tool for the MathML from Office post, my first thought was to use the htmlparse stylesheet. Unfortunately the output from Word was too much for the stylesheet, which couldn't cope with the Microsoft conditional declarations.

However, it turned out not to be hard to extend the stylesheet, so I have updated htmlparse.xsl to fix a couple of bugs in attribute handling and to cope with <![if !vml]> and friends. This now gives a pure XSLT route from Word output to XHTML+MathML. Using the same test file as the earlier post, the stylesheet:

<xsl:stylesheet 
   version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:d="data:,dpc"> 
  <xsl:import href="../htmlparse/htmlparse.xsl"/>
  <xsl:import href="xhtml-mathml.xsl"/>
  <xsl:template name="main">
    <xsl:apply-templates select="d:htmlparse(
              replace(
                  unparsed-text('test1.htm','windows-1252'),
                  'http://www.w3.org/TR/REC-html40',
                  'http://www.w3.org/1999/xhtml'))"/>
  </xsl:template>
</xsl:stylesheet>

Produces an XHTML+MathML document as shown.

Although it's used here to convert an entire document, TagSoup, which interfaces to XSLT as a SAX parser, is probably preferable in this context. htmlparse (originally written just as an exercise in XSLT2 regular expressions) is probably more useful for parsing small fragments of HTML, as often found in query strings or embedded in CDATA in XML files. It has the advantage of being pure XSLT, not requiring any extension functions or non-standard parser usage.
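
As a rough sketch of the fragment use case, the one-argument form of d:htmlparse can be called directly on a string. The import path is the one used in the stylesheet above; the sample fragment and the wrapper element name `fragment` are made up for this illustration, and the exact shape of the result depends on the version of htmlparse.xsl in use.

```xslt
<xsl:stylesheet
   version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:d="data:,dpc"
   exclude-result-prefixes="d">

  <!-- Assumes htmlparse.xsl is at the same relative path as in the post -->
  <xsl:import href="../htmlparse/htmlparse.xsl"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template name="main">
    <!-- 'fragment' is a hypothetical wrapper so the result has one root -->
    <fragment>
      <!-- Sample tag soup: unclosed p, br and b elements -->
      <xsl:copy-of
         select="d:htmlparse('&lt;p>one&lt;br>two &lt;b>bold')"/>
    </fragment>
  </xsl:template>
</xsl:stylesheet>
```

With Saxon this could be run from the named template, for example `saxon9 -it main fragment.xsl`, where fragment.xsl is whatever name the sketch is saved under.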

2009-12-18: updated location of htmlparse to Google Code

23 comments:

Michael Day said...

There is also xmllint from libxml2, which includes a HTML parser.

David said...

oops, thanks for the correction, I should have listed that as well (I did know about it, or have known about it), apologies to Daniel Veillard!

Anonymous said...

Or if you fancy some Ruby, Hpricot is all the rage:

Hpricot(open(some_url), :xhtml_strict => true)

Jesper Tverskov said...

I have tested htmlparse.xsl to see if it can make documents made with Google's Writely well-formed.

It works very well except for tables:

Almost any standard Writely table contains an attribute looking like this:

borderColor=#000000

and that is, I think what makes htmlparse.xsl fail.

If I add quotes: borderColor="#000000", htmlparse.xsl works.

I find it extremely useful to have an XSLT stylesheet that can make even the most dirty markup well-formed, and I hope that htmlparse.xsl will be updated.

Cheers,
Jesper Tverskov

David said...

ohh, could you add #? just before the \c so that htmlparse.xsl looks like

<xsl:variable name="d:attr"
select="'(\i\c*)\s*(=\s*(&quot;[^&quot;]*&quot;|''[^'']*''|#?\c+))?\s*'"/>

and let me know if that works, it seemed to work on a couple of cases
with an un quoted colour spec that I tried...
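
A quick way to try the pattern without htmlparse.xsl is a small standalone stylesheet along these lines. This is only a sketch: the variable names `attr` and `sample` and the sample attribute string are made up for the test, and it exercises the regex on its own, not the rest of the parser.

```xslt
<xsl:stylesheet
   version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- The attribute pattern with the optional # before the final \c+ -->
  <xsl:variable name="attr"
     select="'(\i\c*)\s*(=\s*(&quot;[^&quot;]*&quot;|''[^'']*''|#?\c+))?\s*'"/>

  <!-- Hypothetical input: one unquoted colour, one quoted value -->
  <xsl:variable name="sample"
     select="'borderColor=#000000 width=&quot;100&quot;'"/>

  <xsl:template name="main">
    <xsl:analyze-string select="$sample" regex="{$attr}">
      <!-- Group 1 is the attribute name, group 3 the raw value -->
      <xsl:matching-substring>
        <xsl:value-of
           select="concat(regex-group(1), ' -> ', regex-group(3), '&#10;')"/>
      </xsl:matching-substring>
      <!-- Anything the pattern skips over is reported here -->
      <xsl:non-matching-substring>
        <xsl:value-of select="concat('unparsed: ', ., '&#10;')"/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

With the #? in place the first attribute should be reported with the value #000000. Taking the #? out again should leave the = and the colour behind as unparsed text, since \i does not match # or a digit.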

thanks for your interest,

David

Jesper Tverskov said...

Yes, #? works well.

Jesper Tverskov said...

If I use escaped markup d:htmlparse returns escaped "less than" as "...amp;#60;" and escaped "greater than" as escaped "greater than".

I don't like this "...amp;#60;"!

David said...

down at the end of the file you'll find a list of entity definitions. lt is double escaped (which matches its definition in XML, but I think is probably wrong here), and amp seems to be missing altogether.

It's a bit late for thinking, but I think you probably want

[entity name="lt">@#60;[/entity>
[entity name="amp">@#38;[/entity>

(using [ for < and @ for & as blogger doesn't allow tags here...)
at least that does the right thing on an example I tried...

If people are using htmlparse, I probably ought to move it to a more sensible code site, Google Code or something...

Jesper Tverskov said...

The original htmlparse.xsl contains:

[entity name="lt">@#38;#60;[/entity>

So we just need to delete #38;

Both "amp" and "apos" are not listed, but I have a feeling that they are known in advance?

David said...

yes that's the change to lt.
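
To see why the double escaping matters, here is a minimal standalone sketch comparing the two definitions. The variable name `entities` and the names lt-old and lt-new are made up for the comparison; only the entity element and its content follow the list in htmlparse.xsl.

```xslt
<xsl:stylesheet
   version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Old (double escaped) and new definitions side by side -->
  <xsl:variable name="entities">
    <entity name="lt-old">&#38;#60;</entity>
    <entity name="lt-new">&#60;</entity>
  </xsl:variable>

  <xsl:template name="main">
    <xsl:for-each select="$entities/entity">
      <!-- Print each expansion and how many characters it contains -->
      <xsl:value-of
         select="concat(@name, ': ', ., ' (length ',
                        string-length(.), ')', '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The old definition expands to the five characters of a character reference, which is what then gets serialized with an escaped ampersand in XML output. The new definition expands to the single < character.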

I think you do need to add amp and apos to that list, they are "known in advance" by some things but not by the htmlparse parser:-( .

try a text of a &_amp; b (without the _) with and without amp in that list. if amp isn't there it will see this as a &
and quote that leaving the amp; as following text.

David said...

I updated the stylesheet on the website with these changes (and added the HTML5 uppercase entities AMP, COPY etc. at the same time).

Thanks for your comments

David

Jesper Tverskov said...

I have a strong feeling that we soon run into "too many nested function calls":

Engine name: Saxon-SA 9.1.0.6
Severity: fatal
Description: Too many nested function calls. May be due to infinite recursion.

To test it, I made a Google Writely test document of just three paragraphs of altogether some 400 words, every second word has red text color.

That is enough to get: Too many nested function calls!

What can be done?

David said...

> What can be done?

depends...


Can you mail me the html file offline?

David said...

I have tweaked the regex for unquoted attributes a bit more and tried making a document in google docs with tables and lists and colour and font changes etc.

Exported it as a zip file with a 2000 line html file inside which parses correctly as far as I can see....

The stylesheet is in the usual place but only change is to the d:attr regex

select="'(\i\c*)\s*(=\s*(&quot;[^&quot;]*&quot;|''[^'']*''|[#/%]*\c+[#/%]*\c*[#/%]*\c*[#/%]*\c*))?\s*'"

which needs to be on one line, however it looks here.

David

Jesper Tverskov said...

Here are two test documents.

http://docs.google.com/Doc?id=dfrwr5tp_23gkcgm6c5

The first works if the last of three paragraphs is deleted.

http://docs.google.com/Doc?id=dfrwr5tp_24zgbw9vf8

The second works if the last 15 words are not colored blue.

David said...

I updated the stylesheet, your files should work now.

Thanks to Michael Kay

http://sourceforge.net/mailarchive/message.php?msg_name=5B408B7F53D54AF8B87856BF2BAE70EF%40Sealion

Jesper Tverskov said...

Something is wrong. Nothing works anymore. The two original test files, see above, give same error.

Other test files that used to work now give this error message:

SystemID: C:\Inetpub\wwwroot\writely2xhtml\htmlparse.xsl
Engine name: Saxon-SA 9.1.0.6
Severity: fatal
Description: Required item type of first argument of name() is node(); supplied value has item type xs:string
Start location: 537:0
URL: http://www.w3.org/TR/xpath20/#ERRXPTY0004

Line 537:
...xsl:element name="{if(string(@name))then @name else 'xml'}"
namespace="{$nns[name()=substring-before(current()/@name,':')][last()][not(.='data:,dpc')]}">

I use newest htmlparse.xsl:
$Id: htmlparse.xsl,v 1.30 2009/05/08 09:48:54 David Carlisle Exp $

David said...

Sorry. Try now. (I did test your document, but using the three argument form that turned off the adding of namespaces, which as luck would have it, I'd just broken by careless editing)

Jesper Tverskov said...

Still problems!

What used to work is now working again, but my two test documents still return the "too many nested function calls" error.

I use the one argument form: d:htmlparse(string)?

David said...

hmmm using

Saxon 9.1.0.1J from Saxonica
Java version 1.6.0_11


both your files run. I modified my test to pull the stylesheet and the document from the web rather than the local disk; when I run

saxon9 -it main ht6.xsl

I get result1.xml and result2.xml both well formed xml rendering in FF more or less like your original documents.

I'll put all of them on the site

http://www.dcarlisle.demon.co.uk/ht6.xsl

http://www.dcarlisle.demon.co.uk/result1.xml

http://www.dcarlisle.demon.co.uk/result2.xml

Jesper Tverskov said...

Until now I have used Saxon-SA 9.1.0.6 from inside Oxygen 10.2 and load directly from Google Docs. I do a little pre-cleaning before htmlparse.xsl but it is only about nbsp in footnotes and about trailing spaces in some elements.

I need a good night's sleep but will try command line transformation tomorrow.

Jesper Tverskov said...

Strange!

All my own files work as they should from the command line with newest Saxon B.

In Oxygen 10.2 with newest Saxon SA, I get the "too many nested" problem with the two test files.

Tried my stylesheets in XMLSpy using AltovaXML: XMLSpy freezes!

What is the likely problem in Oxygen?

Jesper Tverskov said...

I have made both test files 10 times longer and they are still transformed correctly with Saxon at the command line.

But not with Saxon from inside Oxygen or from inside Stylus Studio and not with AltovaXML from inside XMLSpy.

I have filed the bug at Oxygen, Stylus Studio and XMLSpy, and I will report back when they have solved the issue.