Thursday, March 3, 2011

How can I prevent XML::XPath from fetching a DTD while processing an XML file?

My XML starts like this

$ cat a.xhtml

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...

My code starts like this

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => "a.xhtml");
my $nodeset = $xp->find('/html/body//table');

It's very slow, and it turns out that it spends a lot of time getting the DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).

Is there a way to explicitly declare an HTTP proxy server for the Perl XML:: family of modules? I would rather not modify the original a.xhtml document, for example to make it point at a local copy of the DTD.

Thanks

From stackoverflow
  • Usually it's done by setting up local XML catalog.

    libxml-based parsers support it, so if you follow mirod's advice, you'll be able to get named entities and validation to work without network access.

    mirod : True. You could probably use XML::Catalog to add a catalog to an XML::Parser object, and use that parser in XML::XPath's new. I have never tested that though.
  • XML::XPath is based on XML::Parser. XML::Parser has an option to NOT use LWP to resolve external entities (such as DTDs), and XML::XPath lets you pass in an XML::Parser object to use as the parser.

    So you can write this:

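    use XML::Parser;
    use XML::XPath;

    # NoLWP => 1 stops XML::Parser from using LWP to fetch external entities
    my $p  = XML::Parser->new( NoLWP => 1 );
    my $xp = XML::XPath->new( parser => $p, filename => "a.xhtml" );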
    

    Note that in this case you will lose all entities except numeric ones and the predefined ones (&gt;, &lt;, &amp;, &apos; and &quot;). The parser will not complain, but they will disappear silently (try including &alpha; in the table and printing it, for example).

    As a matter of fact you probably should not use XML::XPath, which is not actively maintained.

    Try XML::LibXML if you have no problem with installing libxml2. Its interface is very similar to XML::XPath, as they both implement the DOM. XML::LibXML is also much more powerful than XML::XPath, and faster to boot. If you want an expat/XML::Parser based module, then you might want to have a look at XML::Twig (that's blatant self-promotion as I am the author of the module, sorry). Also, for HTML/dodgy XHTML, you can use HTML::TreeBuilder, which, with the addition of HTML::TreeBuilder::XPath (also by me), supports XPath.
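
For anyone who takes the XML::LibXML route suggested above, here is a minimal sketch (not from the original answers) of parsing an XHTML document without fetching its DTD. It assumes XML::LibXML is installed; the inline document is a hypothetical stand-in for a.xhtml, and the `x` prefix is an arbitrary name chosen for the XHTML namespace.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for a.xhtml, with the same DOCTYPE as in the question.
my $xhtml = <<'XHTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>demo</title></head>
  <body><div><table><tr><td>cell</td></tr></table></div></body>
</html>
XHTML

# load_ext_dtd => 0 skips the external DTD; no_network => 1 forbids network access.
my $doc = XML::LibXML->load_xml(
    string       => $xhtml,
    load_ext_dtd => 0,
    no_network   => 1,
);

# XHTML elements live in a namespace, so bind a prefix before running XPath.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );

my @tables = $xpc->findnodes('/x:html/x:body//x:table');
print scalar(@tables), " table(s) found\n";
```

To read the real file, `location => "a.xhtml"` can replace the `string` argument. With the DTD skipped, named entities defined only in the DTD (such as &nbsp;) will not be available, so a local XML catalog remains the better fit if the document relies on them.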
