Dear Sophie,
I assume that TagSoup is missing in your BaseX classpath. TagSoup is responsible for converting HTML pages to XML (see [1] for more details). By calling html:parser(), you can find out if HTML can be correctly converted [2].
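As a quick check, here is a minimal sketch of the html:parser() call mentioned above. It assumes the behavior described in the HTML Module documentation [2], where an empty string means TagSoup was not found in the classpath; the returned messages are only illustrative.

```xquery
(: html:parser() returns the name of the HTML parser in use. :)
(: Assumption (per the HTML Module docs): an empty string means TagSoup is missing. :)
let $parser := html:parser()
return
  if ($parser = '')
  then 'TagSoup not found: input will be treated as XML'
  else concat('HTML parser in use: ', $parser)
```

If the first message comes back, html:parse will only accept input that is already well-formed XML, which would explain errors such as "Invalid character found".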
By the way, the following query is an alternative way to parse HTML to XML. It gives you more control over the individual steps (but, once again, TagSoup must be in the classpath for HTML to be imported successfully):
let $url := 'http://www.crealscience.fr/'
let $text := fetch:text($url)
let $xml := html:parse($text)
return $xml
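Building on the same steps, the list of URLs from the original question could be collected roughly as follows. This is only a sketch: it assumes TagSoup is available and the page is reachable, and it reuses the href filter from the original request. The variable name $href is an illustrative choice.

```xquery
(: Fetch the page as text, convert it to XML, then select the links. :)
let $url := 'http://www.crealscience.fr/'
let $xml := html:parse(fetch:text($url))
(: Keep only href attributes that contain "http", as in the original request. :)
for $href in $xml//@href[matches(., 'http')]
return string($href)
```

Returning string($href) yields the link values as plain strings rather than attribute nodes.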
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Parsers#HTML_Parser
[2] http://docs.basex.org/wiki/HTML_Module#html:parser
On Tue, Feb 18, 2014 at 6:36 PM, Sophie Petit sophiepetit@gmail.com wrote:
Dear BaseX team,
I'm using BaseX for my studies with Xavier-Laurent Salvador from Paris13. I've got a puzzling issue with html:parse. I'm running the request below, which uses html:parse to get a list of the URLs from a webpage, and I'm getting this error message: "Ligne 19: Invalid character found: '"' "
for $x in (html:parse(http:send-request(
  <http:request method='get'
    override-media-type=' application/octet-stream'
    href='http://www.crealscience.fr/'/>
)[2])//@href[matches(.,"http")])
return $x
The same request produces this kind of error with any URL: html:parse stops at any error in the page's HTML code. I added a header with "declare option output:method "text";" to the request, but it didn't solve the problem. If I insert the same request in a RESTXQ file, it works perfectly.
Do you have any suggestions to solve that problem?
Best, Sophie Petit (basex 7.8 on debian)
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk