Hi,
If I use `fetch:xml($url, map{'parser':'html'})` all is fine!
The next one gives a correct result (although, in contrast to the browser, without namespaces and doctype):
let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
let $response := http:send-request(<http:request method='get'/>, $url)
return $response[2]
This creates a mess:
let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
let $response := http:send-request(<http:request method='get'/>, $url)
return html:parse($response[2])
gives (partial):
<html> <body>Category Theory for Programmers: The Preface | Bartosz Milewski's Programming Cafe/* */ if ( 'function' === typeof WPRemoteLogin ) { document.cookie = "wordpress_test_cookie=test; path=/"; if ( document.cookie.match( /(;|^)\s*wordpress_test_cookie=/ ) ) { WPRemoteLogin(); }
etc.
I also tried:
let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
let $response := http:send-request(<http:request method='get'/>, $url)
return html:parse($response[2], map {
  'html': false(),
  'lexical': true(),
  'nocdata': true(),
  'nodefaults': true(),
  'nons': false()
})
This adds only the xhtml namespace, but the rest is the same.
html:parser() -> "TagSoup"
I don't know enough about Java to check via the GUI which TagSoup version is in use. I am getting this result when using the GUI on Windows 10 with BaseX 9.0.1.
As I have seen, there was a thread on this list in 2016 about a possible replacement of TagSoup: https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg08928.htm...
What about https://about.validator.nu/htmlparser/ ?
Thank you.
Hi Andreas,
> What about https://about.validator.nu/htmlparser/ ?
Thanks for the pointer; I will have a look at this parser.
> This creates a mess:
> let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
> let $response := http:send-request(<http:request method='get'/>, $url)
> return html:parse($response[2])
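The reason is that the HTTP response body is of type node(). html:parse takes strings as arguments, so when you call html:parse, your node is implicitly atomized: only its concatenated text content is passed on, and all markup is lost. That text is what ends up inside the <body> element of your result.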
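To illustrate the atomization step, here is a minimal sketch that runs without any HTTP request. The $page element is a made-up stand-in for the node returned in $response[2]; it is not the actual page content.

```xquery
(: Hypothetical stand-in for the already parsed node in $response[2] :)
let $page :=
  <html>
    <head>
      <title>Some title</title>
      <script>alert(1);</script>
    </head>
    <body><p>Hello</p></body>
  </html>
return (
  (: true: the value is an element node, not a string :)
  $page instance of node(),
  (: atomization keeps only the text nodes; all tags are gone :)
  normalize-space(data($page))
)
```

Passing a value like the second result to html:parse gives the parser plain text without any tags, which is consistent with the output shown in the original question.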
The following query should do the job:
let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
let $response := http:send-request(
  <http:request method='get' override-media-type='text/plain' href='{ $url }'/>
)[2]
return html:parse($response, map { 'nons': false() })
By the way, while running your queries, I noticed that html:parse didn’t accept binary input anymore. This has been fixed in the latest snapshot [1].
Apart from that, we are currently working on an enhanced version of our HTTP Client Module (see [2]). Maybe we’ll drop the implicit response conversion in the new functions.
Cheers, Christian
[1] http://files.basex.org/releases/latest/
[2] https://github.com/BaseXdb/basex/issues/914
PS: You can also supply HTML parsing options via fetch:xml:
fetch:xml(
  'http://basex.org/',
  map { 'parser': 'html', 'htmlparser': map { 'nons': false() } }
)
In future, if you want to use the HTTP Module, your request could look as simple as this:
html:parse( http:get($url)?body, map { 'nons': false() } )
We are still working out if response parsing will be integrated in the http:get call:
let $serializer := map {
  'parser': 'html',
  'htmlparser': map { 'nons': false() }
}
return http:get($url, map { 'serializer': $serializer })('body')
Hi Christian,
thank you very much for your help. All is fine now :-)
You wrote:
> We are still working out if response parsing will be integrated in the http:get call:
In this very case (I am crawling parts of a website recursively, pulling the document content and the binary data (images), and converting it into ePub2, completely recomposing the HTML), I am very happy that the response header is available, since I can read out the media type and interpret the response code. I believe certain REST APIs also communicate additional information in the response header ('X-something: ' header fields). But as long as there is one function that returns the full-featured response, I think that is enough. The user could easily write wrappers around it.
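A wrapper of the kind I mean could look roughly like the sketch below. It relies only on the existing http:send-request function and on the http:response element it returns as its first item (status attribute, http:header and http:body children). The function name local:fetch and the map keys are made up for illustration, and the sketch assumes a non-multipart response.

```xquery
declare namespace http = "http://expath.org/ns/http-client";

(: Hypothetical helper: bundles status, media type, headers and body in a map :)
declare function local:fetch($url as xs:string) as map(*) {
  let $response := http:send-request(<http:request method='get'/>, $url)
  let $head := $response[1]
  return map {
    'status': xs:integer($head/@status),
    (: assumes a single-part response with one http:body child :)
    'media-type': string($head/http:body/@media-type),
    (: header names are lower-cased; the first value wins on duplicates :)
    'headers': map:merge(
      for $h in $head/http:header
      return map:entry(lower-case($h/@name), string($h/@value)),
      map { 'duplicates': 'use-first' }
    ),
    'body': tail($response)
  }
};

(: Usage: keep the body only if the request succeeded and returned an image :)
let $result := local:fetch('http://basex.org/')
where $result?status = 200 and starts-with($result?media-type, 'image/')
return $result?body
```

Custom 'X-something' fields would then be reachable through the 'headers' entry of the returned map.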
I have checked your reference to https://github.com/BaseXdb/basex/issues/914. I am pretty content with the response being presented as XML! In my opinion, it keeps the spirit of the XQuery process alive: a query against an XML backend. I do understand, though, that from a pure programming-language point of view, having the result as a map() may be more appealing. In the request, however, I prefer a map with options. Just my 2¢.