I am using BaseX to scrape data from a web site. The site, probably like many others, relies on cookies; if it does not receive the expected cookies, it delivers a page instructing you to enable cookies in your browser. I was able to get this working by parsing the http:header elements of the response to get the cookies to send in subsequent requests. This is the second time I've done this, and even though it works, it seems a bit hacky. Is there a standard way of handling cookies using the HTTP Module or the Fetch Module? Or are there any well-written code examples available?
In other environments, you typically define a cookie jar in some way, and the cookie jar is then used (and updated) automatically in all subsequent HTTP requests. I'm hoping to find something similar in BaseX.
Thanks, Vincent
Hi Vincent,
So far, I'm not aware of a standard solution to handle and cache client-side cookies with BaseX. Could you show us your solution? It might help us to discuss alternative solutions.
Best, Christian
On Thu, Jul 9, 2015 at 8:30 PM, Lizzi, Vincent Vincent.Lizzi@taylorandfrancis.com wrote:
[...]
Hi,
Correct me if I am wrong, but I believe the HTTP Client in BaseX is the EXPath HTTP Client? It was indeed designed to provide access to low-level, raw HTTP. It does not contain many higher-level features built on top of HTTP itself. Indeed, you have to handle cookies yourself, for instance.
The difficulty here, if I am right, is the side effects required to pass information somehow (in a hidden way) between two different HTTP requests.
Any suggestion to improve the API is welcome (at least on the EXPath mailing list, I don't want to speak for BaseX developers, but I am pretty sure here as well :-)...)
Regards,
-- Florent Georges http://fgeorges.org/ http://h2oconsulting.be/
In my experience, the case that causes the most problems is the authentication redirect. I have never tried this with BaseX, but I have been very grateful in the past that XMLCalabash implements this:
"The exception arises in the case of redirection. If a redirect response includes cookies, those cookies are forwarded as appropriate to the redirected location when the redirection is followed." [1]
/Andy
[1] http://xprocbook.com/book/refentry-19.html#cookies
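The behaviour quoted above could be approximated by following redirects by hand. What follows is only a rough sketch, not something tested against a real site: the function names, the hop limit, and the example URL are made up for illustration. The follow-redirect attribute is defined by the EXPath HTTP Client specification; whether a given BaseX version honours it should be checked first.

```xquery
(: Hypothetical helper: extract the "name=value" part of every Set-Cookie
   header in a response, dropping attributes such as Path or Expires. :)
declare function local:set-cookie-pairs(
  $response as element(http:response)
) as xs:string* {
  for $value in $response/http:header[lower-case(@name) = 'set-cookie']/@value
  return normalize-space(tokenize($value, ';')[1])
};

(: Hypothetical helper: GET a URL and follow redirects manually, so that
   cookies set by a redirect response are sent on to the redirect target. :)
declare function local:get-following-redirects(
  $url     as xs:string,
  $cookies as xs:string*,   (: "name=value" pairs collected so far :)
  $hops    as xs:integer    (: number of redirects still allowed :)
) as item()* {
  let $response := http:send-request(
    <http:request method="get" follow-redirect="false">{
      if (empty($cookies)) then ()
      else <http:header name="Cookie" value="{ string-join($cookies, '; ') }"/>
    }</http:request>,
    $url
  )
  let $head := $response[1]
  let $location := $head/http:header[lower-case(@name) = 'location'][1]/@value
  return
    if (xs:integer($head/@status) = (301, 302, 303, 307, 308)
        and exists($location) and $hops > 0)
    then
      (: a Location header may be relative, so resolve it against the current URL :)
      local:get-following-redirects(
        string(resolve-uri($location, $url)),
        ($cookies, local:set-cookie-pairs($head)),
        $hops - 1
      )
    else $response
};
```

It would be called as local:get-following-redirects('http://example.org/login', (), 5). The sketch ignores cookie attributes (Domain, Path, Secure, expiry) and does not remove an older cookie when a newer one with the same name arrives, so it is only reasonable when all redirects stay on one site.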
On 10 July 2015 at 10:36, Florent Georges fgeorges@fgeorges.org wrote:
[...]
On 10 July 2015 at 11:13, Christian Grün wrote:
[...]
The EXPath HTTP Client does seem to provide low-level HTTP access. I am hoping to find an XQuery library that implements some common things, such as cookies and authentication, on top of the HTTP Client, but I haven’t come across such a library yet. There are a few OAuth implementations for authentication, though.
I’ll have a look at XML Calabash’s HTTP cookie handling.
Fortunately, authentication is not needed in my current project. Here is the code that I currently have working. A query can fetch URL(s) by calling local:httpGet(), which first makes a request to get the cookies that the web site requires, and then makes one request per URL to return each web page.
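(: Build a Cookie request header from the Set-Cookie headers of a response. :)
declare function local:httpResponseCookies(
  $response as element(http:response)
) as element(http:header) {
  let $setCookies := $response/http:header[@name = 'Set-Cookie']/@value/data()
  (: keep only the name=value pair; this also works for cookies without attributes :)
  let $cookies := string-join(
    for $cookie in $setCookies
    return normalize-space(tokenize($cookie, ';')[1]),
    '; '
  )
  return <http:header name="Cookie" value="{$cookies}"/>
};

(: Fetch each URL, sending the cookies obtained from an initial request. :)
declare function local:httpGet($urls as xs:string+) as element(page)* {
  (: first request: only used to collect the cookies that the site sets :)
  let $response := http:send-request(<http:request method='get'/>, $urls[1])
  for $url in $urls
  let $response := http:send-request(
    <http:request method='get'>{
      local:httpResponseCookies($response[self::http:response])
    }</http:request>,
    $url
  )
  (: item 1 of the result is the http:response element, item 2 the page body :)
  return element page { attribute url { $url }, $response[2] }
};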
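For comparison, the "cookie jar" idea can be had without any hidden state by passing a map of cookies explicitly from one request to the next. The following is only a minimal sketch under some assumptions: it needs XQuery 3.1 maps, the function names and example URLs are invented for illustration (they are not part of BaseX or the EXPath HTTP Client), and cookie attributes such as Domain, Path and expiry are ignored, so it is only suitable for requests to a single site.

```xquery
(: Hypothetical helper: merge the Set-Cookie headers of a response into a
   cookie map (name -> value); newer values replace older ones. :)
declare function local:update-jar(
  $jar      as map(xs:string, xs:string),
  $response as element(http:response)
) as map(xs:string, xs:string) {
  fold-left(
    $response/http:header[lower-case(@name) = 'set-cookie'],
    $jar,
    function($acc, $header) {
      let $pair := normalize-space(tokenize($header/@value, ';')[1])
      return
        if (contains($pair, '=')) then
          map:put($acc, substring-before($pair, '='), substring-after($pair, '='))
        else $acc
    }
  )
};

(: Hypothetical helper: turn the cookie map into a Cookie request header,
   or nothing if the map is empty. :)
declare function local:cookie-header(
  $jar as map(xs:string, xs:string)
) as element(http:header)? {
  if (map:size($jar) = 0) then ()
  else
    <http:header name="Cookie" value="{
      string-join(map:keys($jar) ! (. || '=' || $jar(.)), '; ')
    }"/>
};

(: Fetch the URLs in order, carrying the cookie map from request to request. :)
declare function local:fetch-all($urls as xs:string*) as element(page)* {
  let $result := fold-left(
    $urls,
    map { 'jar': map { }, 'pages': () },
    function($state, $url) {
      let $response := http:send-request(
        <http:request method="get">{ local:cookie-header($state?jar) }</http:request>,
        $url
      )
      return map {
        'jar': local:update-jar($state?jar, $response[1]),
        'pages': ($state?pages, element page { attribute url { $url }, $response[2] })
      }
    }
  )
  return $result?pages
};
```

A call might look like local:fetch-all(('http://example.org/a', 'http://example.org/b')). Unlike local:httpGet(), this does not make a separate initial request, so the first page is fetched without cookies; the first URL may need to be listed twice if the site rejects cookie-less requests.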
Thanks, Vincent
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of Andy Bunce
Sent: Tuesday, July 14, 2015 12:11 PM
To: Florent Georges
Cc: BaseX
Subject: Re: [basex-talk] HTTP module and cookies
[...]
basex-talk@mailman.uni-konstanz.de