The EXPath HTTP Client does seem to provide low-level HTTP access. I am hoping to find an XQuery library that implements common features such as cookies and
authentication on top of the HTTP Client, but I haven’t come across one yet. There are a few OAuth implementations for authentication, though.
I’ll have a look at XML Calabash’s HTTP cookie handling.
Fortunately, the project I am currently working on does not need authentication. Here is the code that I currently have working. A query can fetch one or more URLs by calling
local:httpGet(), which first sends a request to the first URL to obtain the cookies that the web site requires, and then sends a request for each URL provided and returns each web page.
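(: Builds a Cookie request header from the Set-Cookie headers of a response.
   Only the name=value part of each cookie is kept; attributes such as Path are dropped. :)
declare function local:httpResponseCookies($response as element(http:response)) as element(http:header)? {
  let $setCookies := $response/http:header[lower-case(@name) = 'set-cookie']/@value/data()
  (: tokenize(...)[1] keeps the whole value when a cookie has no attributes :)
  let $cookies := string-join(for $cookie in $setCookies return normalize-space(tokenize($cookie, ';')[1]), '; ')
  where $cookies
  return <http:header name="Cookie" value="{$cookies}"/>
};
(: Fetches each URL, sending the cookies obtained from an initial request to the first URL. :)
declare function local:httpGet($urls as xs:string+) as element(page)* {
  let $first := http:send-request(<http:request method='get'/>, $urls[1])
  let $cookieHeader := local:httpResponseCookies($first[1])
  for $url in $urls
  let $response := http:send-request(<http:request method='get'>
      {$cookieHeader}
    </http:request>, $url)
  (: $response[1] is the http:response element; $response[2] is the page body :)
  return element page { attribute url { $url }, $response[2] }
};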
Thanks,
Vincent
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de]
On Behalf Of Andy Bunce
Sent: Tuesday, July 14, 2015 12:11 PM
To: Florent Georges
Cc: BaseX
Subject: Re: [basex-talk] HTTP module and cookies
In my experience the case that causes the most problems is the authentication redirect. I have never tried this with BaseX, but I have been very grateful in the past that XML Calabash implements this:
"The exception arises in the case of redirection. If a redirect response includes cookies, those cookies are forwarded as appropriate to the redirected location when the redirection is followed."
[1]
/Andy
[1] http://xprocbook.com/book/refentry-19.html#cookies
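A minimal sketch of how the redirect case might be handled by hand with the EXPath HTTP Client, assuming the processor honours the follow-redirect attribute on http:request. The function names local:cookie-pairs and local:get-following, the redirect limit, and the example URL are hypothetical and are not part of any library. The sketch forwards every cookie it has seen and does not check cookie domain, path or expiry, so it is looser than the "as appropriate" behaviour quoted above.

```xquery
declare namespace http = "http://expath.org/ns/http-client";

(: Hypothetical helper: the name=value pairs from the Set-Cookie
   headers of a response, with cookie attributes dropped. :)
declare function local:cookie-pairs(
  $response as element(http:response)
) as xs:string* {
  for $value in $response/http:header[lower-case(@name) = 'set-cookie']/@value
  return normalize-space(tokenize($value, ';')[1])
};

(: Hypothetical helper: GET a URL, following up to $max redirects by hand
   and forwarding the cookies collected along the way. :)
declare function local:get-following(
  $url as xs:string, $cookies as xs:string*, $max as xs:integer
) as item()+ {
  let $request :=
    <http:request method="get" follow-redirect="false">{
      if (exists($cookies))
      then <http:header name="Cookie" value="{string-join($cookies, '; ')}"/>
      else ()
    }</http:request>
  let $response := http:send-request($request, $url)
  (: the first item is the http:response element, the rest is the body :)
  let $head := $response[1]
  let $location := $head/http:header[lower-case(@name) = 'location']/@value
  return
    if ($max > 0
        and $head/@status = ('301', '302', '303', '307', '308')
        and exists($location))
    then local:get-following(
      (: Location may be relative, so resolve it against the current URL :)
      string(resolve-uri($location[1], $url)),
      ($cookies, local:cookie-pairs($head)),
      $max - 1)
    else $response
};

(: Example URL is a placeholder. :)
local:get-following('http://example.org/login', (), 5)
```

If the same cookie name is set more than once along the redirect chain, this sketch sends both values; a real cookie jar would keep only the latest one.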
On 10 July 2015 at 10:36, Florent Georges <fgeorges@fgeorges.org> wrote:
Hi,
  Correct me if I am wrong, but I believe the HTTP Client in BaseX is
the EXPath HTTP Client?  It was indeed designed to provide access to
low-level, raw HTTP.  It does not contain many higher-level
features built on top of HTTP itself.  Indeed, you have to handle
cookies yourself, for instance.
The difficulty here, if I am right, is the side-effects required to
pass information somehow (in a hidden way) between 2 different HTTP
requests.
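One way around the hidden side-effects is to pass the cookies along explicitly. The following is a minimal sketch, assuming an XQuery 3.1 processor (maps, the lookup operator and fold-left) and the EXPath HTTP Client. The names local:update-jar, local:jar-header and local:fetch-all and the example URLs are hypothetical. The "jar" here is just a map from cookie name to value that each request reads and each response updates; it ignores domain, path and expiry.

```xquery
declare namespace http = "http://expath.org/ns/http-client";

(: Hypothetical helper: add the cookies set by a response to a jar,
   modelled as a map from cookie name to value. Later values win. :)
declare function local:update-jar(
  $jar as map(*), $response as element(http:response)
) as map(*) {
  fold-left(
    $response/http:header[lower-case(@name) = 'set-cookie']/@value ! string(),
    $jar,
    function($acc, $set-cookie) {
      let $pair := normalize-space(tokenize($set-cookie, ';')[1])
      return
        if (contains($pair, '='))
        then map:put($acc, substring-before($pair, '='), substring-after($pair, '='))
        else $acc
    }
  )
};

(: Hypothetical helper: the Cookie request header for a jar,
   or the empty sequence if the jar is empty. :)
declare function local:jar-header($jar as map(*)) as element(http:header)? {
  let $pairs := (
    for $name in map:keys($jar)
    order by $name
    return $name || '=' || $jar($name)
  )
  where exists($pairs)
  return <http:header name="Cookie" value="{string-join($pairs, '; ')}"/>
};

(: Hypothetical helper: fetch each URL in order, handing the jar from one
   request to the next. Returns a map with the final jar and the pages. :)
declare function local:fetch-all($urls as xs:string*) as map(*) {
  fold-left($urls, map { 'jar': map {}, 'pages': () },
    function($state, $url) {
      let $response := http:send-request(
        <http:request method="get">{ local:jar-header($state?jar) }</http:request>,
        $url)
      return map {
        'jar': local:update-jar($state?jar, $response[1]),
        'pages': ($state?pages, <page url="{$url}">{ $response[2] }</page>)
      }
    })
};

(: Example URLs are placeholders. :)
local:fetch-all(('http://example.org/', 'http://example.org/data'))?pages
```

The information still travels between requests, but as an ordinary value threaded through fold-left rather than as hidden state, so the query stays side-effect free apart from the HTTP calls themselves.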
Any suggestion to improve the API is welcome (at least on the EXPath
mailing list, I don't want to speak for BaseX developers, but I am
pretty sure here as well :-)...)
Regards,
--
Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/
On 10 July 2015 at 11:13, Christian Grün wrote:
> Hi Vincent,
>
> So far, I'm not aware of a standard solution to handle and cache
> client-side cookies with BaseX. Could you show us your solution? It
> might help us to discuss alternative solutions.
>
> Best,
> Christian
>
>
>
> On Thu, Jul 9, 2015 at 8:30 PM, Lizzi, Vincent
> <Vincent.Lizzi@taylorandfrancis.com> wrote:
>> I am using BaseX to scrape data from a web site. This web site, probably
>> like many other websites, relies on cookies and if it does not receive the
>> expected cookies it delivers a page instructing you to enable cookies in
>> your browser. I was able to get this working by parsing the http:header
>> response to get the cookies to use in subsequent requests. This is the
>> second time I’ve done this, and even though this works it seems a bit hacky.
>> Is there a standard way of handling cookies using the HTTP Module or the
>> Fetch module? Or are there any well-written code examples available?
>>
>> In other environments typically you define a cookie jar in some way, and the
>> cookie jar is used (and is updated) automatically in all subsequent HTTP
>> requests. I’m hoping to find something similar in BaseX.
>>
>> Thanks,
>> Vincent