Hi,
Somehow it seems to me that I once saw a function in BaseX's extensive module library that would recursively read a web page from a remote server, following all the links in it. But now I can't find it. Am I dreaming?
I guess you were dreaming ;) But it should definitely be possible to implement this in XQuery without too many lines of code.
Andreas Mixich mixich.andreas@gmail.com wrote on Tue, 31 July 2018, 01:43:
Hi,
Somehow it seems to me that I once saw a function in BaseX's extensive module library that would recursively read a web page from a remote server, following all the links in it. But now I can't find it. Am I dreaming?
-- Goody Bye, Minden jót, Mit freundlichen Grüßen, Andreas Mixich
On 31.07.2018 at 08:51, Christian Grün wrote:
I guess you were dreaming ;) But it should definitely be possible to implement this in XQuery without too many lines of code.
Ok, then that's what I am going to do. Thanks for the clarification.
Hi Andreas,
Just for fun, I wrote a little crawler in XQuery (see the attached files).
Please note that it’s just a stub; and it should surely be used decently, otherwise the remote server might block further access.
Cheers, Christian
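The attached crawler is not reproduced in this archive. As a rough idea of how such a recursive walk can look, here is a minimal sketch that is not Christian's code. It uses only standard XQuery 3.1 functions, and the names local:links, local:crawl and $pages, as well as the example.org pages, are hypothetical. The pages are served from an in-memory map so the query runs without network access. In BaseX, a fetch function along the lines of function($uri) { html:parse(fetch:binary($uri)) } could presumably be passed instead; that part is an assumption and untested here.

```xquery
xquery version "3.1";

(: Sample pages standing in for a remote site (hypothetical data). :)
declare variable $pages := map {
  'http://example.org/':
    <html><body><a href="a.html">A</a> <a href="/b.html#top">B</a></body></html>,
  'http://example.org/a.html':
    <html><body><a href="/">Home</a></body></html>,
  'http://example.org/b.html':
    <html><body><p>No links here.</p></body></html>
};

(: Absolute targets of all links in a page, without fragment identifiers.
   The *:a wildcard also matches elements in the XHTML namespace. :)
declare function local:links($page as node(), $base as xs:string) as xs:string* {
  distinct-values(
    for $href in $page//*:a/@href
    let $target := replace($href, '#.*$', '')
    where $target != ''
    return
      (: hrefs that are not valid URIs are skipped :)
      try { string(resolve-uri($target, $base)) } catch * { () }
  )
};

(: Visits the queued URIs one by one and returns all URIs visited.
   $seen prevents endless loops, $limit caps the number of pages. :)
declare function local:crawl(
  $queue as xs:string*,
  $seen  as xs:string*,
  $limit as xs:integer,
  $fetch as function(xs:string) as node()?
) as xs:string* {
  if (empty($queue) or count($seen) >= $limit) then $seen
  else
    let $uri  := head($queue)
    let $rest := tail($queue)
    return
      if ($uri = $seen) then local:crawl($rest, $seen, $limit, $fetch)
      else
        let $page := $fetch($uri)
        let $new  := if (exists($page)) then local:links($page, $uri) else ()
        return local:crawl(($rest, $new), ($seen, $uri), $limit, $fetch)
};

local:crawl('http://example.org/', (), 10, function($uri) { $pages($uri) })
```

With the sample data, the query returns the three example.org URIs, each visited once. Against a real server, a pause between requests would be needed, as noted above.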
On Wed, Aug 1, 2018 at 8:08 AM Andreas Mixich mixich.andreas@gmail.com wrote:
On 31.07.2018 at 08:51, Christian Grün wrote:
I guess you were dreaming ;) But it should definitely be possible to implement this in XQuery without too many lines of code.
Ok, then that's what I am going to do. Thanks for the clarification.
-- Goody Bye, Minden jót, Mit freundlichen Grüßen, Andreas Mixich
On 01.08.2018 at 09:56, Christian Grün wrote:
Just for fun, I wrote a little crawler in XQuery (see the attached files).
Very interesting, indeed! Nice to see an example of lazy:cache and prof:dump. I have not used them so far, so it is good to see them in action.
it should surely be used decently, otherwise the remote server might block further access.
Sure! That is one reason I am currently wrestling with some link analysis, so that the crawl can be limited to URIs of (a) certain kind(s).
P.S. Resending, since my MUA put the wrong address (Christian's) instead of the list's into the To: field and I forgot about it.
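Limiting a crawl to URIs of a certain kind can be as simple as a predicate that is applied to every extracted link before it is queued. The following is only a sketch under assumptions: local:allowed is a hypothetical name, the prefix test stands in for whatever "kind" is wanted, and the extension list is an arbitrary example.

```xquery
xquery version "3.1";

(: True if $uri belongs to the site section given by $prefix and does not
   look like a binary download (the extension list is an arbitrary example). :)
declare function local:allowed(
  $uri    as xs:string,
  $prefix as xs:string
) as xs:boolean {
  starts-with($uri, $prefix)
  and not(matches($uri, '\.(png|jpe?g|gif|pdf|zip)$', 'i'))
};

let $candidates := (
  'http://example.org/docs/intro.html',
  'http://example.org/docs/manual.pdf',
  'http://example.org/blog/post.html',
  'http://other.example.com/docs/'
)
return $candidates[local:allowed(., 'http://example.org/docs/')]
```

Of the four sample URIs, only http://example.org/docs/intro.html passes: the PDF is rejected by its extension, the other two by the prefix test.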
Hi Andreas, Christian,
Attached is a module that I wrote a while ago to limit the rate of requests sent to a web server. This module has been useful for accessing APIs where the SLA does not allow more than a certain number of requests per minute, and it might be useful for this web crawling scenario, although Christian's crawler module already has a sleep built into it.
Cheers, Vincent
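The attached module is not reproduced in this archive. The core calculation behind a limit of "at most N requests per time window" can be sketched as a pure function. This is an illustration, not Vincent's implementation. The name local:wait-ms and its parameters are hypothetical, timestamps are plain millisecond integers, and $max is assumed to be at least 1. In BaseX, the current time and the actual pause would presumably come from prof:current-ms() and prof:sleep(), which is an assumption about how one would wire it up.

```xquery
xquery version "3.1";

(: Milliseconds to wait before the next request may be sent.
   $history: timestamps of earlier requests, $now: current time,
   $max: allowed requests per window, $window: window length in ms. :)
declare function local:wait-ms(
  $history as xs:integer*,
  $now     as xs:integer,
  $max     as xs:integer,
  $window  as xs:integer
) as xs:integer {
  (: only requests inside the current window count, oldest first :)
  let $recent := sort($history[. > $now - $window])
  return
    if (count($recent) < $max) then 0
    (: wait until the $max-th most recent request leaves the window :)
    else $recent[last() - $max + 1] + $window - $now
};

(: three requests at 0, 200 and 400 ms; limit: 3 per 1000 ms; now: 500 ms :)
local:wait-ms((0, 200, 400), 500, 3, 1000)
```

In this example the limit is already reached, so the result is 500: the request sent at 0 ms leaves the one-second window at 1000 ms.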
-----Original Message-----
From: BaseX-Talk basex-talk-bounces@mailman.uni-konstanz.de On Behalf Of Christian Grün
Sent: Wednesday, August 01, 2018 3:57 AM
To: Andreas Mixich mixich.andreas@gmail.com
Cc: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Wasn't there a function, that would walk a website?
Hi Andreas,
Just for fun, I wrote a little crawler in XQuery (see the attached files).
Please note that it’s just a stub; and it should surely be used decently, otherwise the remote server might block further access.
Cheers, Christian
Vincent Lizzi wrote:
Hi Andreas, Christian,
Attached is a module that I wrote a while ago to limit the rate of requests sent to a web server. This module has been useful for accessing APIs where the SLA does not allow more than a certain number of requests per minute, and it might be useful for this web crawling scenario, although Christian's crawler module already has a sleep built into it.
Nice! I think I'll just lean back and wait for the code to roll in ;-)
Finally, a full example of unit testing in the wild! I never really could wrap my mind around it. It is also interesting to see the call to the Java environment. These are advanced techniques (though they look simple enough, at least to start with) that I have yet to learn.