Hello again,
I implemented this and it looks like it works nicely (to be confirmed soon - I started a run on a 600k records collection).
This runs nicely, in that the machine doesn't run out of memory anymore. There is one thing I noticed, however (and that I had already noticed earlier when a big collection was being processed): any attempt to talk to the server seems not to work. Even when I connect via the command-line basexadmin and run a command such as "list" or "open db foo", I do not get a reply. I can see the commands in the log, though:
17:28:06.532 [127.0.0.1:33112] LOGIN admin OK
17:28:08.158 [127.0.0.1:33112] LIST
17:28:21.288 [127.0.0.1:33114] LOGIN admin OK
17:28:25.602 [127.0.0.1:33114] LIST
17:28:52.676 [127.0.0.1:33116] LOGIN admin OK
Could it be that the long session is blocking the output stream coming from the server?
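To make the suspicion concrete: a minimal, hypothetical model (this is not BaseX code, and I'm only assuming the server serializes commands on some shared lock) of how a long-running session could make other clients' commands show up in the log yet receive no reply until the session finishes:

```python
import threading
import time

# Hypothetical model, not BaseX's implementation: one shared lock
# that every command must acquire before it can answer the client.
server_lock = threading.Lock()
replies = []

def long_running_session():
    with server_lock:      # the streaming query holds the lock...
        time.sleep(0.5)    # ...for the whole duration of the iteration

def admin_command(name):
    # The command arrives (so it would appear in the server log)...
    with server_lock:      # ...but its reply waits for the lock
        replies.append(name)

t1 = threading.Thread(target=long_running_session)
t2 = threading.Thread(target=admin_command, args=("LIST",))
t1.start()
time.sleep(0.1)
t2.start()
time.sleep(0.2)
print(replies)             # still empty: LIST is queued behind the session
t1.join()
t2.join()
print(replies)             # ['LIST'] only after the session releases the lock
```

If something like this is what's happening, the "list" command isn't lost, just queued behind the long session's lock.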
Thanks,
Manuel
On Mon, May 21, 2012 at 4:40 PM, Manuel Bernhardt bernhardt.manuel@gmail.com wrote:
Hi Christian,
as you have already seen, all results are first cached by the client if they are requested via the iterative query protocol. In earlier versions of BaseX, results were returned in a purely iterative manner -- which was more convenient and flexible from a user's point of view, but led to numerous deadlocks if reading and writing queries were mixed.
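The difference between the two behaviours can be sketched in a few lines (a conceptual model only, not BaseX's actual code; `transform` stands in for whatever the server does per result):

```python
# Conceptual sketch of client-side caching vs. iterative results.

def transform(i):
    # Placeholder for the per-result work the server performs.
    return i * 2

def query_cached(items):
    # Current behaviour: materialize all results first, then hand the
    # finished list over. Any read lock can be released as soon as
    # this function returns.
    return [transform(i) for i in items]

def query_iterative(items):
    # Old behaviour: yield results one by one. The query (and any lock
    # it holds) stays open for as long as the client keeps iterating,
    # which is why mixing this with writing queries risked deadlocks.
    for i in items:
        yield transform(i)

print(query_cached([1, 2, 3]))              # [2, 4, 6]
print(list(query_iterative([1, 2, 3])))     # [2, 4, 6]
```

Same results either way; the difference is only how long the server-side query stays open.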
If you only need parts of the requested results, I would recommend to limit the number of results via XQuery, e.g. as follows:
(for $i in /record[@version = 0]
 order by $i/system/index
 return $i)[position() = 1 to 1000]
I had considered this, but haven't used that approach yet, mainly because I wanted to try the streaming approach first. So far our system has only used MongoDB, and we are used to working with cursors as query results, so I'm trying to keep things aligned with that where possible.
Next, it is important to note that the "order by" clause can get very expensive, as all results have to be cached anyway before they can be returned. Our top-k functions will probably give you better results if it's possible in your use case to limit the number of results [1].
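If I read the documentation right, the same limit could be expressed with hof:top-k-by along these lines (an untested sketch; the predicate, key and count are just placeholders, and the key is negated on the assumption that hof:top-k-by returns the items with the *largest* keys, whereas an ascending order by takes the smallest):

(: untested sketch: 1000 records with the smallest system/index :)
hof:top-k-by(
  /record[@version = 0],
  function($r) { -xs:integer($r/system/index) },
  1000
)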
Ok, thanks. If this becomes a problem, I'll consider using this. By the way, is the reported query time of 0.06 ms the actual time the query takes to run? If so, I'm not too worried about query performance :) In general, the bottleneck in our system is not so much the querying but rather the processing of the records - I started rewriting that part to run concurrently using Akka, but am now stuck with a classloader deadlock (no pun intended). It will likely take quite some effort for the processing to be faster than the query iteration.
A popular alternative to client-side caching (well, you mentioned that already) is to override the code of the query client and process the returned results directly. Note, however, that you need to loop through all results, even if you only need part of them.
I implemented this and it looks like it works nicely (to be confirmed soon - I started a run on a 600k records collection).
Thanks for your time!
Manuel
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Higher-Order_Functions_Module#hof:top-k-by