Hi Christian,
what I failed to mention last time was that I was using the
offset / limit mode of file:read-text-lines. I never tried
to load the whole file into memory with the previous version,
because I thought it would be inefficient. I just tried now with
the latest snapshot using a single core and while the whole file
is being loaded into memory (4GB+), the process completes in
about 120 seconds, which is fine for me. Using the offset mode
still appears to be more memory-efficient (stays around 1-1.3 GB),
but is very slow (both single core and multi core).
One issue: I can't make the non-offset version work with
fork-join. It fills the whole memory quickly, so I guess it
reads the whole file into memory for each thread(?) - I tried up
to 12GB. I've also noticed that in both versions (old and new
snapshot), interrupting the fork-join mode will keep the threads
running until I manually kill the BaseX process. Maybe I'm doing
something wrong, or maybe I'm asking too much from fork-join :)
I will try with the window clause tomorrow, maybe it will help.
I'm posting an example of my code to better explain my use
case. For now, it is fine because I'm only reading a 4GB file,
but potentially I might have to read up to 200GB files, so having
multi-core capabilities will help.
let $data := file:read-text-lines($file, "UTF-8", false())
let $count := count($data)
let $all :=
  (: one function item per line, evaluated in parallel :)
  xquery:fork-join(
    for $i in $data
    return function() {
      parse-json($i)?('object1')?*?('object2')?('object3')
    }
  )
return distinct-values($all)
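For reference, the window clause idea mentioned above could be
sketched roughly as follows. This is an untested sketch, not a
verified fix: $chunk-size is a hypothetical value, and $file is
assumed to be bound externally. It hands fork-join one function
per batch of lines instead of one per line. Whether that keeps
the whole file out of memory is an open question, since the
batches are still built before fork-join starts.

```xquery
(: Sketch only: $file and $chunk-size are assumptions. :)
declare variable $file external;
declare variable $chunk-size := 10000;

let $tasks :=
  (: group consecutive lines into batches of $chunk-size;
     the last, shorter batch is kept as well :)
  for tumbling window $lines
    in file:read-text-lines($file, "UTF-8", false())
    start at $s when true()
    end at $e when $e - $s eq $chunk-size - 1
  return function() {
    (: reduce each batch to its distinct values early :)
    distinct-values(
      for $line in $lines
      return parse-json($line)?('object1')?*?('object2')?('object3')
    )
  }
return distinct-values(xquery:fork-join($tasks))
```

Deduplicating inside each batch keeps the per-task results small
before the final distinct-values call merges them.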
Regards,
George
Hi George,

I’m glad to announce that files are now processed in an
iterative manner [1,2]. That’s something I wanted to try a while
ago, and your mail was another motivation to get it done. It
works pretty well: I reduced the JVM memory to a tiny maximum of
4 MB, and I managed to count the lines of a file of several
gigabytes:

  count(file:read-text-lines('huge.txt'))

I’d be interested to hear if your code runs faster with the
latest snapshot.

Christian

[1] http://files.basex.org/releases/latest/
[2] https://github.com/BaseXdb/basex/commit/cfb7a7965de85139ec9595a6e79a45d873da7c25