Hi Christian,

What I failed to mention last time is that I was using the offset/limit mode of file:read-text-lines. I never tried loading the whole file into memory with the previous version because I assumed it would be inefficient. I have now tried that with the latest snapshot on a single core: although the whole file (4 GB+) is loaded into memory, the process completes in about 120 seconds, which is fine for me. The offset mode still appears to be more memory-efficient (it stays at around 1-1.3 GB), but it is very slow, both single-core and multi-core.

One issue: I can't get the non-offset version to work with fork-join. It quickly fills all available memory (I tried up to 12 GB), so I guess the whole file is read into memory for each thread(?). I've also noticed that with both versions (old and new snapshot), interrupting a fork-join query leaves the threads running until I manually kill the BaseX process. Maybe I'm doing something wrong, or maybe I'm asking too much of fork-join :)

I will try the window clause tomorrow; maybe it will help. I'm posting an example of my code below to better explain my use case. For now this is fine because I'm only reading a 4 GB file, but I may eventually have to read files of up to 200 GB, so multi-core capabilities would help.

let $data := file:read-text-lines($file, "UTF-8", false())
let $count := count($data)

let $all :=
  xquery:fork-join(
    for $i in $data
    return function() {
      parse-json($i)?('object1')?*?('object2')?('object3')
    }
  )
return distinct-values($all)
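
For reference, this is roughly the window-based variant I have in mind. It is an untested sketch, not something I have measured: the chunk size of 10000 lines is an arbitrary assumption, and $file is declared external only to keep the snippet self-contained. The idea is to hand fork-join one function per chunk of lines instead of one function per line, and to deduplicate inside each chunk so that less data is returned to the main thread. The lines are still captured by the function items, so I would not expect this alone to lower the memory footprint.

```xquery
(: Path to the line-delimited JSON file; external only for illustration. :)
declare variable $file as xs:string external;

(: Assumed chunk size; would need tuning against the real data. :)
let $chunk-size := 10000

(: Build one function item per window of $chunk-size lines. :)
let $jobs :=
  for tumbling window $lines in file:read-text-lines($file, "UTF-8", false())
    start at $s when true()
    end at $e when $e - $s eq $chunk-size - 1
  return function() {
    (: Deduplicate per chunk to keep the intermediate results small. :)
    distinct-values(
      for $line in $lines
      return parse-json($line)?('object1')?*?('object2')?('object3')
    )
  }

(: Run the chunks in parallel, then deduplicate across chunks. :)
return distinct-values(xquery:fork-join($jobs))
```

The last window may be shorter than the chunk size; since the end condition is not declared with "only end", that partial window is still processed.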

Regards,

George

On 1/15/19 1:48 PM, Christian Grün wrote:
Hi George,

I’m glad to announce that files are now processed in an iterative
manner [1,2]. That’s something I wanted to try a while ago, and your
mail was another motivation to get it done.

It works pretty fine: I reduced the JVM memory to a tiny maximum of
4mb, and I managed to count the line numbers of a file with several
gigabytes:

  count(file:read-text-lines('huge.txt'))

I’d be interested to hear if your code runs faster with the latest snapshot.
Christian

[1] http://files.basex.org/releases/latest/
[2] https://github.com/BaseXdb/basex/commit/cfb7a7965de85139ec9595a6e79a45d873da7c25