Just posting to say I'm having a lot of fun with the updated
read-text-lines function.
> This code will potentially create thousands or millions of Java
> threads. Maybe you would get better results by splitting your input
> into 4 or 8 parts, and processing each part in a dedicated function.
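That splitting idea can be sketched as follows (untested, and the chunk count of 4 is just an example; it only uses xquery:fork-join and the offset/length arguments of file:read-text-lines, which also appear in the refactored code further down):

```xquery
(: Split the file into 4 chunks and read each chunk in its own function.
   The +1 on the chunk size ensures the chunks cover the whole file even
   when the line count is not evenly divisible by 4. :)
let $file  := "/path/to/large.txt"
let $count := count(file:read-text-lines($file, "UTF-8", false()))
let $size  := xs:integer($count div 4) + 1
return xquery:fork-join(
  for $i in 0 to 3
  return function() {
    file:read-text-lines($file, "UTF-8", false(), $i * $size, $size)
  }
)
```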
I refactored the code to the following, and it completes in 60
seconds: 20 for counting the lines and only 40 for parsing and
returning the correct data!!! So I get a 3x improvement from multiple
threads. I have no idea whether it stresses the SSD at all.
let $file  := "/path/to/large.txt"
let $count := prof:time(
  count(file:read-text-lines($file, "UTF-8", false())),
  "COUNTING: "
)
let $cpus  := 15
let $parts := ($count div $cpus) => xs:integer() => trace("PER CORE: ")
let $all   :=
  xquery:fork-join(
    for $cpu in 0 to $cpus
    return function() {
      let $offset := $cpu * $parts
      let $length := $parts
      for $line in file:read-text-lines($file, "UTF-8", false(),
        $offset, $length)
      return parse-json($line)?('obj1')?*?('obj2')?('obj3')
    }
  ) => prof:time("CALCULATING: ")
return distinct-values($all)
> I would indeed assume that the following code will be most efficient,
> even if you process files of 100 GB or more (especially with the new,
> iterative approach):
>
>   distinct-values(
>     for $line in file:read-text-lines($file, "UTF-8", false())
>     return parse-json($line)?('object1')?*?('object2')?('object3')
>   )
Indeed, it also uses only a tiny amount of memory, and it completes in
the same time (120 seconds) as loading the whole file into memory on a
single core :)
George.