Re: [basex-talk] file:read-text-lines performance

16 Jan 2019


      ...
Using the offset mode still seems to be more memory-efficient (usage stays around 1 to 1.3 GB), but it is very slow (both single-core and multi-core).
A general note on xquery:fork-join (you may be aware of that anyway):
While it may sound enticing to use the function for as many jobs as
possible, it is often slower than clever single-core processing. The
reason is that (in your case) a file will be accessed by several
competing threads (which leads to random access I/O patterns, which
are difficult to schedule for the OS, even with SSDs). Even in Java
programming, code that doesn’t use the Java 8 streaming features and
instead relies on the internal JVM optimizations for distributing
atomic operations to multiple cores is often faster.
...
let $data := file:read-text-lines($file, "UTF-8", false())
let $count := count($data)
let $all :=
xquery:fork-join(
  for $i in $data return function() {
  parse-json($i)?('object1')?*?('object2')?('object3')
  }
)
This code will potentially create thousands or millions of Java
threads. You may get better results by splitting your input into 4 or
8 parts and processing each part in a dedicated function.
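A minimal sketch of that splitting idea, assuming the same JSON
structure as in your query. The file path, the $parts value and the
chunking arithmetic are illustrative assumptions, not tested against
your data. Note that it still reads all lines into memory first, so
it addresses the thread count, not the memory consumption:

```xquery
(: Hypothetical input path; replace with your own file. :)
let $file  := '/path/to/input.json'
(: Number of parallel jobs; 4 or 8 as suggested above (an assumption to tune). :)
let $parts := 4
let $data  := file:read-text-lines($file, "UTF-8", false())
let $count := count($data)
(: Chunk size, rounded up so that no line is dropped. :)
let $size  := ($count + $parts - 1) idiv $parts
let $all   := xquery:fork-join(
  for $p in 0 to $parts - 1
  (: Each function gets one contiguous slice of the lines. :)
  let $chunk := subsequence($data, $p * $size + 1, $size)
  return function() {
    for $line in $chunk
    return parse-json($line)?('object1')?*?('object2')?('object3')
  }
)
return distinct-values($all)
```

This way only $parts functions are passed to xquery:fork-join, instead
of one function per line.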
I would indeed assume that the following code…
distinct-values(
  for $line in file:read-text-lines($file, "UTF-8", false())
  return parse-json($line)?('object1')?*?('object2')?('object3')
)
…will be most efficient, even if you process files of 100 GB or more
(especially with the new, iterative approach).
