Hi George,
an interesting use case. Reading the lines of a text file feels like a natural candidate for iterative processing. However, as we need to ensure that the accessed file will eventually be closed, it is completely parsed before its contents can be accessed (all this happens in [1]). In the future, we could possibly avoid this by registering file handles in the global query context and closing files that remain open after query execution.
What are your experiences with using a single thread? If memory consumption is too high, you could play with the window clause of the FLWOR expression [2,3]. It takes some time to explore the full magic of this XQuery 3.0 extension (the syntax is somewhat verbose), but it's often a good alternative to complex functional code.
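To illustrate, here is a minimal sketch of a tumbling window that processes the file in batches of 10000 lines (the file name and the per-batch count are placeholders; replace them with your actual input and analysis logic):

```xquery
(: Sketch: read lines lazily and group them into batches of 10000
   with a tumbling window, instead of manual fork-join batching. :)
for tumbling window $batch in file:read-text-lines('data.txt')
  start at $s when true()
  end at $e when $e - $s = 9999
return count($batch)  (: placeholder: run your per-batch analysis here :)
```

Each iteration binds `$batch` to a sequence of up to 10000 lines, so only one batch needs to be materialized at a time.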
Feel free to keep us updated, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
[2] http://docs.basex.org/wiki/XQuery_3.0#window
[3] https://www.w3.org/TR/xquery-30/#id-windows
On Tue, Jan 15, 2019 at 11:21 AM George Sofianos gsf.greece@gmail.com wrote:
Hello,
I'm trying to read a 4GB text file with 5 million lines and parse its contents. I'm using the file:read-text-lines function to do that. I managed to use fork-join with 16 CPU threads to read the whole file, reading 10000 lines in each iteration, but it still takes 500 seconds to parse and analyze the data. Using a profiler, I can see that most of the time is spent reading each line (in the readLine method). I plan to make some changes to the code tonight and see if I can find a way to read it faster, but I thought I should also post here in case you have any tips. I'm also very inexperienced with using profilers, so I hope I read the output correctly :)
Regards,
George