Hi George,
an interesting use case. Reading the lines of a text file feels like a natural candidate for iterative processing. However, as we need to ensure that the accessed file will eventually be closed, it is completely parsed before its contents can be accessed (all this happens in [1]). In the future, we could possibly avoid this by registering file handles in the global query context and closing files that remain open after query execution.
What are your experiences with using a single thread? If memory consumption is too high, you could play with the window clause of the FLWOR expression [2,3]. It takes some time to explore the full magic of this XQuery 3.0 extension (the syntax is somewhat verbose), but it's often a good alternative to complex functional code.
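To illustrate, here is a minimal sketch of a tumbling window that processes the file in batches of 10000 lines (the file name and the per-batch count are placeholders; replace them with your actual input and analysis logic):

```xquery
(: Sketch: read lines lazily and group them into batches of 10000
   with a tumbling window, instead of manual fork-join batching. :)
for tumbling window $batch in file:read-text-lines('data.txt')
  start at $s when true()
  end at $e when $e - $s = 9999
return count($batch)  (: placeholder: run your per-batch analysis here :)
```

Each iteration binds `$batch` to a sequence of up to 10000 lines, so only one batch needs to be materialized at a time.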
Feel free to keep us updated, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
[2] http://docs.basex.org/wiki/XQuery_3.0#window
[3] https://www.w3.org/TR/xquery-30/#id-windows
On Tue, Jan 15, 2019 at 11:21 AM George Sofianos gsf.greece@gmail.com wrote:
Hello,
I'm trying to read a 4GB text file with 5 million lines and parse its contents. I'm using the file:read-text-lines function to do that. I managed to use fork-join with 16 CPU threads to read the whole file, reading 10000 lines in each iteration, but it still takes 500 seconds to parse and analyze the data. Using a profiler, I can see that most of the time is spent reading each line (in the readLine method). I plan to make some changes to the code tonight and see if I can find a way to read it faster, but I thought I should also post here in case you have any tips. I'm also very inexperienced with using profilers, so I hope I read the output correctly :)
Regards,
George