Hello,
I'm trying to read a 4 GB text file with 5 million lines and parse its contents. I'm using the file:read-text-lines function (http://docs.basex.org/wiki/File_Module#file:read-text-lines) to do that. I managed to use fork-join with 16 CPU threads to read the whole file, reading 10000 lines in each iteration, but it still takes 500 seconds to parse and analyze the data. Using a profiler, I can see that most of the time is spent reading each line - the readline method (https://github.com/BaseXdb/basex/blob/0ef57de84659263c565ec41fff666ba5fa4f07dd/basex-core/src/main/java/org/basex/io/in/NewlineInput.java). I plan to make some changes to the code tonight and see if I can find a way to read it faster, but I thought I should also post here in case you have any tips. I'm also very inexperienced with profilers, so I hope I read the output correctly :)
Regards,
George
Hi George,
an interesting use case. Reading the lines of a text file feels like a natural candidate for iterative processing. As we need to ensure that the accessed file will eventually be closed, it is currently parsed completely before its contents can be accessed (all of this happens in [1]). In future, we could possibly avoid this by registering file handles in the global query context and closing files that remain open after query execution.
What are your experiences with using a single thread? If memory consumption is too exhaustive, you could play with the window clause of the FLWOR expression [2,3]. It takes some time to explore the full magic of this XQuery 3.0 extension (the syntax is somewhat verbose), but it’s often a good alternative to complex functional code.
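For illustration, a minimal sketch of such window-based chunking (the 10000-line chunk size and the local:process function are placeholders, not part of your code):

    for tumbling window $chunk in file:read-text-lines('huge.txt')
      start at $s when true()
      end at $e when $e - $s eq 9999
    return local:process($chunk)

Each window binds up to 10000 lines to $chunk, so only one chunk needs to be in scope at a time.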
Feel free to keep us updated, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
[2] http://docs.basex.org/wiki/XQuery_3.0#window
[3] https://www.w3.org/TR/xquery-30/#id-windows
Hi Christian,
On 1/15/19 12:43 PM, Christian Grün wrote:
What are your experiences with using a single thread? If memory consumption is too exhaustive, you could play with the window clause of the FLWOR expression [2,3]. It takes some time to explore the full magic of this XQuery 3.0 extension (the syntax is somewhat verbose), but it’s often a good alternative to complex functional code.
Using a single thread looks to be OK too, at about 10k lines per second, and I'm not sure reading the same file with 16 threads (even on an SSD) is the way to go from an I/O point of view. Searching on Stack Overflow, there are many suggestions on how to read a file with one or multiple threads, e.g. [1].
I immediately return the data I need for each line (a small string, for example), so the memory consumption is low; I have provided 12 GB, but I never see more than 2-3 GB of memory usage. My initial thought was that garbage collection might be causing delays, but after profiling BaseX I don't think this is an issue. It's interesting to know about the window clause though, I will certainly find a use for it. While I know most of these features exist, I can always learn much more about the language. Only yesterday I managed to use fork-join successfully, and I think it will save me a lot of time and effort for my use cases. I will post again if I have any updates, thanks again,
George.
[1]: https://stackoverflow.com/questions/40412008/how-to-read-a-file-using-multip...
Hi George,
I’m glad to announce that files are now processed in an iterative manner [1,2]. That’s something I wanted to try a while ago, and your mail was another motivation to get it done.
It works pretty well: I reduced the JVM memory to a tiny maximum of 4 MB, and I managed to count the lines of a file with several gigabytes:
count(file:read-text-lines('huge.txt'))
I’d be interested to hear if your code runs faster with the latest snapshot. Christian
[1] http://files.basex.org/releases/latest/
[2] https://github.com/BaseXdb/basex/commit/cfb7a7965de85139ec9595a6e79a45d873da...
Wow, thanks for your fast response! I will give it a try tonight,
George.
Hi Christian,
what I failed to mention last time is that I was using the offset/length mode of file:read-text-lines. I never tried to load the whole file into memory with the previous version, because I thought it would be inefficient. I just tried now with the latest snapshot using a single core, and while the whole file is loaded into memory (4 GB+), the process completes in about 120 seconds, which is fine for me. Using the offset mode still looks to be more memory-efficient (it stays around 1-1.3 GB), but it is very slow (both single-core and multi-core).
One issue: I can't make the non-offset version work with fork-join. It fills the whole memory quickly, so I guess it reads the whole file into memory for each thread(?) - I tried with up to 12 GB. I've also noticed that in both versions (old and new snapshot), interrupting the fork-join mode will keep the threads running until I manually kill the BaseX process. Maybe I'm doing something wrong, or maybe I'm asking too much from fork-join :) I will try the window clause tomorrow, maybe it will help. I'm posting an example of my code to better explain my use case. For now, it is fine because I'm only reading a 4 GB file, but I might eventually have to read files of up to 200 GB, so having multi-core capabilities will help.
let $data := file:read-text-lines($file, "UTF-8", false())
let $count := count($data)
let $all := xquery:fork-join(
  for $i in $data
  return function() {
    parse-json($i)?('object1')?*?('object2')?('object3')
  }
)
return distinct-values($all)
Regards,
George
Using the offset mode looks to still be more memory efficient (stays around 1-1,3GB), but is very slow (both single core and multi core).
A general note on xquery:fork-join (you may be aware of this anyway): While it may sound enticing to use the function for as many jobs as possible, it is often slower than clever single-core processing. The reason is that (in your case) a file will be accessed by several competing threads, which leads to random-access I/O patterns that are difficult for the OS to schedule, even with SSDs. Even in Java programming, code that doesn't use the Java 8 streaming features and instead relies on the internal JVM optimizations for distributing atomic operations to multiple cores is often faster.
let $data := file:read-text-lines($file, "UTF-8", false())
let $count := count($data)
let $all := xquery:fork-join(
  for $i in $data
  return function() {
    parse-json($i)?('object1')?*?('object2')?('object3')
  }
)
This code will potentially create thousands or millions of Java threads. Maybe you are getting better results by splitting your input into 4 or 8 parts, and process each part in a dedicated function.
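A hypothetical sketch of such a split, using the offset/length parameters of file:read-text-lines (the part count of 8 and the local:process function are placeholders):

    let $parts := 8
    let $size  := xs:integer(ceiling(count(file:read-text-lines($file)) div $parts))
    return xquery:fork-join(
      for $p in 0 to $parts - 1
      return function() {
        for $line in file:read-text-lines($file, 'UTF-8', false(), $p * $size, $size)
        return local:process($line)
      }
    )

This creates one function (and thus one thread) per part instead of one per line.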
I would indeed assume that the following code…
distinct-values(
  for $line in file:read-text-lines($file, "UTF-8", false())
  return parse-json($line)?('object1')?*?('object2')?('object3')
)
…will be most efficient, even if you process files of 100 GB or more (especially with the new, iterative approach).
Just posting to say I'm having a lot of fun with the updated read-text-lines function.
On 1/16/19 1:37 PM, Christian Grün wrote:
This code will potentially create thousands or millions of Java threads. Maybe you are getting better results by splitting your input into 4 or 8 parts, and process each part in a dedicated function.
I refactored the code to the following, and it completes in 60 seconds, of which 20 are for counting the lines and only 40 seconds for parsing and returning the correct data!!! So I get a 3x improvement from multiple threads. I have no idea if it stresses the SSD at all.
let $file := "/path/to/large.txt"
let $count := prof:time(count(file:read-text-lines($file, "UTF-8", false())), "COUNTING: ")
let $cpus := 15
let $parts := ($count div $cpus) => xs:integer() => trace("PER CORE: ")
let $all := xquery:fork-join(
  for $cpu in 0 to $cpus
  return function() {
    let $offset := $cpu * $parts
    let $length := $parts
    for $line in file:read-text-lines($file, "UTF-8", false(), $offset, $length)
    return parse-json($line)?('obj1')?*?('obj2')?('obj3')
  }
) => prof:time("CALCULATING: ")
return distinct-values($all)
I would indeed assume that the following code…
distinct-values(
  for $line in file:read-text-lines($file, "UTF-8", false())
  return parse-json($line)?('object1')?*?('object2')?('object3')
)
…will be most efficient, even if you process files of 100 GB or more (especially with the new, iterative approach).
Indeed, it also uses tiny amounts of memory and completes in the same time (120 seconds) as loading the whole file into memory on a single core :)
George.
There also looks to be a difference in how read-text-lines is used. The following similar queries produce different query plans and have different memory usage. This is probably why I can't benefit from the update in more complex queries.
1) return count(file:read-text-lines($file, "UTF-8", false()))
Memory usage: about 20 MB
Query plan:
<QueryPlan compiled="true" updating="false">
  <FnCount name="count(items)" type="xs:integer" size="1">
    <FileReadTextLines name="read-text-lines(path[,encoding[,fallback[,offset[,length]]]])" type="xs:string*">
      <Str type="xs:string">/home/lumiel/eworx/betmechs/bme/webservice/samples/betfair/September-2015/output.json</Str>
      <Str type="xs:string">UTF-8</Str>
      <Bln type="xs:boolean">false</Bln>
    </FileReadTextLines>
  </FnCount>
</QueryPlan>
2) let $data := file:read-text-lines($file, "UTF-8", false()) return count($data)
Memory usage: 4.5GB
Query plan:
<QueryPlan compiled="true" updating="false">
  <GFLWOR type="xs:integer" size="1">
    <Let type="xs:string*">
      <Var name="$data" id="1" type="xs:string*"/>
      <FileReadTextLines name="read-text-lines(path[,encoding[,fallback[,offset[,length]]]])" type="xs:string*">
        <Str type="xs:string">/full/path/file.txt</Str>
        <Str type="xs:string">UTF-8</Str>
        <Bln type="xs:boolean">false</Bln>
      </FileReadTextLines>
    </Let>
    <FnCount name="count(items)" type="xs:integer" size="1">
      <VarRef type="xs:string*">
        <Var name="$data" id="1" type="xs:string*"/>
      </VarRef>
    </FnCount>
  </GFLWOR>
</QueryPlan>
The reason for that: file:read-text-lines is a non-deterministic function. Each invocation might yield different results (as the file contents may change in the background). This is different with deterministic function calls, such as fn:doc('abc.xml'): if you call such a function repeatedly, it will always access the same document, which has been opened and parsed by the first call of this function.
- return count(file:read-text-lines($file, "UTF-8", false()))
Here, file processing will be iterative.
- let $data := file:read-text-lines($file, "UTF-8", false()) return count($data)
The file contents will be bound to $data, and counted in a second step. If the expression of your let clause was deterministic, the variable would be inlined, and the resulting query plan would be identical to the one of your first query.
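To illustrate the inlining with a hypothetical example: a deterministic binding such as

    let $n := (1 to 1000000)
    return count($n)

can be inlined and rewritten to count(1 to 1000000), which is evaluated iteratively, whereas a non-deterministic binding blocks this rewriting, so the bound sequence has to be materialized first.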
Thanks Christian, I will check the code examples you posted tonight, your explanation makes it easier to understand.
I can see there is a list of the deterministic functions in the specs [1], but I'm not so sure about the BaseX-specific functions. Is it possible to know whether a function is deterministic or not?
I tried file:read-text-lines("/path.txt") is file:read-text-lines("/path.txt"), but it doesn't work.
George.
[1] - https://www.w3.org/TR/xpath-functions-31/#dt-deterministic
I can see there is a list with the deterministic functions in the specs [1] but not so sure about the BaseX specific functions. Is it possible to know if a function is deterministic or not?
You can have a look into the appropriate Java class, and check for functions tagged with NDT:
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
If we have some more time, we might include this information in our Wiki.
I tried file:read-text-lines("/path.txt") is file:read-text-lines("/path.txt") but it doesn't work.
The "is" operator operates only on nodes.
basex-talk@mailman.uni-konstanz.de