Hi,
I am trying to run two BaseX scripts in parallel using:
xquery:fork-join((function() { xquery:eval(xs:anyURI('extract_from_ocr1.xq')) }, function() { xquery:eval(xs:anyURI('extract_from_ocr2.xq')) }))
As far as I can tell (see my reasoning below), the scripts do run in parallel, but the time saved compared with running them in sequence is small (~25s vs ~28s). Both files contain the same function, which reads files from a directory, performs some calculations, and saves the result to a file (the two scripts work on different directories). I infer that the scripts run in parallel because the result files are created at the same time.
I tried the same with GNU parallel, and in that case the scripts really do run in parallel.
Does anyone know why the execution time is not (more or less) halved in BaseX? Thanks.
Ciao, Giuseppe
Hi Giuseppe,
As long as the files are not on physically different disks, the two functions will block each other with their read and write operations. Also, BaseX already runs a lot of code in parallel without you explicitly telling it to.
Best regards,
Markus
(In reply to celano@informatik.uni-leipzig.de, 08.12.2019, 16:48)
Hi,
I see the same in my application. My two cents: I would say most disks today are fast enough to mask this problem, let alone SSDs, which can happily fetch two files at (almost) the same time. But the thing is: the existing code uses some pretty heavy locks to make sure no two Java threads access the same (database) file at the same time. And unless data safety is really given some thought, I am glad that it does not allow queries to run in parallel. I would love to solve this in a more state-of-the-art way, but I got burned by multithreading in the past. So I have great respect for any good, safe and fast implementation of multithreaded file access. I fear no one has written one for BaseX yet.
Best regards
Omar Siam
(In reply to Markus Wittenberg, 08.12.2019, 17:04)
Thanks for your answers!
I ran an experiment, and I can confirm that fork-join() does work, even if the gain is smaller than expected. Most importantly, I noticed that the amount of RAM made available is crucial: with 2MB the sequential script was very slow, while with 5/7MB it works fine.
(2.8 GHz Quad-Core Intel Core i7)
Sequential: about 47 s. fork-join: about 40 s. GNU parallel: about 30 s.
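For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how the fork-join run could be timed from inside BaseX. It assumes the two script files from the original question sit in the working directory, and it relies on BaseX evaluating the items of a sequence in order, which the XQuery specification does not guarantee. Treat the measurement as a rough indication only.

```xquery
(: Take a timestamp before the parallel run starts. :)
let $start := prof:current-ms()
return (
  (: Run both scripts via fork-join, as in the original question. :)
  xquery:fork-join((
    function() { xquery:eval(xs:anyURI('extract_from_ocr1.xq')) },
    function() { xquery:eval(xs:anyURI('extract_from_ocr2.xq')) }
  )),
  (: Assumption: this item is evaluated after the call above has returned. :)
  'elapsed ms: ' || (prof:current-ms() - $start)
)
```

Replacing the fork-join call with two plain xquery:eval calls in a sequence should give the sequential figure for comparison.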
Best, Giuseppe
(In reply to Omar Siam, Omar.Siam@oeaw.ac.at, 9. Dec 2019, 16:58)
I forgot to mention that I often use fork-join() with proc:execute() to run more than one instance of the OCR engine "tesseract" in parallel (I can have more than 1000 images to OCR): it works fabulously (up to 8 processes on my Quad-Core Intel Core i7). More generally, when it comes to running system programs, fork-join() + proc:execute() is extremely useful.
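A minimal sketch of what such a fork-join() + proc:execute() combination might look like. The directory path, the .png extension, and the output naming are hypothetical, and it assumes the tesseract executable is on the PATH. This is not the exact script used for the numbers above.

```xquery
(: Hypothetical directory of PNG scans; adjust to your setup. :)
let $dir := '/path/to/images/'
(: file:list returns paths relative to $dir; '*.png' is a glob pattern. :)
let $images := file:list($dir, false(), '*.png')
return xquery:fork-join(
  for $image in $images
  (: Output base name without extension; tesseract appends .txt itself. :)
  let $base := $dir || replace($image, '\.png$', '')
  return function() {
    (: One tesseract process per image; returns a <result/> element. :)
    proc:execute('tesseract', ($dir || $image, $base))
  }
)
```

Each call returns a result element containing the process output, error text and exit code, so failed images can be spotted by inspecting the code child of each result.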
Giuseppe
(Follow-up to the message from Giuseppe G. A. Celano, celano@informatik.uni-leipzig.de, 9. Dec 2019, 18:54)
Hi Giuseppe,
Maybe you can consult the mailing list archive for more information on concurrent/parallel query processing; there has been a lot of discussion in the past (e.g. [1]).
Cheers, Christian
[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg12080.htm...
basex-talk@mailman.uni-konstanz.de