running in parallel


celano＠informatik.uni-leipzig.de

8 Dec 2019

10:48 a.m.

Hi,

I am trying to run two BaseX scripts in parallel using:

xquery:fork-join((
  function() { xquery:eval(xs:anyURI('extract_from_ocr1.xq')) },
  function() { xquery:eval(xs:anyURI('extract_from_ocr2.xq')) }
))

As far as I can tell (see below), the scripts do run in parallel to some degree, but the time saved compared with running them in sequence is small (~25s vs ~28s). Both files contain the same function, which reads files from a directory, performs some calculation, and saves the result in a file (the two scripts work on different directories). I infer that the scripts run in parallel because the result files are created at the same time.

I tried the same with GNU parallel, and in that case the scripts really do run in parallel.

Do we know why the execution time is not (more or less) halved in BaseX? Thanks.

Ciao, Giuseppe


Markus Wittenberg

8 Dec 2019

11:04 a.m.

Hi Giuseppe,

as long as the files are not on physically different disks, you will have the two functions block each other with read and write operations. And BaseX runs lots of code in parallel without you explicitly telling it so.

Best regards,

Markus


-- Markus Wittenberg Tel +49 (0)341 248 475 36 Mail wittenberg@axxepta.de ---- axxepta solutions GmbH Lehmgrubenweg 17, 88131 Lindau Amtsgericht Berlin HRB 97544B Geschäftsführer: Karsten Becke, Maximilian Gärber

Omar Siam

9 Dec 2019

10:58 a.m.

Hi,

I see the same in my application. My two cents: I would say most disks today are fast enough to mask this problem, let alone SSDs, which can happily fetch two files at (almost) the same time. But the thing is: the existing code uses some pretty heavy locks to make sure no two Java threads access the same (database) file at the same time. Unless real thought is given to data safety, I am glad that it does not allow queries to run in parallel. I would love to solve this in a more state-of-the-art way, but I got burned in the past by multithreading. So I have great respect for any good, safe and fast implementation of multithreaded file access. I fear no one has written one yet for BaseX.

Best regards

Omar Siam


Giuseppe G. A. Celano

12:54 p.m.

Thanks for your answers!

I have run an experiment, and I confirm that fork-join() actually works, even if the gain is not as expected. Most importantly, I noticed that the amount of RAM made available is crucial: with 2MB the sequential script was very slow, while with 5/7MB it works fine.

(2.8 GHz quad-core Intel Core i7)

Sequential: about 47 s
fork-join: about 40 s
GNU parallel: about 30 s

Best, Giuseppe
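
A comparison like the one above could be scripted inside BaseX itself. The following is only a minimal sketch, not the setup used in the thread: it assumes a BaseX version that provides prof:track, and it reuses the two script names from the first post. The result keys 'sequential-ms' and 'fork-join-ms' are made up for illustration.

```xquery
(: Script names as given in the first post; adjust to your own files. :)
declare variable $scripts := ('extract_from_ocr1.xq', 'extract_from_ocr2.xq');

(: Run the scripts one after the other and record the elapsed time. :)
let $sequential := prof:track(
  for $s in $scripts
  return xquery:eval(xs:anyURI($s))
)
(: Run the same scripts via fork-join and record the elapsed time. :)
let $parallel := prof:track(
  xquery:fork-join(
    for $s in $scripts
    return function() { xquery:eval(xs:anyURI($s)) }
  )
)
(: prof:track returns a map; its 'time' entry is in milliseconds. :)
return map {
  'sequential-ms': $sequential?time,
  'fork-join-ms': $parallel?time
}
```

Because the second run may benefit from files already cached by the operating system, it is worth swapping the order of the two measurements or repeating them before drawing conclusions.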


Giuseppe G. A. Celano

3:23 p.m.

I forgot to mention that I often use fork-join() with proc:execute() to run more than one instance of the OCR engine “tesseract” in parallel (I can have more than 1000 images to OCR): it works fabulously (up to 8 processes on my quad-core Intel Core i7). More generally, when it comes to running system programs, fork-join() + proc:execute() is extremely useful.

Giuseppe
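
The combination described above might look roughly like the sketch below. It is an illustration under assumptions, not the script used in the thread: the directory path and the '*.png' pattern are hypothetical, and tesseract is assumed to be on the PATH and called in its basic form (input image, then output base name).

```xquery
(: Hypothetical image directory; must end with a slash. :)
declare variable $dir := '/path/to/images/';

xquery:fork-join(
  (: One function item per image; file:list returns relative file names. :)
  for $name in file:list($dir, false(), '*.png')
  let $image := $dir || $name
  (: tesseract appends '.txt' to the output base name itself. :)
  let $out := $dir || replace($name, '\.png$', '')
  return function() {
    (: Returns a <result/> element with the output, error and exit code. :)
    proc:execute('tesseract', ($image, $out))
  }
)
```

The sketch does not limit how many processes run at once; that is left to BaseX. With very large image sets it may be safer to split the file list into batches and call fork-join once per batch.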


Christian Grün

10:57 a.m.

Hi Giuseppe,

Maybe you can consult the mailing list archive for more information on concurrent/parallel query processing; there has been a lot of discussion in the past (e.g. [1]).

Cheers, Christian

[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg12080.htm...



basex-talk@mailman.uni-konstanz.de
