Hi All,

I have hierarchical information encoded in an XML document that looks something like:

<organization entityID="1"/>
<organization entityID="2">
  <parent entityID="1"/>
</organization>
<organization entityID="3">
  <parent entityID="2"/>
</organization>
<organization entityID="4">
  <parent entityID="1"/>
</organization>
There are around 80,000 entries like this, and I regularly need to extract sub-hierarchies (see the commented-out version of the function below).
The commented-out version runs in about a minute using a modest amount of memory.
Hoping to take advantage of this commit, https://github.com/BaseXdb/basex/commit/ac86bbbc3cd1f71461ce94d803cab46f21e7eae7, I modified the function; the uncommented version tries to use xquery:fork-join in the 8.5.2 beta (the July 12th snapshot).
The parallelized one chews up the 3 GB of available memory and unceremoniously throws exceptions (Exception in thread "qtp198198276-19"), with the occasional:

java.lang.OutOfMemoryError: GC overhead limit exceeded
  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2073)
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp198198276-19"
java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "qtp198198276-14" java.lang.OutOfMemoryError: GC overhead limit exceeded

It runs for tens of minutes (perhaps more - I always kill the process).
Any ideas on what I can do to improve the situation?
Thanks in advance.
Cheers, -carl
declare function csd_bl:get_child_orgs($orgs, $org) {
  let $org_id := $org/@entityID
  return
    if (functx:all-whitespace($org_id)) then ()
    else
      let $c_orgs := $orgs[./parent[@entityID = $org_id]]
      let $t0 := trace($org_id, "creating func for ")
      let $t1 := trace(count($c_orgs), " func checks children: ")
      let $c_org_funcs :=
        for $c_org in $c_orgs
        return function() {
          ( trace($org_id, "executing child func for "),
            $c_org,
            csd_bl:get_child_orgs($orgs, $c_org) )
        }
      return xquery:fork-join($c_org_funcs)

  (:
  let $c_orgs :=
    if (functx:all-whitespace($org_id)) then ()
    else $orgs[./parent[@entityID = $org_id]]
  return
    for $c_org in $c_orgs
    let $t0 := trace($org_id, "processing children for ")
    return ($c_org, csd_bl:get_child_orgs($orgs, $c_org))
  :)
};
Hi Carl,
The parallelized one chews up the 3 GB of available memory and unceremoniously throws exceptions (Exception in thread "qtp198198276-19"), with the occasional:
My assumption is that you are creating a huge number of functions to be evaluated in parallel; have you already counted them?
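As a rough way to check, something like the following counts how many closures would be created for one start node (just a sketch, untested against your real data; it uses the element and attribute names from your sample above):

declare function local:count-funcs($orgs, $org) {
  (: your function creates one closure per child, recursively,
     so the total number of closures equals the number of descendants :)
  let $children := $orgs[parent/@entityID = $org/@entityID]
  return count($children) + sum(
    for $c in $children
    return local:count-funcs($orgs, $c)
  )
};

let $orgs :=
  <root>
    <organization entityID="1"/>
    <organization entityID="2"><parent entityID="1"/></organization>
    <organization entityID="3"><parent entityID="2"/></organization>
    <organization entityID="4"><parent entityID="1"/></organization>
  </root>/organization
return local:count-funcs($orgs, $orgs[@entityID = '1'])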
Cheers Christian
In this case it is in the neighborhood of 200, which isn’t too big. In another case, it would be on the order of 17,000 total. The functions are not all created at once - only as the query walks the hierarchy, whose depth is at most 8.
If there are other ideas as to how to optimize or parallelize this type of query, I would be happy to hear them.
Cheers, -carl
I guess I simply have too little information on your data and query. Do you think there’s any chance you could generate a self-contained example?
I wrote a little example query. It does something completely different from yours, but it generally shows that the parallelized evaluation of 10000 functions is no problem (on my machine, with 4 cores, the following query takes around 800 ms):
let $xml :=
  <xml>
    <organization entityID="1"/>
    <organization entityID="2">
      <parent entityID="1"/>
    </organization>
    <organization entityID="3">
      <parent entityID="2"/>
    </organization>
    <organization entityID="4">
      <parent entityID="1"/>
    </organization>
  </xml>
let $f := function() {
  for $i in 1 to 100
  return count($xml/*/*/@*/../..)
}
return sum(
  xquery:fork-join((1 to 10000) ! $f)
)
Hi, Here is as small an example as I can get: https://github.com/litlfred/extractor
The source data set is organizations.xml.
There is a module that should be loaded, extractor.xqm, and then two scripts: extract_hieararchy.xq and extract_hieararchy-forked.xq
I appreciate your help.
Cheers, -carl
If I first load the organizations.xml into the database, it takes 25 seconds to run (both before and after I run optimize). If I run the extraction directly against the organizations.xml file on disk, it only takes 7 seconds.
Is that to be expected?
Cheers, -carl
Hi Carl,
I finally had a look at your query. The parallelized variant of your query was not 100% equivalent to the first one. The following version should do the job:
declare function extractor:get_child_orgs-forked($orgs, $org) {
  for $org_id in $org/@id
  for $c_orgs in $orgs[parent/@id = $org_id]
  return xquery:fork-join(
    for $c_org in $c_orgs
    return function() {
      $c_org, extractor:get_child_orgs-forked($orgs, $c_org)
    }
  )
};
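Just for completeness, a hypothetical invocation could look like the following; the module namespace URI here is only a placeholder (use whatever extractor.xqm actually declares), and the id is taken from your repository example:

import module namespace extractor = "extractor" at "extractor.xqm";

let $orgs := doc('organizations.xml')//organization
let $root := $orgs[@id = 'urn:uuid:a0c7c9cb-cdc4-4d24-b644-04dfcd45f9ea']
return extractor:get_child_orgs-forked($orgs, $root)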
If I first load the organizations.xml into the database, it takes 25 seconds to run (both before and after I run optimize). If I run the extraction directly against the organizations.xml file on disk, it only takes 7 seconds.
Is that to be expected?
Yes it is. The reason is that access to a database will always be a bit slower than memory access. You can explicitly convert database nodes to main-memory fragments by using the update keyword:
db:open('organization') update {}
…but that’s only advisable for smaller fragments and for ones that are accessed frequently.
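For example (a sketch only, following the update {} idiom above; 'organization' is the database name used above), you can bind the main-memory copy once and navigate on that copy afterwards:

(: copy the database content to a main-memory fragment and bind it;
   all subsequent navigation then happens in memory :)
let $orgs := (db:open('organization') update {})//organization
return count($orgs)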
Cheers Christian
Hi Christian,
Thanks for your help.
Running your version of the query does not exhaust memory as mine had; however, I don’t see the CPU usage going above slightly more than one available processor. Apart from an initial one-second spike, it runs at around 101% on a two-CPU machine - so it is not parallelized. If you change the node we are extracting (which will extract 17,000 nodes) in extract_hierarchy-forked.xq, you should be able to see this:
let $node := <organization id="urn:uuid:a0c7c9cb-cdc4-4d24-b644-04dfcd45f9ea"/>
Any ideas?
Cheers, -carl
Hi Carl,
Running your version of the query does not exhaust memory as mine had; however, I don’t see the CPU usage going above slightly more than one available processor.
Your initial fork query was creating an endless loop. In the current one, I assume that only one function will be created for each xquery:fork-join call. Maybe you need to spend some more time on the question of how your code could actually be forked in a recursive way at all?
What you probably want is an xquery:fork() function (without join), but we haven’t added such a function so far because it would be much more difficult to eventually join the results and preserve a sensible order.
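Just to illustrate one possible direction (a sketch only, not necessarily the right decomposition for your data): fork once over the direct children of the start node and walk each branch sequentially inside its function, so that fork-join actually receives several functions at once:

declare function local:walk($orgs, $org) {
  (: plain sequential recursion within one branch :)
  for $c in $orgs[parent/@id = $org/@id]
  return ($c, local:walk($orgs, $c))
};

declare function local:extract-forked($orgs, $root) {
  (: fork only here, once, over all direct children of the start node :)
  xquery:fork-join(
    for $c in $orgs[parent/@id = $root/@id]
    return function() { $c, local:walk($orgs, $c) }
  )
};

let $orgs :=
  <root>
    <organization id="1"/>
    <organization id="2"><parent id="1"/></organization>
    <organization id="3"><parent id="2"/></organization>
    <organization id="4"><parent id="1"/></organization>
  </root>/organization
return local:extract-forked($orgs, $orgs[@id = '1'])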
Cheers Christian
Hmm. Looking at your query:

01: declare function extractor:get_child_orgs-forked($orgs,$org) {
02:   for $org_id in $org/@id
03:   for $c_orgs in $orgs[parent/@id = $org_id]
04:   return xquery:fork-join(
05:     for $c_org in $c_orgs
06:     return function() {
07:       $c_org, extractor:get_child_orgs-forked($orgs, $c_org)
08:     }
09:   )
10: };
It seems that $c_orgs as defined in line 03 will be a single element. This means that when you get to line 05 we are looping over a single element and the enclosing fork-join will only be joining a single function/thread. Am I misreading that?
Cheers, -carl
It seems that $c_orgs as defined in line 03 will be a single element.
Exactly. Maybe it’s sufficient to rewrite line 3 as follows:
let $c_orgs := $orgs[parent/@id = $org_id]
where $c_orgs
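In context, the whole function would then read as follows (same names as above, intended as the body in extractor.xqm):

declare function extractor:get_child_orgs-forked($orgs, $org) {
  for $org_id in $org/@id
  let $c_orgs := $orgs[parent/@id = $org_id]
  where $c_orgs
  return xquery:fork-join(
    for $c_org in $c_orgs
    return function() {
      $c_org, extractor:get_child_orgs-forked($orgs, $c_org)
    }
  )
};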
Hi Carl,
Thanks again for your observation! Only now did I notice that xquery:fork-join did weird things when an empty sequence was specified as its argument.
This has been fixed [1]; with the latest snapshot [2], you can get rid of the "where $c_orgs" clause.
I tried your original fork query, and it now terminates (it should generate the same result as the unparallelized query after removing the trace() within the function, or replacing it with prof:dump).
Hope this helps, Christian
[1] https://github.com/BaseXdb/basex/commit/f7d8744e2760ed1531b48a9c9de92f3694e6...
[2] http://files.basex.org/releases/latest
Hi Christian, Thanks for your help and hints. Things are definitely working now.
Also thanks for the hint about loading the document into memory with update {}. Without that, the fork-join version had significantly degraded performance compared to the non-forked one.
On a further optimization note, I found that the overhead of a fork-join was only worthwhile in cases where the node we are extracting at has at least three children.
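In case it is useful to anyone else, the shape of what I ended up with is roughly the following (a simplified sketch, not the exact code in the repository; the cutoff of three is just the empirical value mentioned above):

declare function local:get_child_orgs($orgs, $org) {
  let $children := $orgs[parent/@id = $org/@id]
  return
    if (count($children) ge 3) then
      (: enough branches: the fork-join overhead pays off :)
      xquery:fork-join(
        for $c in $children
        return function() { $c, local:get_child_orgs($orgs, $c) }
      )
    else
      (: few branches: plain sequential recursion is cheaper :)
      for $c in $children
      return ($c, local:get_child_orgs($orgs, $c))
};

(: illustrative invocation; organizations.xml and the id follow the repository example :)
let $orgs := doc('organizations.xml')//organization
return local:get_child_orgs($orgs, $orgs[@id = 'urn:uuid:a0c7c9cb-cdc4-4d24-b644-04dfcd45f9ea'])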
One last question: do you have an expected timeline for when 8.5.2 will be released?
Again, thanks for your help.
Cheers, -carl
One last question: do you have an expected timeline for when 8.5.2 will be released?
…the release should be around the end of July.
Hi Christian,
Raising this general thread of conversation again: I seem to be running into some weird issues with recursion and fork-join(). I am getting some non-deterministic behavior running the hierarchy extractor function discussed earlier in this thread. There are no errors of note, but I think maybe some of the threads are silently failing.
Unfortunately I don’t have an SSCCE yet, as it is a bit difficult to reproduce. I am wondering if you might have any suggestions on how to debug this and if you are logging any information in case a thread dies. We are running 8.5.2.
Cheers, -carl
Hi Carl,
I am wondering if you might have any suggestions on how to debug this and if you are logging any information in case a thread dies.
You could play around with trace() or prof:variables(), or activate debugging via -d or SET DEBUG ON.
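For example, a minimal check that the forked functions are actually executed could look like this (a self-contained sketch; the trace output may be interleaved because the functions run in parallel):

xquery:fork-join(
  for $i in 1 to 4
  return function() { trace($i, 'running branch ') }
)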
We are running 8.5.2.
In BaseX 8.5.3, we have further improved thread-safety for fork-join queries. Could you give it a try?
Hope this helps Christian
Hi, I have encountered another issue. It occurs with the July 16 snapshot, when you do a fork-join where the forked functions contain a variable reference to another function. You can find an example here: https://github.com/litlfred/extractor/blob/master/fork-function-var.xq Note that this type of fork-join was working with the async module in 8.4.
Here are some of the error messages I was getting (they would change intermittently):

https://gist.github.com/litlfred/93411a892ddb4a5b5efa72130cd94d30
https://gist.github.com/litlfred/93dc73e7b0f050583da7bb79b5e31f5a
https://gist.github.com/litlfred/c546722e78cd4217be943e24eb868df0
Thanks in advance for your help.
Cheers, -carl
Hi Carl,
I have encountered another issue. It occurs with the July 16 snapshot, when you do a fork-join where the forked functions contain a variable reference to another function.
For some reason, it seems to work on my system. However, the id in two of the stack traces (010a30f) indicates to me that you were trying an older snapshot than the one from July 16 (the empty-sequence fix was only introduced with f7d8744 [1]). Could you once again check the latest version [2]?
Thanks in advance Christian
[1] https://github.com/BaseXdb/basex/commits/master
[2] http://files.basex.org/releases/latest/
Strange. I tried a clean install and it worked fine. I must have had some corruption somewhere, but I’m not sure where. Thanks! Cheers, -carl