Hi Christian,
Thanks for your help.
Running your version of the query does not exhaust memory as mine did; however, CPU usage barely exceeds a single processor. Apart from an initial one-second spike, it runs at around 101% on a two-CPU machine, so it is not parallelized. If you change the node we are extracting in extract_hierarchy-forked.xq (which will extract 17,000 nodes), you should be able to see this:
let $node := <organization id="urn:uuid:a0c7c9cb-cdc4-4d24-b644-04dfcd45f9ea"/>
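As a sanity check, I'd expect a trivial fork-join with CPU-bound branches to drive both cores. A minimal sketch (the workload below is an arbitrary placeholder, not part of the actual extraction):

(: arbitrary CPU-bound work per branch; if fork-join parallelizes,
   CPU usage should approach 200% on this two-CPU machine :)
xquery:fork-join(
  for $i in 1 to 2
  return function() {
    count((1 to 10000000)[. mod 7 = 0])
  }
)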
Any ideas?
Cheers, -carl
On Jul 15, 2016, at 4:43 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Carl,
I finally had a look at your query. Your parallelized variant was not 100% equivalent to the first one. The following version should do the job:
declare function extractor:get_child_orgs-forked($orgs, $org) {
  for $org_id in $org/@id
  for $c_orgs in $orgs[parent/@id = $org_id]
  return xquery:fork-join(
    for $c_org in $c_orgs
    return function() {
      $c_org,
      extractor:get_child_orgs-forked($orgs, $c_org)
    }
  )
};
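For illustration, it might be invoked along these lines (a sketch only; the file name and the root node are assumptions based on this thread):

(: assumed entry point: collect all organizations and start from one root :)
let $orgs := doc('organizations.xml')//organization
let $root := <organization id="urn:uuid:a0c7c9cb-cdc4-4d24-b644-04dfcd45f9ea"/>
return extractor:get_child_orgs-forked($orgs, $root)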
If I first load organizations.xml into the database, it takes 25 seconds to run (both before and after I run optimize). If I run the extraction directly against the organizations.xml file on disk, it only takes 7 seconds.
Is that to be expected?
Yes, it is. The reason is that access to a database will always be a bit slower than main-memory access. You can explicitly convert database nodes to main-memory fragments by using the update keyword:
db:open('organization') update {}
…but that's only advisable for smaller fragments, and for those that are frequently accessed.
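For example, a sketch along these lines (the database name and element names are assumptions) converts the fragment once and then queries the main-memory copy:

(: copy the database contents to main memory once, then reuse the copy :)
let $orgs := (db:open('organization') update {})//organization
return extractor:get_child_orgs-forked(
  $orgs,
  $orgs[@id = 'urn:uuid:a0c7c9cb-cdc4-4d24-b644-04dfcd45f9ea']
)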
Cheers, Christian