Hi Christian,

Thanks for your help.

Running your version of the query does not exhaust memory as mine had; however, I don’t see the CPU usage climb to much more than one processor. Except for an initial one-second spike, it runs at around 101% on a two-CPU machine, so it is not parallelized. If you change the node we are extracting (which will extract 17,000 nodes) in extract_hierarchy-forked.xq, you should be able to see this:

let $node := <organization id="urn:uuid:a0c7c9cb-cdc4-4d24-b644-04dfcd45f9ea"/>
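
For what it’s worth, here is a minimal sketch (mine, not from the script) to check whether xquery:fork-join parallelizes at all; it uses BaseX’s prof:sleep, and if forking works it should finish in roughly one second rather than two:

 (: two one-second sleeps: ~1s wall time means parallel, ~2s means serial :)
 xquery:fork-join(
   for $i in 1 to 2
   return function() { prof:sleep(1000) }
 )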

Any ideas?

Cheers,
-carl



On Jul 15, 2016, at 4:43 PM, Christian Grün <christian.gruen@gmail.com> wrote:

Hi Carl,

I finally had a look at your query. The parallelized variant was not
100% equivalent to the first one. The following version should do the
job:

 (: returns each child organization of $org, followed by all of its descendants :)
 declare function extractor:get_child_orgs-forked($orgs, $org) {
   for $org_id in $org/@id
   (: children are the organizations whose parent/@id matches this id :)
   for $c_orgs in $orgs[parent/@id = $org_id]
   return xquery:fork-join(
     for $c_org in $c_orgs
     return function() {
       (: emit the child itself, then recurse into its own children :)
       $c_org, extractor:get_child_orgs-forked($orgs, $c_org)
     }
   )
 };
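
A hypothetical invocation could look like this (assuming the extractor module is in scope and the organizations live in organizations.xml as //organization elements; the root id here is just a placeholder):

 let $orgs := doc('organizations.xml')//organization
 let $root := <organization id="urn:uuid:a0c7c9cb-cdc4-4d24-b644-04dfcd45f9ea"/>
 return extractor:get_child_orgs-forked($orgs, $root)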


If I first load organizations.xml into the database, it takes 25 seconds
to run (both before and after I run optimize). If I run the extraction
directly against the organizations.xml file on disk, it only takes 7 seconds.

Is that to be expected?

Yes it is. The reason is that access to a database will always be a
bit slower than access to main memory. You can explicitly convert
database nodes to main-memory fragments by using the update keyword:

 db:open('organization') update {}

…but that’s only recommendable for smaller fragments, and for ones
that are frequently accessed.
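
As a hypothetical sketch of that idea (assuming a database named 'organization' as above), you would materialize the fragment once and run all navigation against the in-memory copy:

 (: the empty update copies the database nodes into a main-memory fragment :)
 let $orgs := (db:open('organization') update {})//organization
 return count($orgs)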

Cheers
Christian