Maps that reference nodes are pointers, rather than copies. It sounds like you could map every drug name to every "interesting" XML node that contains it using grouping during map creation and then just iterate on the keys to process the nodes.
On Sun, Sep 2, 2018 at 4:52 PM Ron Katriel rkatriel@mdsol.com wrote:
Hi Christian,
As promised here is a summary of my experimentation. I replaced the expensive join with a map lookup and the program finished in 4 minutes vs. 1 hour using a naive loop over the two databases (the original 6 hours reported were due to overly aggressive virus scanning software, which I turned off for this benchmarking).
The downside of not using “contains text” inside the double loop (due to its slowness) is that I had to tokenize the CT.gov interventions and remove stopwords prior to looking them up in the DrugBank map. This is a subpar solution as some drugs are missed (looking up all the possible word combinations would be expensive).
It would be nice if there was a way to combine the matching flexibility of the “contains text” construct (with its myriad of options) and the efficiency of a map lookup but that may require a finite-state automaton such as the Aho–Corasick algorithm. If you are aware of any existing solutions I would appreciate your sharing them.
Thanks, Ron
On August 4, 2018 at 8:47:49 PM, Ron Katriel (rkatriel@mdsol.com) wrote:
Hi Christian,
Thanks for the advise. The BaseX engine is phenomenal so I realized quickly that the problem was performing a naive cross product.
Since this query is run only once a month (to serialize XML to CSV) and applied to new data (DB) each time, a BaseX map will likely be the most straightforward solution (I used the same idea for another project with good results).
I will not be able to implement and test this for another couple of weeks but will summarize my findings to the group as soon as possible.
Best, Ron
On Aug 4, 2018, at 6:00 AM, Christian Grün christian.gruen@gmail.com
wrote:
Hi Ron,
I believe the slow execution may be due to a combinatorial issue: the
cross product of 280,000 clinical trials and ~10,000 drugs in DrugBank (not counting synonyms).
Yes, this sounds like a pretty expensive operation. Having maps (XQuery, Java) will be much faster indeed.
As Gerrit suggested, and if you will run your query more than once, it would definitely be another interesting option to build an auxiliary, custom "index database" that allows you to do exact searches (this database may still have references to your original data sets). Since version 9 of BaseX, volatile hash maps will be created for looped string comparisons. See the following example:
let $values1 := (1 to 500000) ! string() let $values2 := (500001 to 1000000) ! string() return $values1[. = $values2]
Algorithmically, 500'000 * 500'000 string comparisons will need to be performed, resulting in a total of 250 billion operations (and no results). The runtime is much faster as you might expect (and, as far as I can judge, much faster than in any other XQuery processor).
Best, Christian