Hi Christian,

As promised, here is a summary of my experimentation. I replaced the expensive join with a map lookup and the program finished in 4 minutes, versus 1 hour for a naive loop over the two databases (the 6 hours originally reported were due to overly aggressive virus scanning software, which I turned off for this benchmarking).
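For the record, the map lookup follows this general shape (a sketch only, not my actual query; the database names and element names here are made up for illustration):

```xquery
(: Sketch: build a map from lowercased DrugBank names (and synonyms) to
   their drug entries, then probe it once per CT.gov intervention.
   'drugbank' and 'ctgov' and the element names are placeholders. :)
let $index := map:merge(
  for $drug in db:open('drugbank')//drug
  for $name in ($drug/name, $drug/synonyms/synonym)
  return map:entry(lower-case($name), $drug)
)
for $intervention in db:open('ctgov')//intervention_name
let $match := $index(lower-case($intervention))
where exists($match)
return <hit intervention="{$intervention}" drug="{$match/name}"/>
```

Building the map is a single pass over DrugBank, and each intervention then costs one hash probe instead of a scan of all ~10,000 drugs.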

The downside of dropping “contains text” inside the double loop (it was too slow) is that I had to tokenize the CT.gov interventions and remove stopwords before looking them up in the DrugBank map. This is a subpar solution, as some drugs are missed (looking up all possible word combinations would be expensive).
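Concretely, the token-based lookup looks roughly like this (a self-contained sketch; the stopword list, sample map entries, and sample IDs are illustrative, not my real data):

```xquery
(: Sketch: single-token lookup with stopword removal. Multi-word drug
   names are missed because only individual tokens are probed. The map
   entries and stopwords below are illustrative values. :)
let $drug-index := map { 'aspirin': 'DB00945', 'metformin': 'DB00331' }
let $stopwords := ('of', 'and', 'with', 'oral', 'tablet')
let $intervention := 'Oral Aspirin Tablet'
let $tokens := tokenize(lower-case($intervention), '[\s,()/]+')
return $tokens[not(. = $stopwords)][map:contains($drug-index, .)]
```

Here the single token “aspirin” is found, but an intervention like “folic acid” would never match a multi-word map key, which is exactly the gap described above.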

It would be nice if there were a way to combine the matching flexibility of the “contains text” construct (with its myriad options) with the efficiency of a map lookup, but that may require a finite-state automaton such as the one built by the Aho–Corasick algorithm. If you are aware of any existing solutions, I would appreciate your sharing them.

Thanks,
Ron

On August 4, 2018 at 8:47:49 PM, Ron Katriel (rkatriel@mdsol.com) wrote:

Hi Christian,

Thanks for the advice. The BaseX engine is phenomenal, so I quickly realized that the problem was the naive cross product.

Since this query is run only once a month (to serialize XML to CSV) and is applied to new data (DB) each time, a BaseX map will likely be the most straightforward solution (I used the same idea on another project with good results).

I will not be able to implement and test this for another couple of weeks but will summarize my findings to the group as soon as possible.

Best,
Ron


> On Aug 4, 2018, at 6:00 AM, Christian Grün <christian.gruen@gmail.com> wrote:
>
> Hi Ron,
>
>> I believe the slow execution may be due to a combinatorial issue: the cross product of 280,000 clinical trials and ~10,000 drugs in DrugBank (not counting synonyms).
>
> Yes, this sounds like a pretty expensive operation. Having maps
> (XQuery, Java) will be much faster indeed.
>
> As Gerrit suggested, and if you will run your query more than once, it
> would definitely be another interesting option to build an auxiliary,
> custom "index database" that allows you to do exact searches (this
> database may still have references to your original data sets). Since
> version 9 of BaseX, volatile hash maps will be created for looped
> string comparisons. See the following example:
>
> let $values1 := (1 to 500000) ! string()
> let $values2 := (500001 to 1000000) ! string()
> return $values1[. = $values2]
>
> Algorithmically, 500'000 * 500'000 string comparisons will need to be
> performed, resulting in a total of 250 billion operations (and no
> results). The runtime is much faster than you might expect (and, as far
> as I can judge, much faster than in any other XQuery processor).
>
> Best,
> Christian