On August 4, 2018 at 8:47:49 PM, Ron Katriel (rkatriel@mdsol.com) wrote:
Hi Christian,
Thanks for the advice. The BaseX engine is phenomenal, so I quickly realized that the problem was my performing a naive cross product.
Since this query is run only once a month (to serialize XML to CSV) and applied to new data (DB) each time, a BaseX map will likely be the most straightforward solution (I used the same idea for another project with good results).
I will not be able to implement and test this for another couple of weeks but will summarize my findings to the group as soon as possible.
Best,
Ron
> On Aug 4, 2018, at 6:00 AM, Christian Grün <christian.gruen@gmail.com> wrote:
>
> Hi Ron,
>
>> I believe the slow execution may be due to a combinatorial issue: the cross product of 280,000 clinical trials and ~10,000 drugs in DrugBank (not counting synonyms).
>
> Yes, this sounds like a pretty expensive operation. Using maps
> (XQuery, Java) will indeed be much faster.
>
> As Gerrit suggested, and if you will run your query more than once, it
> would definitely be another interesting option to build an auxiliary,
> custom "index database" that allows you to do exact searches (this
> database may still have references to your original data sets). Since
> version 9 of BaseX, volatile hash maps are created for looped
> string comparisons. See the following example:
>
> let $values1 := (1 to 500000) ! string()
> let $values2 := (500001 to 1000000) ! string()
> return $values1[. = $values2]
>
> Algorithmically, 500'000 * 500'000 string comparisons will need to be
> performed, resulting in a total of 250 billion operations (and no
> results). The runtime is much faster than you might expect (and, as far
> as I can judge, much faster than in any other XQuery processor).
>
> Best,
> Christian
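
For reference, a minimal sketch of the map-based lookup discussed in this thread, as an alternative to the naive cross product. It is not the query that was actually run: the element names (drug, name, synonym, trial, intervention), the inline sample data, and the case-insensitive matching are all assumptions made for illustration, and the real DrugBank and clinical trial documents are structured differently. It relies only on the standard XQuery 3.1 map functions.

```xquery
(: Hypothetical sample data; element names are assumptions, not the real schemas. :)
let $drugs :=
  <drugs>
    <drug><name>Aspirin</name><synonym>Acetylsalicylic acid</synonym></drug>
    <drug><name>Ibuprofen</name></drug>
  </drugs>
let $trials :=
  <trials>
    <trial id="T1"><intervention>aspirin</intervention></trial>
    <trial id="T2"><intervention>Placebo</intervention></trial>
    <trial id="T3"><intervention>ACETYLSALICYLIC ACID</intervention></trial>
  </trials>

(: Build the lookup map once: lower-cased name or synonym -> canonical drug name. :)
let $lookup := map:merge(
  for $drug in $drugs/drug
  for $label in ($drug/name, $drug/synonym)
  return map:entry(lower-case($label), string($drug/name))
)

(: One map lookup per intervention instead of a scan over all drugs. :)
for $trial in $trials/trial
let $hit := $lookup(lower-case($trial/intervention))
where exists($hit)
return string-join(($trial/@id, $hit), ',')
```

With the sample data above, this returns one CSV-style row each for T1 and T3, both matched to Aspirin, and nothing for T2. The map is built in a single pass over the drugs, so the work grows roughly with the number of trials plus the number of drug labels rather than with their product.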