I have loaded all the PubMed baseline XML records into a series of 20 BaseX
databases, 55 or 56 of the baseline files per database. Each database is
about 12GB in size and has between 520 and 530 million nodes in its 55 or 56
documents. Text, Token, and Attribute indices are enabled, but with 6GB RAM
allocated to the Java VM, BaseX would not create a full-text index. Each of
the 55 or 56 documents has 30000 article records in it under a root
PubmedArticleSet element.
I typically use basexgui for interactive work and basex for scripted loads
& queries, and I allocate 6G to the Java VM in each case:
BASEX_JVM="-Xmx6g $BASEX_JVM"
I'm exploring ways to search the data in a moderately performant way,
starting with the relatively simple lookup by PubMed ID:
/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID[text()=$pmid]]
I generated a series of five index XML files that pair PubMed IDs with
database names, like so:
<index>
  <entry>
    <dbname>pmed_baseline_a</dbname>
    <pmid>579614</pmid>
  </entry>
  …
</index>
Each file contains entries for four of the 20 BaseX databases. I loaded the
five files into a single database.
My hope was that I could quickly look up the name of the database that
contains a record by that record's PMID, then open that collection and
quickly obtain the record, but it isn't working the way I had hoped.
If I query the index database by PMID, I get the answer in 156ms:
let $pmid := '22345065'
let $icoll := collection('idx_pmed_baseline')
let $pmid_lookup := $icoll/index/entry/pmid
let $entry := $pmid_lookup[text()=$pmid]
let $dbname := $entry/parent::entry/dbname/text()
return $dbname
(: returns 'pmed_baseline_s' :)
If I open that collection and query it by XPath for that PMID, I also get
the answer quickly, in about 420ms:
let $coll := collection('pmed_baseline_s')
let $pmid := '22345065'
let $wanted :=
$coll/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID/text()=$pmid]
return $wanted
(: returns desired XML record :)
But if I combine the code to run in a single execution, it takes about 40s:
let $pmid := '22345065'
let $icoll := collection('idx_pmed_baseline')
let $pmid_lookup := $icoll/index/entry/pmid
let $entry := $pmid_lookup[text()=$pmid]
let $dbname := $entry/parent::entry/dbname/text()
let $coll := collection($dbname)
let $wanted :=
$coll/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID/text()=$pmid]
return $wanted
I feel like I must be doing some simple thing wrong, but the only
difference I see in my code between the two separate steps and the single
execution version is that I'm passing the db name in a variable instead of
as a string literal to the collection() function, and I'm running the whole
thing in a single execution.
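One variant that might help test this suspicion is to call the text index
explicitly with db:text() instead of relying on the optimizer to rewrite the
path expressions. The following is only a minimal sketch, not something I
have measured. It assumes that the text index is present in both the index
database and the baseline databases, and that a database name known only at
run time is what keeps the combined query from using the index (an
assumption, not a confirmed diagnosis):

```xquery
let $pmid := '22345065'
(: Look up the matching text node through the text index of the index
   database, then step up to the enclosing entry to read the database name. :)
let $dbnames :=
  db:text('idx_pmed_baseline', $pmid)/parent::pmid/parent::entry/dbname/string()
(: Zero, one, or several database names may come back, so iterate. :)
for $db in $dbnames
return
  (: The same PMID string may occur in other elements of a record, so keep
     only text nodes that sit in MedlineCitation/PMID of a PubmedArticle. :)
  db:text($db, $pmid)/parent::PMID/parent::MedlineCitation/parent::PubmedArticle
```

If this variant runs in well under a second while the combined query above
still takes about 40s, that would suggest the slow version is scanning the
baseline database rather than using its text index.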
Note that before each execution of an XQuery, I exited basexgui and
restarted it, to rule out at least any in-memory caching effects.
The VM I'm running on is modest (spinning drives in RAID 1, four modest AMD
CPU cores, dynamic memory growth up to 32GB). But these factors would not
explain the difference in speed between the two steps in separate
executions and both steps in a single execution.
Can anyone point out what I'm doing wrong? And is there a better way to go
about this?
Many thanks & all the best,
Chuck