(no subject)


Charles Bearden

28 Mar 2022

5:59 p.m.

I have loaded all the PubMed baseline XML records into a series of 20 BaseX databases, with 55 or 56 of the baseline files per database. Each database is about 12GB in size and has between 520 and 530 million nodes across its 55 or 56 documents. Text, Token, and Attribute indexes are enabled, but with 6GB RAM allocated to the Java VM, BaseX would not create a full-text index. Each document has 30000 article records in it under a root PubmedArticleSet element.

I typically use basexgui for interactive work and basex for scripted loads & queries, and I allocate 6G to the Java VM in each case:

BASEX_JVM="-Xmx6g $BASEX_JVM"

I'm exploring ways to search the data in a moderately performant way, starting with the relatively simple lookup by PubMed ID:

/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID[text()=$pmid]]

I generated a series of five index XML files that pair PubMed IDs with database names, like so:

<index>
  <entry>
    <dbname>pmed_baseline_a</dbname>
    <pmid>579614</pmid>
  </entry>
  …
</index>

Each file contains entries for four of the 20 BaseX databases. I loaded the five files into a single database.
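The post does not say how the index files were produced. As a rough sketch only, entries of this shape could be generated in XQuery from one database at a time; the database name below is just the one from the example entry, and the element path is taken from the lookup query above.

```xquery
(: Sketch only: 'pmed_baseline_a' stands in for each of the 20 database names. :)
let $dbname := 'pmed_baseline_a'
return
  <index>{
    (: One entry per article, pairing its PMID with the database it lives in. :)
    for $pmid in collection($dbname)/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID
    return
      <entry>
        <dbname>{ $dbname }</dbname>
        <pmid>{ $pmid/text() }</pmid>
      </entry>
  }</index>
```

Running this once per database and merging four results per file would give the layout described here.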

My hope was that I could quickly look up the name of the database that contains a record by that record's PMID, then open that collection and quickly obtain the record, but it isn't working the way I had hoped.

If I query the index database by PMID, I get the answer in 156ms:

let $pmid := '22345065'
let $icoll := collection('idx_pmed_baseline')
let $pmid_lookup := $icoll/index/entry/pmid
let $entry := $pmid_lookup[text()=$pmid]
let $dbname := $entry/parent::entry/dbname/text()
return $dbname (: returns 'pmed_baseline_s' :)

If I open that collection and query it by XPath for that PMID, I also get the answer quickly, in about 420ms:

let $coll := collection('pmed_baseline_s')
let $pmid := '22345065'
let $wanted := $coll/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID/text()=$pmid]
return $wanted (: returns desired XML record :)

But if I combine the code to run in a single execution, it takes about 40s:

let $coll := collection($dbname)
let $wanted := $coll/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID/text()=$pmid]
return $wanted

I feel like I must be doing some simple thing wrong, but the only difference I see in my code between the two separate steps and the single execution version is that I'm passing the db name in a variable instead of as a string literal to the collection() function, and I'm running the whole thing in a single execution.

Note that before each execution of an XQuery, I exited basexgui and restarted it to avoid any caching effect in memory at least.

The VM I'm running on is modest (spinning drives in RAID 1, four modest AMD CPU cores, dynamic memory growth up to 32GB). But these factors would not explain the difference in speed between the two steps in separate executions and both steps in a single execution.

Can anyone point out what I'm doing wrong? And is there a better way to go about this?

Many thanks & all the best, Chuck


ETANCHAUD Fabrice

29 Mar

3:52 a.m.

Hi Charles, if I remember correctly, it is because of the dynamic call to collection(): when you call collection('my static db name'), the parser can rewrite the query to use an index, but not when you call collection($my_dynamic_db_name).

I wonder if you could get better results using the db:text() function. Could you give something like this a try?

let $pmid := '22345065'
let $dbname := db:text('idx_pmed_baseline', $pmid)/parent::pmid/../dbname
return db:text($dbname, $pmid)/parent::PMID/ancestor::PubmedArticle
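A slightly stricter variant of the same idea, offered as a sketch rather than a tested solution. It assumes the text index exists on every database involved, and that a PMID element may also occur elsewhere inside a record, so it walks up the exact path MedlineCitation/PMID used in the original query instead of using ancestor::. The for clause also copes with a PMID that has no index entry (nothing is returned) or more than one.

```xquery
(: Sketch only: assumes text indexes exist on all databases involved. :)
let $pmid := '22345065'
(: Step 1: look up the database name(s) in the index database. :)
for $dbname in db:text('idx_pmed_baseline', $pmid)/parent::pmid/../dbname/string()
(: Step 2: fetch the record; only a PMID directly under MedlineCitation qualifies. :)
return db:text($dbname, $pmid)
  /parent::PMID/parent::MedlineCitation/parent::PubmedArticle
```

Because db:text() addresses the index directly, the lookup should not depend on whether the database name is a literal or a variable.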

Best regards, Fabrice

________________________________
From: BaseX-Talk basex-talk-bounces@mailman.uni-konstanz.de on behalf of Charles Bearden cfbmdacc@gmail.com
Sent: Monday, 28 March 2022 23:59
To: BaseX basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] (no subject)