Selecting nodes in a db by source file name

List overview All Threads
Download

newer

older

ODBC support for BaseX?

Checking run time characteristics...

Charles Bearden

18 May 2022 18 May '22

3:23 p.m.

Let's say I have a database built from 40+ XML documents, each with about 4 million nodes for a total of about 404 million nodes, of which 30k are high-level records of interest. Let's say that a given node of interest (a PubMed record, actually) may occur once in this database for each time it has been updated, and I can identify the most recent version of that record by the PubMed ID and the filename (an ascending sort of the filenames sorts them from least recent to most recent).

Is there a way in XQuery I can identify a record in this database not only by its PubMed ID but also limit the result set by source document name? I know that `db:path($elem)` gives me the document name for an element, but I want to use the document name to help pick out the element from among several that may share its PMID.

Context: I have 26 such databases, each of which contains 43-44 source files, each of which contains 30000 records. I have built one index database that, for each record, includes PubMed ID, source filename, and database name. For a given PMID I can very quickly retrieve the db name and the file name, but I need to search the db not only by PMID but also by source filename in order to select the most recent version of a record.

Thank you & all the best, Chuck Bearden

Attachments:

attachment.html (text/html — 1.4 KB)

Show replies by date

Graydon

18 May 18 May

5:09 p.m.

On Wed, May 18, 2022 at 02:23:54PM -0500, Charles Bearden scripsit:

...

Is there a way in XQuery I can identify a record in this database not only by its PubMed ID but also limit the result set by source document name? I know that `db:path($elem)` gives me the document name for an element, but I want to use the document name to help pick out the element from among several that may share its PMID.

db:path($elem)[local:test-function(.)]

is the usual pattern for this, but it sounds like you're really trying to index into this mass of records.

That makes me think of a map with a composite key -- concat($PMIDval,$db:path($elem)) -- and doing updates via map:merge. If the value for the individual map entry is the result of a path expression, you should get what amounts to a pointer to the node in the db, making this approach fast for retrieval.

That's not as good as a way to get a custom index the uses the document-uri() property of the document nodes (or possibly file:name() on the document-uri() property), but someone else certainly knows more about the feasability of that appraoch than I do.

-- Graydon Saunders | graydonish@gmail.com Þæs oferéode, ðisses swá mæg. -- Deor ("That passed, so may this.")

Tamara Marnell

5:59 p.m.

Hi Charles,

If you're just looking to filter results by a known filename, you can do that right on the results of db:open() or collection(), followed by the rest of your XPath and filters.

for $myElement in db:open('myDB')[db:path(.)='myFile1.xml']/myRoot[@PMID='1234']/myElement return $myElement/text()

-Tamara

On Wed, May 18, 2022 at 2:09 PM Graydon graydonish@gmail.com wrote:

...

On Wed, May 18, 2022 at 02:23:54PM -0500, Charles Bearden scripsit:

...
Is there a way in XQuery I can identify a record in this database not

only

...
by its PubMed ID but also limit the result set by source document name? I know that `db:path($elem)` gives me the document name for an element,

but I

...
want to use the document name to help pick out the element from among several that may share its PMID.

db:path($elem)[local:test-function(.)]

is the usual pattern for this, but it sounds like you're really trying to index into this mass of records.

That makes me think of a map with a composite key -- concat($PMIDval,$db:path($elem)) -- and doing updates via map:merge. If the value for the individual map entry is the result of a path expression, you should get what amounts to a pointer to the node in the db, making this approach fast for retrieval.

That's not as good as a way to get a custom index the uses the document-uri() property of the document nodes (or possibly file:name() on the document-uri() property), but someone else certainly knows more about the feasability of that appraoch than I do.

-- Graydon Saunders | graydonish@gmail.com Þæs oferéode, ðisses swá mæg. -- Deor ("That passed, so may this.")

-- Tamara Marnell Program Manager, Systems Orbis Cascade Alliance (orbiscascade.org https://www.orbiscascade.org/) Pronouns: she/her/hers

Christian Grün

19 May 19 May

1:39 a.m.

Hi Charles,

...

Is there a way in XQuery I can identify a record in this database not only by its PubMed ID but also limit the result set by source document name?

If the source document name equals the document path in the database, you can supply it as second argument to db:open [1]:

db:open($db, $path)

You could combine this query with an id check:

db:open($db, $path)//...[pmid = $pmid]

...

I have 26 such databases, each of which contains 43-44 source files, each of which contains 30000 records. I have built one index database that, for each record, includes PubMed ID, source filename, and database name. For a given PMID I can very quickly retrieve the db name and the file name, but I need to search the db not only by PMID but also by source filename in order to select the most recent version of a record.

If all document versions occur in the same database, and if the number of versions is limited, a query could look as follows:

let $path := '...' let $pmid := '...' let $db := (: ...retrieved from index database :) let db:open($db, $path)

Hope this helps, Christian

[1] https://docs.basex.org/wiki/Database_Module#db:open

Charles Bearden

12:45 p.m.

Thanks to Graydon, Tamara, and Christian for responding!

I figured out a pretty fast way to exploit the infrastructure I had built (the files allocated out into many databases and a single index database generated from the databases).

Here is a sample record from my index database:

*<entry> <dbname>pmed_updates_b</dbname> <pmid>34239076</pmid> <version>1</version> <path>pubmed22n1145.xml</path> <date_revised>2022-01-09</date_revised></entry>*

As it happens, there are eight versions of this record scattered across 7 of the component databases and located in 8 input files (two of the input files were allocated to one of the databases). Each of these instances has an entry in the index database.

My approach has four steps:

1. retrieve all entries from the index database that have the desired PMID; 2. convert the sequence of XML entries into a sequence of maps with the same data, ordering by filename descending, so that the most recent file is the first element of the sequence; 3. take the first item/map of the sequence; 4. look up all occurrences of records with that PMID in the database specified in the first item and call *db:path()* on each item and compare it to the filename specified in the most recent record; the record whose *db:path()* matches the item/map taken in step three is the most recent version of the record with that PMID.

Files are allocated by modulo to the different databases, so it is conceivable that a database will have more than one record with a given PMID, hence the necessity of comparing each record's path with the one given in the map from step three to determine which is the most recent.

Given the above PMID (for which there are eight versions of the record, as noted above) it took less than half a second to retrieve the most recent instance of that record out of over 35 million records.

I can post the XQuery if anyone wants to see it. It would take longer to document how I build the content & index databases, and I still have to work out the best way to keep it all up to date.

All the best, Chuck -- Sr Systems Analyst University of Texas M.D. Anderson Cancer Center

On Thu, May 19, 2022 at 12:39 AM Christian Grün christian.gruen@gmail.com wrote:

...

Hi Charles,

...
Is there a way in XQuery I can identify a record in this database not

only by its PubMed ID but also limit the result set by source document name?

If the source document name equals the document path in the database, you can supply it as second argument to db:open [1]:

db:open($db, $path)

You could combine this query with an id check:

db:open($db, $path)//...[pmid = $pmid]

...
I have 26 such databases, each of which contains 43-44 source files,

each of which contains 30000 records. I have built one index database that, for each record, includes PubMed ID, source filename, and database name. For a given PMID I can very quickly retrieve the db name and the file name, but I need to search the db not only by PMID but also by source filename in order to select the most recent version of a record.

If all document versions occur in the same database, and if the number of versions is limited, a query could look as follows:

let $path := '...' let $pmid := '...' let $db := (: ...retrieved from index database :) let db:open($db, $path)

Hope this helps, Christian

[1] https://docs.basex.org/wiki/Database_Module#db:open

Bridger Dyson-Smith

2:32 p.m.

Charles -

On Thu, May 19, 2022 at 12:46 PM Charles Bearden cfbmdacc@gmail.com wrote:

...

Thanks to Graydon, Tamara, and Christian for responding!

I figured out a pretty fast way to exploit the infrastructure I had built (the files allocated out into many databases and a single index database generated from the databases).

Here is a sample record from my index database:

*<entry> <dbname>pmed_updates_b</dbname> <pmid>34239076</pmid> <version>1</version> <path>pubmed22n1145.xml</path> <date_revised>2022-01-09</date_revised></entry>*

As it happens, there are eight versions of this record scattered across 7 of the component databases and located in 8 input files (two of the input files were allocated to one of the databases). Each of these instances has an entry in the index database.

My approach has four steps:

retrieve all entries from the index database that have the desired

PMID; 2. convert the sequence of XML entries into a sequence of maps with the same data, ordering by filename descending, so that the most recent file is the first element of the sequence; 3. take the first item/map of the sequence; 4. look up all occurrences of records with that PMID in the database specified in the first item and call *db:path()* on each item and compare it to the filename specified in the most recent record; the record whose *db:path()* matches the item/map taken in step three is the most recent version of the record with that PMID.

Files are allocated by modulo to the different databases, so it is conceivable that a database will have more than one record with a given PMID, hence the necessity of comparing each record's path with the one given in the map from step three to determine which is the most recent.

Very neat. I had a thought that `db:list-details()`, specifically the 2nd

signature, would be useful here but now that I've 1) read your solution, and 2) tried to play with some examples, I don't think it would be a very helpful fit.

...

Given the above PMID (for which there are eight versions of the record, as noted above) it took less than half a second to retrieve the most recent instance of that record out of over 35 million records.

I can post the XQuery if anyone wants to see it. It would take longer to document how I build the content & index databases, and I still have to work out the best way to keep it all up to date.

Selfishly, I'd be very interested in seeing examples but don't put

yourself through any trouble.

All the best,

...

Chuck

...

Sr Systems Analyst University of Texas M.D. Anderson Cancer Center

Thanks for the interesting example. Best, Bridger

1156

Age (days ago)

1157

Last active (days ago)

basex-talk@mailman.uni-konstanz.de

5 comments

5 participants

tags (0)

participants (5)

Bridger Dyson-Smith
Charles Bearden
Christian Grün
Graydon
Tamara Marnell