Charles -

On Thu, May 19, 2022 at 12:46 PM Charles Bearden <cfbmdacc@gmail.com> wrote:
Thanks to Graydon, Tamara, and Christian for responding!

I figured out a pretty fast way to exploit the infrastructure I had built (the files allocated across many databases, plus a single index database generated from those databases).

Here is a sample record from my index database:

<entry>
  <dbname>pmed_updates_b</dbname>
  <pmid>34239076</pmid>
  <version>1</version>
  <path>pubmed22n1145.xml</path>
  <date_revised>2022-01-09</date_revised>
</entry>


As it happens, there are eight versions of this record scattered across seven of the component databases and located in eight input files (two of the input files were allocated to one of the databases). Each of these instances has an entry in the index database.

My approach has four steps:
  1. retrieve all entries from the index database that have the desired PMID;
  2. convert the sequence of XML entries into a sequence of maps with the same data, ordering by filename descending, so that the most recent file is the first element of the sequence;
  3. take the first item/map of the sequence;
  4. look up all records with that PMID in the database named in the first item, call db:path() on each, and compare the result to the filename given in that item; the record whose db:path() matches is the most recent version of the record with that PMID.

Files are allocated to the databases by modulo, so a single database may hold more than one record with a given PMID. That is why each record's path has to be compared with the one in the map from step three to determine which is the most recent.
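
The four steps above might be sketched in XQuery roughly as follows. This is not the query from the original post, only an illustration under stated assumptions: the index database name `pmed_index` is hypothetical, the `PubmedArticle/MedlineCitation/PMID` path is assumed from the usual PubMed XML layout, and db:path() is assumed to return exactly the filename stored in <path> (that is, the files were added without a directory prefix).

```xquery
(: Hypothetical names: the index database 'pmed_index' and the PubMed
   element paths are assumptions, not taken from the original post. :)
declare variable $pmid as xs:string external := '34239076';

(: Steps 1-2: fetch the index entries for the PMID and turn each into a
   map, ordered by filename descending so the newest file comes first. :)
let $entries :=
  for $e in collection('pmed_index')//entry[pmid = $pmid]
  order by string($e/path) descending
  return map {
    'dbname': string($e/dbname),
    'path':   string($e/path)
  }
(: Step 3: the first map describes the most recent version. :)
let $latest := head($entries)
where exists($latest)
(: Step 4: in that database, keep only the record whose document
   path matches the filename from the index entry. :)
return collection($latest?dbname)//PubmedArticle
  [MedlineCitation/PMID = $pmid]
  [db:path(.) = $latest?path]
```

Sorting on the filename relies on the names (such as pubmed22n1145.xml) sorting in release order as plain strings; if that did not hold, ordering on <date_revised> would be an alternative.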

Very neat. I had a thought that `db:list-details()`, specifically the second signature, would be useful here, but now that I've 1) read your solution and 2) played with some examples, I don't think it would be a good fit.
 
Given the above PMID (for which there are eight versions of the record, as noted above) it took less than half a second to retrieve the most recent instance of that record out of over 35 million records.

I can post the XQuery if anyone wants to see it. It would take longer to document how I build the content & index databases, and I still have to work out the best way to keep it all up to date.

Selfishly, I'd be very interested in seeing examples but don't put yourself through any trouble.

All the best,
Chuck
--
Sr Systems Analyst
University of Texas M.D. Anderson Cancer Center

Thanks for the interesting example.
Best,
Bridger