Hello again,
I managed to test again today. Unfortunately I still observe the same performance problem in 9.3.2, also with the query that Christian supplied. I also tried in 9.3.3 snapshot - same performance loss as in 9.3.2. Still, everything is working fine in 9.2.4.
For reproducing the problem I assembled a package with all original XML files, the xQueries I execute and a description of the steps I follow (see README file in the package). As the XML-data are licenced under CC0 there should be no problem in sharing them with the community. You can download the whole package here (.zip file with ~150MB):
https://drive.google.com/open?id=1o09YZAqj5Y6ys3oE2tX8JRJ3GKoQ2xUr
I hope that helps tracking down the problem.
Best regards,
Michael
Mag. Michael Birkner
AK Wien - Bibliothek
1040, Prinz Eugen Straße 20-22
T: +43 1 501 65 12455
F: +43 1 501 65 142455
M: +43 664 88957669
michael.birkner@akwien.at
wien.arbeiterkammer.at
Visit us also on:
facebook | twitter | youtube
--------------------------------------------------
The AK has been standing up for justice for 100 years.
Then. Now. Forever.
arbeiterkammer.at/100
From: Christian Grün <christian.gruen@gmail.com>
Sent: Friday, 8 May 2020 14:24
To: BIRKNER Michael
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Performance loss between version 9.2.4 and 9.3.2 when executing specific xQuery
I’m always delighted to be confronted with a library use case. BaseX grew up with library data; at that time, mostly XML variants of MAB2.
I made another attempt to reproduce your setting by creating two databases with MARCXML data (rather small, with 10,000 and 10 documents, respectively). This is the query I tried:
let $recsFromDb1 := db:open('db1')//*:record
let $recsFromDb2 := db:open('db2')//*:record
let $idsFromRecsInDb1 := distinct-values($recsFromDb1/*:controlfield[@tag = '001'])
for $id in $idsFromRecsInDb1
let $recFromDb2WithSameId := $recsFromDb2[*:controlfield[@tag = '001'] = $id]
return $recFromDb2WithSameId
Both query plans and execution times are pretty much the same. Can you tell me what I need to change in my query to simulate the slowdown?
As a preview, I already have an idea how you can boost the query evaluation (provided your databases have up-to-date index structures)…
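For readers following along, here is a minimal sketch of what an explicit index lookup could look like. It is not necessarily the idea hinted at above. It assumes that db2 has an up-to-date text index and that the IDs are stored as plain text in the `001` control fields; `db:text` returns the text nodes with the given string value from that index.

```xquery
(: Sketch only: resolve each ID from db1 via the text index of db2.
   Assumes db2 was created with a text index that is up to date. :)
for $id in distinct-values(db:open('db1')//*:record/*:controlfield[@tag = '001'])
return db:text('db2', $id)/parent::*:controlfield[@tag = '001']/parent::*:record
```

Because the lookup is explicit, it does not depend on whether the optimizer rewrites the predicate for index access.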
On Fri, May 8, 2020 at 1:31 PM BIRKNER Michael <Michael.BIRKNER@akwien.at> wrote:
Hi Christian,
thank you for your answers. As you can guess, the queries I sent in my original email are just simplified examples.
The real XML structure is as follows (it's library data in MARCXML format; you can see an example here: https://www.loc.gov/standards/marcxml/Sandburg/sandburg.xml).
db1: each of the 7489 documents has this structure:
<collection>
  <record>
    <controlfield tag="001">ID-Number</controlfield>
    ... [more tags named "controlfield" or "datafield"]
  </record>
  ... [more records]
</collection>
So in db1 I have 7489 documents each with a "<collection><record>...</record></collection>" structure, so I have 7489 "collection" nodes.
db2: It's the same structure as above, but there is only 1 "collection" and all "records" are within that "collection".
Some background information:
In db1 I save updated versions of records that also (partly) exist in db2. I download them from an OAI-PMH interface, which gives me only 50 records at a time, so I have to page through the results; in the end I get 7489 XML files that I import into db1. As a result, there are multiple records with the same ID. Normally there are only 2 (the original and the updated one), but there can be 3 or more, because the downloaded updates can themselves contain multiple records with the same ID (an updated one, an update of the updated one, and so on ... I know ... complicated).
One of the records with the same ID is the newest one. My goal is to find the newest one and delete the others (based on a timestamp that is also found in another <controlfield> in the record). So all of this is about updating records in an existing database from downloaded update-files that I get via OAI.
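The goal described above could be sketched roughly as follows. This is only an illustration under assumptions: the timestamp is taken from control field `005` (the standard MARC 21 field for the date and time of the latest transaction; which field the real data uses is not stated here), its value is assumed to sort chronologically as a string, and every record is assumed to have at most one `001` field.

```xquery
(: Sketch only: for each ID, list every record except the newest one.
   Assumption: the timestamp is in controlfield 005 and sorts as a string. :)
let $records := (db:open('db1'), db:open('db2'))//*:record
for $rec in $records
group by $id := string($rec/*:controlfield[@tag = '001'])
let $sorted :=
  for $r in $rec
  order by string($r/*:controlfield[@tag = '005']) descending
  return $r
(: head($sorted) is the newest record; tail($sorted) holds the outdated ones :)
return tail($sorted)
```

As written, the query only returns the outdated records so they can be inspected first. Replacing the last line with `return delete node tail($sorted)` would turn it into an updating query that removes them.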
I hope this information helps. And thank you for pointing out the new version 9.3.3. I will try that one.
Best regards,
Michael
From: Christian Grün <christian.gruen@gmail.com>
Sent: Friday, 8 May 2020 12:37
To: BIRKNER Michael
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Performance loss between version 9.2.4 and 9.3.2 when executing specific xQuery
I tried to reproduce your use case by creating some sample data (with a few million entries), but both the query plan and the performance were similar in 9.2.4 and the current 9.3.3 beta version.
And I am still trying to understand your example query. Is it correct that the attributes of your exampletag elements have static ids, and that the text value of the exampletag element contains an id as well? If you can provide me with some example documents from your database, that might help us to track down the problem.
And feel free to check out the latest stable snapshot [1]. BaseX 9.3.3 is close, and lots of new optimizations and rewritings have been added since 9.3.2, so maybe the problem you encountered is already fixed.
On Fri, May 8, 2020 at 10:19 AM BIRKNER Michael <Michael.BIRKNER@akwien.at> wrote:
Hi,
I am observing a performance loss between BaseX versions 9.2.4 (which I was using so far) and 9.3.2 (to which I updated recently) when executing an xQuery like this:
---
(: Open 2 databases and get all <record>s :)
let $recsFromDb1 := db:open('db1')/record
let $recsFromDb2 := db:open('db2')/record
(: Get distinct IDs of all records in db1 :)
let $idsFromRecsInDb1 := distinct-values($recsFromDb1/exampletag[@exampleattr='id'])
(: Iterate over the distinct IDs of db1 and return the records from db2 with the same ID :)
for $id in $idsFromRecsInDb1
let $recFromDb2WithSameId := $recsFromDb2[exampletag[@exampleattr='id']=$id]
return $recFromDb2WithSameId
---
In BaseX version 9.2.4 the query executes very fast (2 - 3 seconds). In 9.3.2 I didn't wait for it to finish; I aborted after several minutes because I suspected that something must be wrong.
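As a possible workaround that does not rely on how the optimizer treats the predicate inside the loop, the same lookup can be written as an explicit hash join with a map. This is a minimal sketch using the simplified element names from the query above; it is an assumption that it behaves well on the real data, and it has not been measured against either version.

```xquery
(: Sketch only: collect the distinct IDs of db1 in a map, then filter db2 in one pass. :)
let $ids := map:merge(
  for $id in distinct-values(db:open('db1')/record/exampletag[@exampleattr = 'id'])
  return map:entry(string($id), true())
)
return db:open('db2')/record[
  some $e in exampletag[@exampleattr = 'id']
  satisfies map:contains($ids, string($e))
]
```

The matching records come back in the document order of db2 rather than in the order of the IDs from db1, which should not matter if they are only going to be compared or deleted afterwards.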
Both BaseX instances have allocated the same amount of memory (4096MB). The databases (db1 and db2) were created in the respective BaseX version from scratch and contain attribute and text indexes. They were optimized before executing the query mentioned above. All options and preferences are the same in both BaseX instances. I am using the GUI in Ubuntu 18.04.
Here are some more details about the two databases:
db1:
- Size: 2255MB
- Nodes: 97598775
- Documents: 7489
- Uptodate: true
db2:
- Size: 883MB
- Nodes: 46317512
- Documents: 1
- Uptodate: true
Does someone have an idea why there is such a difference in performance between the two BaseX versions?
Thanks for any answers and hints!
Best regards,
Michael