Hi Deepak,

My two cents advice :

I found efficient to :

Use a append only approach : instead of replacing a document, just add it to your collections. You will have to find how to get the latest version of a given logical document, based on a business key to be identified. For that purpose, you can create a kind of buffer (set of) collection(s) containing all the incoming documents.
On schedule, build new collections containing only the latest versions taken from the additive collections (a new backlog), empty your buffers, and restart.

Do you think this could be applied to your use case ?

Best regards from French west coast,

Fabrice.

De : BaseX-Talk <basex-talk-bounces@mailman.uni-konstanz.de> De la part de Deepak Dinakara
Envoyé : jeudi 28 décembre 2023 09:39
À : basex-talk@mailman.uni-konstanz.de
Objet : [basex-talk] Help - Regarding Performance Improvement

Ce mail provient d’un expéditeur extérieur à la MAIF. En cas de doute : ne répondez pas, ne cliquez pas sur les liens ou pièces jointes et signalez le message via les boutons « signaler » et/ou « signaler un hameçonnage ».

Hi,

Reaching out to get suggestions on improving performance.

Using basex to store and analyze around 350,000 to 500,000 XMLs.

Size of each XML varies between a few KBs to 5MB. Each day around 10k XMLs get added/patched.

I have the following queries

1) What is the optimal size or number of documents in a DB? Earlier I had 1 DB with different collections but inserts were too slow, took more than 30s just to replace a document. So split it up by some category to have around 30 DBs. Inserts are fine but again if there are too many documents in a category, patching that DB slows and querying across all DBs also gets slowed down. Any optimal number for DBs? Can I create many DBs like 1 for every 10K XMLs? I read through https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06310.html, having 100s of DBs cause query performance degradation? Is there any better solution?

2) Query performance has degraded with more documents in a DB. I also noticed that with/without token/attribute index, there is not much difference to query performance (they are just XML attribute queries). "Optimize" flag after inserts to recreate the index takes too much time and memory. I am not running it now since I didn't find significant improvement with/without index with my tests. Any suggestions for improving this?

3) Is it possible to just run queries against specific XMLs? I will have a pre-filter based on user selection and queries need to be run against only those XMLs. There are a number of filters users can apply and every time it can result in a different set of XMLs against which analysis has to be performed (Hence not feasible to create so many collections). Right now, I am querying against all XMLs even though I am interested only in a subset of XMLs and doing post filtering. I did go through https://mailman.uni-konstanz.de/pipermail/basex-talk/2010-July/000495.html, but again having a regex to include all the interested file paths(sometimes entire set of documents) will slow it down.

Thank you,

Deepak