Hi Deepak,
My two cents:
I found it efficient to:
Do you think this could be applied to your use case?
Best regards from the French west coast,
Fabrice.
From: BaseX-Talk <basex-talk-bounces@mailman.uni-konstanz.de>
On behalf of Deepak Dinakara
Sent: Thursday, 28 December 2023, 09:39
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Help - Regarding Performance Improvement
Hi,
Reaching out to get suggestions on improving performance.
We use BaseX to store and analyze around 350,000 to 500,000 XML documents.
Each XML ranges from a few KB to 5 MB, and around 10k XMLs are added or patched each day.
I have the following questions:
1) What is the optimal size or number of documents per DB? I initially had a single DB with several collections, but inserts were too slow: replacing a single document took more than 30 s. I then split the data by category into around 30 DBs. Inserts
are fine now, but when a category holds too many documents, patching that DB slows down, and querying across all DBs also degrades. Is there an optimal number of DBs? Could I create many DBs, e.g. one per 10K XMLs? I read through https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06310.html,
which suggests that hundreds of DBs cause query performance degradation. Is there a better solution?
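For context, a minimal sketch of how a query across several category databases might look in XQuery; the database names and the `//record[@status]` path are hypothetical placeholders, not anything from your setup:

```xquery
(: Run the same attribute predicate across a list of category databases.
   "orders-2023" / "invoices-2023" and the element/attribute names are
   placeholders for illustration only. :)
for $name in ("orders-2023", "invoices-2023")
for $doc in db:open($name)
return $doc//record[@status = 'open']
```

Each additional database adds per-database overhead to such a loop, which is one reason very fine-grained splits (e.g. one DB per 10K documents) tend to hurt cross-DB queries.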
2) Query performance has degraded as the number of documents per DB has grown. I also noticed that the token/attribute indexes make little difference to query performance (the queries are plain XML attribute queries). Running "Optimize" after inserts to
rebuild the indexes takes too much time and memory. I am not running it at the moment, since my tests showed no significant improvement with or without the indexes. Any suggestions for improving this?
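One option worth checking is whether the indexes are being dropped by updates and never rebuilt, in which case queries silently fall back to full scans. A hedged sketch (the database name "mydb" is a placeholder, and whether `updindex` is appropriate depends on your update pattern):

```xquery
(: Rebuild only outdated index structures; pass true() as the second
   argument to force a full rebuild (equivalent to OPTIMIZE ALL).
   UPDINDEX keeps the text/attribute indexes up to date incrementally
   on future updates, trading some insert speed for index freshness. :)
db:optimize("mydb", false(), map { 'updindex': true(), 'attrindex': true() })
```

It is also worth confirming in the query plan (e.g. the Info view in the BaseX GUI) that your attribute queries are actually rewritten to use the index; if they are not, the presence of the index makes no difference, which would match what you observed.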
3) Is it possible to run queries against only specific XMLs? I have a pre-filter based on user selection, and the queries need to run only against the matching XMLs. Users can apply many filters, and each combination can yield
a different set of XMLs to analyze, so creating a collection per combination is not feasible. Right now I query all XMLs, even though I am interested in only a subset, and post-filter the results. I did go through https://mailman.uni-konstanz.de/pipermail/basex-talk/2010-July/000495.html,
but a regex covering all the paths of interest (sometimes the entire document set) would again be slow.
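If the pre-filter can be resolved to a list of document paths, one alternative to a regex is to open only those documents directly; a minimal sketch, where `$paths`, the database name, and the element/attribute names are all placeholders:

```xquery
(: Restrict the analysis to a pre-filtered set of document paths.
   $paths would be produced by the user's filter selection. :)
let $paths := ("a/doc1.xml", "b/doc2.xml")
for $path in $paths
return db:open("mydb", $path)//record[@status = 'open']
```

This avoids scanning documents outside the selection, though it only helps when the filtered set is noticeably smaller than the whole database.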
Thank you,
Deepak