Hi,
Reaching out to get suggestions on improving performance. I am using BaseX to store and analyze around 350,000 to 500,000 XML documents. Each XML ranges from a few KB to 5 MB, and around 10,000 XMLs are added or patched every day. I have the following questions:

1) What is the optimal size, or number of documents, for a single DB? Initially I had one DB with different collections, but inserts were too slow: replacing a single document took more than 30 seconds. I then split the data by category into around 30 DBs. Inserts are fine now, but when a category holds too many documents, patching that DB slows down, and querying across all DBs also gets slow (a simplified sketch of my cross-DB query is included below, after question 3). Is there an optimal number of DBs? Could I create many small DBs, say one per 10,000 XMLs? I read through https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06310.htm...; does having hundreds of DBs degrade query performance? Is there a better solution?

2) Query performance has degraded as the number of documents per DB grows. I also noticed that the token and attribute indexes make little difference to query performance (the queries are plain XML attribute queries). Running OPTIMIZE after inserts to rebuild the indexes takes too much time and memory, so I am not running it at the moment, since my tests showed no significant improvement with or without the indexes (the optimize sketch below shows how I do this today). Any suggestions for improving this?

3) Is it possible to run queries against specific XMLs only? I will have a pre-filter based on user selection, and queries need to run only against those XMLs. Users can apply a number of filters, and every combination can produce a different set of XMLs to analyze, so it is not feasible to create a collection per combination. Right now I query against all XMLs, even though I am only interested in a subset, and do post-filtering (the last sketch below shows what I would like to do instead). I did go through https://mailman.uni-konstanz.de/pipermail/basex-talk/2010-July/000495.html, but a regex covering all the file paths of interest (sometimes the entire document set) will again slow it down.
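For reference, my cross-DB queries look roughly like the sketch below. The database prefix, element name and predicate are simplified placeholders for illustration, not my real schema:

    (: query every category database; there are ~30 of these today :)
    for $db in db:list()
    where starts-with($db, 'docs-')
    for $rec in db:open($db)//record[@status = 'open']
    return $rec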
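After a daily batch of inserts, rebuilding the indexes looks roughly like this (names are again illustrative; db:replace is the BaseX 9 function, newer versions use db:put). It is the optimize step that takes too much time and memory:

    (: replace the changed documents, then rebuild all indexes :)
    for $path in ('orders/0001.xml', 'orders/0002.xml')
    return db:replace('docs-01', $path, doc('/import/' || $path)),

    (: full optimize so the attribute/token indexes are up to date :)
    db:optimize('docs-01', true(),
      map { 'attrindex': true(), 'tokenindex': true() })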
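For question 3, this is roughly what I would like to do: open only the documents selected by the user's pre-filter and query just those. The paths, DB name and predicate are made up for illustration, and I am not sure whether this scales when the filter matches a large share of the documents:

    (: $selected comes from the user's pre-filter in the application :)
    let $selected := ('invoices/2024/0001.xml',
                      'invoices/2024/0002.xml')
    for $doc in $selected ! db:open('docs-01', .)
    return $doc//item[@state = 'open']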
Thank you, Deepak