Hi,
I'm currently evaluating BaseX for a project. I've read all the online documentation but the underlying storage and indexing mechanisms are still a bit of a mystery to me, so I'm having trouble making optimal decisions in designing a large collection of documents.
I have 10 million documents of moderate size. These are intended to be regularly replaced/updated.
I have the choice of storing each document individually in a collection, or inserting/updating into a single document. Which approach will generally perform better?
In an experiment, I found that after adding a few million documents, adding new documents got really slow. The JVM pegs at 100% CPU so it is doing a lot of work. What's going on here? Indexing? Would increasing the JVM memory help? Can indexing be disabled for bulk loads?
Rather than try random things to see what worked, I was hoping to get some insight into how the system stores, indexes and uses resources.
Many thanks,
Michael
Michael,
thanks for your mail, and sorry for the delay.
> I'm currently evaluating BaseX for a project. I've read all the online documentation but the underlying storage and indexing mechanisms are still a bit of a mystery to me, so I'm having trouble making optimal decisions in designing a large collection of documents.
Our publications include some information on the low-level structures of BaseX (“Storing and Querying Large XML Instances” will be the most relevant one in this context):
http://basex.org/about-us/publications/
> I have the choice of storing each document individually in a collection, or inserting/updating into a single document. Which approach will generally perform better?
In most cases, you'll get better performance if you insert or update nodes in one large document than if you add lots of tiny documents. Also, if you create a new database, it's always faster to add the documents in the same step than to add them afterwards.
> In an experiment, I found that after adding a few million documents, adding new documents got really slow. The JVM pegs at 100% CPU so it is doing a lot of work. What's going on here? Indexing? Would increasing the JVM memory help? Can indexing be disabled for bulk loads?
True; most of the time is spent updating the references to all document nodes. We've recently added a small GitHub issue related to this (although it is still poorly documented):
https://github.com/BaseXdb/basex/issues/137
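To picture why incremental adds slow down as the collection grows, here is a toy cost model in Python. It is not BaseX code and does not reflect the actual storage layer: it simply assumes that every single add revisits all existing document references, while creating the database in one step builds them once. The function names `incremental_add` and `bulk_create` are made up for this illustration.

```python
# Toy cost model, NOT BaseX code: count how many document references get
# touched under two hypothetical loading strategies.

def incremental_add(num_docs):
    """Add documents one by one; each add revisits every existing reference."""
    doc_refs = []
    touched = 0
    for doc_id in range(num_docs):
        doc_refs.append(doc_id)
        # Assumption: the references to all document nodes are refreshed
        # after each individual add.
        touched += len(doc_refs)
    return touched


def bulk_create(num_docs):
    """Load all documents in one step; references are built a single time."""
    doc_refs = list(range(num_docs))
    return len(doc_refs)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, incremental_add(n), bulk_create(n))
```

Under this assumption the incremental strategy touches n(n+1)/2 references in total (500,500 for 1,000 documents), while the single-step load touches n, which would match the slowdown you observed after a few million documents.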
> Rather than try random things to see what worked, I was hoping to get some insight into how the system stores, indexes and uses resources.
I hope that the links mentioned above give you the required background information. If not: this is open source, and the complete code base is freely available, so we're open to (and dependent on) any contributions that improve the state of the art.
Christian
basex-talk@mailman.uni-konstanz.de