Best approach for 10 million+ documents? - BaseX-Talk - mailman.uni-konstanz.de

13 Jul 2011


Hi,

I'm currently evaluating BaseX for a project. I've read all the online
documentation, but the underlying storage and indexing mechanisms are
still a bit of a mystery to me, so I'm having trouble making optimal
decisions in designing a large collection of documents.

I have 10 million documents of moderate size. These are intended to be
regularly replaced/updated.

I have the choice of storing each document individually in a collection,
or inserting/updating into a single document. Which approach will
generally perform better?

In an experiment, I found that after adding a few million documents,
adding new documents got really slow. The JVM pegs the CPU at 100%, so
it is doing a lot of work. What's going on here? Indexing? Would
increasing the JVM memory help? Can indexing be disabled for bulk loads?

Rather than try random things to see what works, I was hoping to get
some insight into how the system stores, indexes and uses resources.

Many thanks,
Michael