Re: [basex-talk] bulk-load - BaseX-Talk - mailman.uni-konstanz.de

16 Jul 2011


Dear Tomaso, dear Michael,
Today I had a closer look at the BaseX routines that are
responsible for adding new documents to the database, and I tweaked
the update code of our document index to avoid linear costs for adding
single documents. You are invited to check out the latest stable
snapshot [1] and give us your valuable feedback.
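To illustrate why linear per-document costs matter, here is a rough Java sketch of the arithmetic. It is only a toy cost model, not BaseX code: it assumes that the i-th ADD touches all i documents already in the index (one unit of work each), and the class and method names are made up for this example.

```java
public class AddCostModel {
    // Assumed model: the i-th ADD costs i units (linear in the current
    // number of documents), so n ADDs cost 1 + 2 + ... + n units in total.
    static long linearPerAdd(long n) {
        long total = 0;
        for (long i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    // Assumed model: every ADD costs one unit, independent of database size.
    static long constantPerAdd(long n) {
        return n;
    }

    public static void main(String[] args) {
        // Compare both models for a few hypothetical document counts.
        for (long n : new long[] {1_000, 10_000, 100_000}) {
            System.out.println(n + " documents: linear per ADD = "
                + linearPerAdd(n) + " units, constant per ADD = "
                + constantPerAdd(n) + " units");
        }
    }
}
```

Under this assumption, the total work for n single ADD calls grows roughly with n squared, which would match the observation that the times increase as more documents are added one by one.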
There are still some bottlenecks:
- The default XML parser takes some time to initialize, which is
particularly noticeable for small documents. You'll get some
performance boost by switching to the internal parser (Command: set
intparse; see [2] for details).
- As each BaseX database command is atomic, the data is flushed to
disk after each update to avoid data loss. To reduce this overhead,
you may either specify a directory on disk to add multiple files at
once, or choose to insert nodes instead of documents, which will give
you better performance.
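The two suggestions above could be combined roughly as follows. This is a minimal Java sketch that only assembles the BaseX command strings (SET INTPARSE, CREATE DB, ADD) and prints them; it does not call the BaseX API. The database name "bulk" and the directory "/path/to/xml" are placeholders, and the exact command syntax may differ between BaseX versions.

```java
import java.util.ArrayList;
import java.util.List;

public class BulkLoadScript {
    // Builds a short list of BaseX commands that load a whole directory
    // with a single ADD instead of one ADD per file.
    // dbName and inputDir are hypothetical values supplied by the caller.
    static List<String> build(String dbName, String inputDir) {
        List<String> commands = new ArrayList<>();
        commands.add("SET INTPARSE true");   // switch to the internal parser
        commands.add("CREATE DB " + dbName); // create an empty database
        commands.add("ADD " + inputDir);     // one atomic command for all files
        return commands;
    }

    public static void main(String[] args) {
        // Print the commands so they can be pasted into the BaseX console.
        for (String command : build("bulk", "/path/to/xml")) {
            System.out.println(command);
        }
    }
}
```

Since the whole directory is passed to a single ADD, the flush to disk should happen once per command rather than once per file.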
Hope this helps,
Christian
[1] http://files.basex.org/releases/latest/basex-6.7.1-SNAPSHOT.jar
[2] http://docs.basex.org/wiki/Parsers
___________________________
On Mon, Jul 4, 2011 at 8:22 PM, Christian Grün
<christian.gruen@gmail.com> wrote:
...
Hi Tomaso,
...
> From your answer I suppose there is something slower
> when Add() is called many times, and faster if we use CreateDb.
> Can you explain why the times increase? Is it because Add
> updates the index each time?
Exactly; as the ADD operation is an atomic operation, there is
currently no way to define a batch operation (other than going more
low level, and looking at the Add command [1]). It might be that some
update operations could be delayed though; I've added this as an issue
(feel free to include more details) [2].
Christian
[1] https://github.com/BaseXdb/basex/blob/master/src/main/java/org/basex/core/cm...
[2] https://github.com/BaseXdb/basex/issues/137