Hi,
we're using BaseX to store multiple collections of documents (we call them records).
These records are produced programmatically, by parsing an incoming stream on a server application and turning it into a document of the form
<record id="123" version="1"> ... </record>
So far I took the following approach:
- each collection of records is its own database in BaseX, for easier management
- on insertion:
  - set the session's AUTOFLUSH option to false
  - iterate over the records, adding each one via add(id, document)
  - every 10000 records, flush
  - finally, flush once more and create the attribute index
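The insertion steps above can be sketched as follows. The session interface mirrors the BaseX client API (execute/add); the database wiring, record source, and flush interval are illustrative, not taken from the actual application:

```python
# Sketch of the batched-insertion loop described above. `session` is any
# object exposing the BaseX-client-style execute(command) and
# add(path, content) methods; opening/closing it is left to the caller.

FLUSH_EVERY = 10000  # flush to disk every N added documents (illustrative)

def bulk_add(session, records, flush_every=FLUSH_EVERY):
    """Add (id, xml) pairs to the currently opened database in batches."""
    session.execute("set autoflush false")     # defer disk flushes
    count = 0
    for record_id, xml in records:
        session.add(str(record_id), xml)       # path = record id
        count += 1
        if count % flush_every == 0:
            session.execute("flush")           # flush a full batch
    session.execute("flush")                   # flush the final partial batch
    session.execute("create index attribute")  # rebuild the attribute index
    return count
```

Decoupling the loop from a concrete session object also makes the batching logic easy to unit-test with a stub.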
So for example now we have:
Name    Resources        Size  Input Path
-----------------------------------------
col1        14141    19815190
col2        14750    16697081
col3        84450   253593687
col4      1012477  2107593252
col5       126058   186315175
col6        13767    14640701
col7       815991   730536864
col8        31189    39598405
col9        24733    91277637
col10      171906   202392553
...
and there'll be quite a bit more coming in.
This kind of bulk insertion can also happen concurrently (I've set up an actor pool of five for the moment).
My questions are:
- is this the most performant approach, or would it make sense to e.g. build one stream on the fly and somehow turn it into an InputStream to be sent via add?
- is there a performance cost in adding with an ID? We don't really need the IDs, since we retrieve records via queries, and those resources aren't really files on the file system.
- is there a performance penalty in doing this kind of parsing concurrently?
- are there any JVM parameters that would help speed this up? I haven't quite found how to pass JVM parameters when starting basexserver via the command line. It looks like BaseX gave itself an -Xmx of 1866006528 bytes (roughly 1.7 GB), but that machine has 8 GB, so it could in theory get more.
Thanks!
Manuel
Hi Manuel,
thanks for your e-mail.
- is this the most performant approach, or would it make sense to e.g.
build one stream on the fly and somehow turn it into an inputstream to be sent via add?
I'd say that your approach is close to an optimal solution, as the ADD command is pretty cheap compared to e.g. REPLACE. If you believe that you could still run into some bottlenecks, you could have a look at, or provide us with, the output of Java's profiler (e.g. -Xrunhprof:cpu=samples).
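If you do want a profile, one way to pass the HPROF flag is to start the server class directly with java instead of the basexserver script. A sketch (the jar path is an assumption for your installation):

```shell
# Start BaseXServer directly so JVM flags can be passed on the command
# line; the jar location is illustrative.
java -Xrunhprof:cpu=samples -cp basex.jar org.basex.BaseXServer
# On JVM exit, HPROF writes its sampling report to java.hprof.txt.
```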
- is there a performance cost in adding with an ID? We don't really
need them since we retrieve records via a query - and those resources aren't really files on the file-system
Currently, there are no additional costs, as we don't check whether a document with the same name/path has already been stored.
- is there a performance penalty in doing this kind of parsing concurrently?
Concurrent operations will be managed by the central transaction manager. At the time of writing, all write operations are performed one after another, but in the near future, concurrent write operations on different databases will also be run in parallel.
- are there any JVM parameters that would help speed this up?
In general, Java will be faster when run with -server, but this option may have been chosen anyway by your Java runtime. Regarding the maximum amount of memory, there shouldn't be any noteworthy differences when adding documents.
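For completeness, the same direct-invocation trick works for memory and VM options. A sketch, where the 6g heap and the jar path are illustrative values for an 8 GB machine, not recommendations:

```shell
# Bypass the basexserver wrapper script to control JVM options directly;
# heap size and jar path are assumptions, adjust for your setup.
java -server -Xmx6g -cp basex.jar org.basex.BaseXServer
```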
Hope this helps, Christian
Hi Christian,
I'd say that your approach is close to an optimal solution, as the ADD command is pretty cheap compared to e.g. REPLACE. If you believe that you could still run into some bottlenecks, you could have a look at, or provide us with, the output of Java's profiler (e.g. -Xrunhprof:cpu=samples).
OK, I will look into this if we get bitten by performance issues (the larger collections do usually take a fair amount of time to insert, at least when running concurrently).
- is there a performance penalty in doing this kind of parsing concurrently?
Concurrent operations will be managed by the central transaction manager. At the time of writing, all write operations are performed one after another, but in the near future, concurrent write operations on different databases will also be run in parallel.
Excellent news. I noticed things were slowing down when we had multiple collections inserted at the same time, so this should probably help.
- are there any JVM parameters that would help speed this up?
In general, Java will be faster when run with -server, but this option may have been chosen anyway by your Java runtime. Regarding the maximum amount of memory, there shouldn't be any noteworthy differences when adding documents.
Hope this helps, Christian
Thanks!
Manuel
basex-talk@mailman.uni-konstanz.de