Hi,
we're using BaseX to store multiple collections of documents (we call
them records).
These records are produced programmatically, by parsing an incoming
stream in a server application and turning each entry into a document of
the form
<record id="123" version="1">
...
</record>
So far I have taken the following approach:
- each collection of records is its own database in BaseX, for easier management
- on insertion (sketched in code below):
  - set the session's autoflush to false
  - iterate over the records
  - add each one via add(id, document)
  - every 10,000 records, flush
  - finally, flush once more
  - create the attribute index
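For reference, that loop boils down to something like the following
simplified Java sketch against the client session API (host, port,
credentials and the (id, xml) pairing of records are placeholders, not
our actual code):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.basex.api.client.ClientSession;

public class BulkInsert {
  private static final int FLUSH_INTERVAL = 10_000; // flush every 10,000 records

  // records arrive as (id, xml) pairs from the parsed incoming stream
  static void insertInto(String db, List<String[]> records) throws Exception {
    try (ClientSession session =
        new ClientSession("localhost", 1984, "admin", "admin")) {
      session.execute("CHECK " + db);         // open the collection's database
      session.execute("SET AUTOFLUSH false"); // defer writes to disk

      int n = 0;
      for (String[] rec : records) {
        // rec[0] = record id (used as the resource path), rec[1] = the document
        session.add(rec[0],
            new ByteArrayInputStream(rec[1].getBytes(StandardCharsets.UTF_8)));
        if (++n % FLUSH_INTERVAL == 0) session.execute("FLUSH");
      }
      session.execute("FLUSH");                  // final flush
      session.execute("CREATE INDEX attribute"); // rebuild the attribute index
    }
  }
}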
So, for example, we now have:

Name    Resources  Size        Input Path
------------------------------------------
col1    14141      19815190
col2    14750      16697081
col3    84450      253593687
col4    1012477    2107593252
col5    126058     186315175
col6    13767      14640701
col7    815991     730536864
col8    31189      39598405
col9    24733      91277637
col10   171906     202392553
...
and there'll be quite a bit more coming in.
This kind of bulk insertion can also happen concurrently (I've set up
an actor pool of five for the moment); in outline it amounts to the
sketch below.
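Sketched with a plain ExecutorService standing in for the actual actor
pool (the grouping of records by collection is illustrative), the
concurrent setup is essentially:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentLoad {
  // records grouped by target collection, as produced by the parser
  static void load(Map<String, List<String[]>> recordsByCollection)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(5); // pool of five
    recordsByCollection.forEach((db, records) ->
        pool.submit(() -> {
          try {
            BulkInsert.insertInto(db, records); // the loop sketched above
          } catch (Exception e) {
            e.printStackTrace();
          }
        }));
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS); // wait for all loads to finish
  }
}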
My questions are:
- is this the most performant approach, or would it make sense to e.g.
build one stream on the fly and somehow turn it into an InputStream to
be sent via a single add?
- is there a performance cost in adding with an ID? We don't really
need the IDs, since we retrieve records via a query (e.g. on the id
attribute, something like /record[@id = '123']) - and those resources
aren't really files on the file system anyway
- is there a performance penalty in doing this kind of parsing concurrently?
- are there any JVM parameters that would help speed this up? I haven't
quite found out how to pass JVM parameters when starting basexserver
from the command line. It looks like BaseX gave itself an -Xmx of
1866006528 bytes (roughly 1.7 GB), but that machine has 8 GB, so it
could in theory get more.
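(For what it's worth, I would have guessed something along the lines of

  BASEX_JVM="-Xmx6g" basexserver

assuming the launch script picks up a variable like BASEX_JVM, or else
editing the -Xmx value in the basexserver script directly, but I haven't
verified either.)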
Thanks!
Manuel