Hi,
we're using BaseX to store multiple collections of documents (we call them records).
These records are produced programmatically, by parsing an incoming stream on a server application and turning it into a document of the form
<record id="123" version="1"> ... </record>
So far I took the following approach:
- each collection of records is its own database in BaseX, for easier management
- on insertion:
  - set the session's AUTOFLUSH option to false
  - iterate over the records, adding each one via add(id, document)
  - every 10000 records, flush
  - finally, flush once more and create the attribute index
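The insertion steps above can be sketched as follows. The session interface mirrors the BaseX client API (execute/add); the database wiring, record source, and flush interval are illustrative, not taken from the actual application:

```python
# Sketch of the batched-insertion loop described above. `session` is any
# object exposing the BaseX-client-style execute(command) and
# add(path, content) methods; opening/closing it is left to the caller.

FLUSH_EVERY = 10000  # flush to disk every N added documents (illustrative)

def bulk_add(session, records, flush_every=FLUSH_EVERY):
    """Add (id, xml) pairs to the currently opened database in batches."""
    session.execute("set autoflush false")     # defer disk flushes
    count = 0
    for record_id, xml in records:
        session.add(str(record_id), xml)       # path = record id
        count += 1
        if count % flush_every == 0:
            session.execute("flush")           # flush a full batch
    session.execute("flush")                   # flush the final partial batch
    session.execute("create index attribute")  # rebuild the attribute index
    return count
```

Decoupling the loop from a concrete session object also makes the batching logic easy to unit-test with a stub.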
So for example now we have:
Name    Resources        Size  Input Path
-----------------------------------------
col1        14141    19815190
col2        14750    16697081
col3        84450   253593687
col4      1012477  2107593252
col5       126058   186315175
col6        13767    14640701
col7       815991   730536864
col8        31189    39598405
col9        24733    91277637
col10      171906   202392553
...
and there'll be quite a bit more coming in.
This kind of bulk insertion can also happen concurrently (I've set up an actor pool of five for the moment).
My questions are:
- is this the most performant approach, or would it make sense to e.g. build one stream on the fly and somehow turn it into an InputStream to be sent via add?
- is there a performance cost in adding with an ID? We don't really need the IDs, since we retrieve records via queries, and those resources aren't really files on the file system.
- is there a performance penalty in doing this kind of parsing concurrently?
- are there any JVM parameters that would help speed this up? I haven't quite found how to pass JVM parameters when starting basexserver via the command line. It looks like BaseX gave itself an -Xmx of 1866006528 bytes (roughly 1.7 GB), but that machine has 8 GB, so it could in theory get more.
Thanks!
Manuel
Hi Manuel,
thanks for your e-mail.
- is this the most performant approach, or would it make sense to e.g.
build one stream on the fly and somehow turn it into an inputstream to be sent via add?
I'd say that your approach is close to an optimal solution, as the ADD command is pretty cheap compared to e.g. REPLACE. If you believe that you could still run into some bottlenecks, you could have a look at, or provide us with, the output of Java's profiler (e.g. -Xrunhprof:cpu=samples).
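If you do want a profile, one way to pass the HPROF flag is to start the server class directly with java instead of the basexserver script. A sketch (the jar path is an assumption for your installation):

```shell
# Start BaseXServer directly so JVM flags can be passed on the command
# line; the jar location is illustrative.
java -Xrunhprof:cpu=samples -cp basex.jar org.basex.BaseXServer
# On JVM exit, HPROF writes its sampling report to java.hprof.txt.
```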
- is there a performance cost in adding with an ID? We don't really
need them since we retrieve records via a query - and those resources aren't really files on the file-system
Currently, there are no additional costs, as we don't check whether a document with the same name/path has already been stored.
- is there a performance penalty in doing this kind of parsing concurrently?
Concurrent operations will be managed by the central transaction manager. At the time of writing, all write operations are performed one after another, but in the near future, concurrent write operations on different databases will also be run in parallel.
- are there any JVM parameters that would help speed this up?
In general, Java will be faster when run with -server, but this option may have been chosen anyway by your Java runtime. Regarding the maximum amount of memory, there shouldn't be any noteworthy differences when adding documents.
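For completeness, the same direct-invocation trick works for memory and VM options. A sketch, where the 6g heap and the jar path are illustrative values for an 8 GB machine, not recommendations:

```shell
# Bypass the basexserver wrapper script to control JVM options directly;
# heap size and jar path are assumptions, adjust for your setup.
java -server -Xmx6g -cp basex.jar org.basex.BaseXServer
```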
Hope this helps, Christian
Hi Christian,
I'd say that your approach is close to an optimal solution, as the ADD command is pretty cheap compared to e.g. REPLACE. If you believe that you could still run into some bottlenecks, you could have a look at, or provide us with, the output of Java's profiler (e.g. -Xrunhprof:cpu=samples).
OK, I will look into this if we get bitten by performance issues (the larger collections do usually take a fair amount of time to insert, at least when running concurrently).
- is there a performance penalty in doing this kind of parsing concurrently?
Concurrent operations will be managed by the central transaction manager. At the time of writing, all write operations are performed one after another, but in the near future, concurrent write operations on different databases will also be run in parallel.
Excellent news. I noticed things were slowing down when we had multiple collections inserted at the same time, so this should probably help.
- are there any JVM parameters that would help speed this up?
In general, Java will be faster when run with -server, but this option may have been chosen anyway by your Java runtime. Regarding the maximum amount of memory, there shouldn't be any noteworthy differences when adding documents.
Hope this helps, Christian
Thanks!
Manuel
basex-talk@mailman.uni-konstanz.de