I'm evaluating BaseX to characterize its performance for my use case.
In my use case, I receive XML files over the network at a rate of about 100 per second. I want to insert them into BaseX and perform real-time analytics on them. The XML files range from 20 to 50 nodes each and generally have a similar structure. I will be periodically running XQueries on this dataset.
I am using the Python network client to add documents in a tight loop to measure throughput. I notice that when the database is empty, adding 1,000 documents takes about 1 second, but when the database already holds 30,000 documents, adding the same 1,000 documents takes about 5 seconds. I was able to get much better scaling with AUTOFLUSH off and a FLUSH command executed periodically. So far, with a million documents in the database, adding documents seems to take close to constant time, which is great.
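For reference, here is a minimal sketch of that insertion loop, assuming the official BaseXClient Python module and a server on localhost:1984; the database name 'mydb' and the incoming_documents() generator are placeholders for my actual setup:

    from BaseXClient import BaseXClient

    session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')
    try:
        session.execute('OPEN mydb')              # 'mydb' is a placeholder name
        session.execute('SET AUTOFLUSH false')    # disable per-update flushing
        for i, doc in enumerate(incoming_documents()):  # placeholder XML source
            session.add('doc%d.xml' % i, doc)     # add each document under a unique path
            if i % 1000 == 999:
                session.execute('FLUSH')          # write pending data to disk periodically
        session.execute('FLUSH')                  # final flush before closing
    finally:
        session.close()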
I haven't gotten to optimizing queries yet. For simple aggregate queries like count(//*), query time appears to scale linearly with the number of documents in the database.
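For what it's worth, a minimal sketch of how such a query can be timed over the same session as above:

    import time

    q = session.query('count(//*)')    # simple aggregate over all element nodes
    start = time.time()
    result = q.execute()
    print('result: %s, took %.3f s' % (result, time.time() - start))
    q.close()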
Are there any general strategies for optimizing real-time analytics use cases like this? Are there options that can be tuned to improve document insertion and query speed and scaling? Or perhaps indexing options that work better for a dataset that is constantly changing and growing?
Thanks,
Simon