Hi Christian,
Thanks for the information!
When using OPTIMIZE it is not clear which index caused the out of memory error. When running individual CREATE INDEX statements instead, the out of memory error occurred while creating the FULLTEXT index.
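For reference, the separate index creation I run is roughly the following command script (the database name is a placeholder):

  OPEN mydb
  CREATE INDEX text
  CREATE INDEX attribute
  CREATE INDEX token
  CREATE INDEX fulltext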
I have not tried incremental indexing with UPDINDEX or AUTOINDEX. My understanding from the documentation is that UPDINDEX does not update the full-text index, and that incremental indexing should be turned off to improve the speed of bulk imports. My insert script uses AUTOFLUSH false and calls db:flush for each document, for a few reasons: each insert involves two actions on the index database (inserting the item metadata and updating the current state of the import process); each insert should be atomic so that one problematic document does not affect other documents; and the import process should be able to be interrupted and restarted.
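For illustration, the per-document update is roughly along these lines (the database name, paths, and variable names are placeholders rather than the actual script):

  declare variable $item-id  external;   (: identifier of the current item :)
  declare variable $metadata external;   (: metadata record for the item :)

  let $db := 'import-index'
  return (
    (: both inserts and the flush are applied in one snapshot, so the update is atomic :)
    db:add($db, $metadata, $item-id || '.xml'),
    db:replace($db, 'import-state.xml', <state last-item='{$item-id}'/>),
    db:flush($db)
  )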
I should clarify that the rates I mentioned include locating and retrieving each document in addition to inserting it into BaseX. Today I benchmarked ADD vs. REPLACE and did not see much difference in speed.
Today I found the section on Index Performance in the documentation (http://docs.basex.org/wiki/Indexes#Performance).
This section mentions “If main memory runs out while creating a value index, the current index structures will be partially written to disk and eventually merged.” Does this mean that if running OPTIMIZE ALL ends with an out of memory error, running OPTIMIZE
as many times as needed will eventually update all of the indexes?
Thank you,
Vincent
From: Christian Grün [mailto:christian.gruen@gmail.com]
Sent: Wednesday, June 08, 2016 9:54 AM
To: Lizzi, Vincent <Vincent.Lizzi@taylorandfrancis.com>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Optimal size for large databases
Hi Vincent,
> My question is: what is an optimal maximum database size (in gigabytes)? I
> am hoping for some general advice. I appreciate that the answer can vary
> depending on various factors.
Thanks for the precise details on your input data. Indeed it’s
difficult to give general advice on the best size of a database,
because it largely depends on the structure and contents of your
documents (see e.g. [1]). In our own applications, single databases
are rarely larger than 10 GB, but I know that some people are storing
more than 100 GB in single instances.
> 1. Running OPTIMIZE or OPTIMIZE ALL on the larger databases results in an
> out of memory error. I have switched to running CREATE INDEX to separately
> create text, attribute, token, and fulltext indexes, and found that creating
> these indexes separately produces fewer out of memory errors.
I would be interested in hearing which index structure causes the most
problems in your scenario. Do you use the full-text index? The value
index structures should usually work fine with much less main memory
assigned.
> 2. The command scripts that run CREATE INDEX or OPTIMIZE (ALL) seem to tie
> up the machine for a long time, maybe due to heavy disk access.
This is probably something that cannot be avoided. Did you try
incremental indexing (via the UPDINDEX option)?
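For example, it can be enabled when the database is created (name and input path are placeholders):

  SET UPDINDEX true
  CREATE DB mydb /path/to/input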
> 3. As the database grows in size the rate at which documents are added slows
> down. I have been measuring the number of documents imported, and observed
> rates over 100 documents per minute, and typical rates are around 60 – 30
> documents per minute.
I saw that you use REPLACE instead of ADD. If you know that your added documents will be distinct, ADD will be much faster. You could additionally set AUTOFLUSH to false and do explicit flushes at regular time intervals (or skip flushing entirely if robustness of your data is not the first priority).
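As a rough sketch (database name and paths are placeholders):

  SET AUTOFLUSH false
  OPEN mydb
  ADD TO articles/article1.xml /data/article1.xml
  ADD TO articles/article2.xml /data/article2.xml
  FLUSH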
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Statistics