I am importing a large set of XML documents (~3 million) into BaseX and am running into some problems as the databases grow beyond a few gigabytes. I am using BaseX 8.4.4 and have increased the memory available to BaseX to 12288m (12 GB) using -Xmx (the machine has 20 GB total). The documents are being stored in several databases that range in size from 300 MB up to 25 GB.
My question is: what is an optimal maximum database size (in gigabytes)? I am hoping for some general advice, and I appreciate that the answer can vary depending on various factors.
The problems that I've encountered with the larger databases are:
1. Running OPTIMIZE or OPTIMIZE ALL on the larger databases results in an out of memory error. I have switched to running CREATE INDEX to separately create the text, attribute, token, and full-text indexes, and found that creating these indexes separately produces fewer out of memory errors. I would like to be able to use OPTIMIZE ALL because over time some documents will be removed from the databases, and the documentation indicates that OPTIMIZE ALL will remove stale information from the indexes.
2. The command scripts that run CREATE INDEX or OPTIMIZE (ALL) seem to tie up the machine for a long time, maybe due to heavy disk access.
3. As the database grows in size, the rate at which documents are added slows down. I have been measuring the number of documents imported: I have observed rates over 100 documents per minute, and typical rates are around 30 to 60 documents per minute, but once the database grows beyond a few gigabytes the speed slows to around 20 documents per minute. This is not much of a problem, because when I see the rate slow down I can start a new database. Unfortunately, I have been recording the number of documents, not the database size.
In case this information is useful, my project is structured as follows:
* There is 1 central index database which records for each document the BaseX database name and path where a document is stored, and some metadata that we use to identify or locate documents.
* There are multiple content databases to store the actual documents. These content databases are organized by DTD and time period.
* Each insert is done using the BaseX REST API. A BaseX HTTP server instance is running to receive documents, and a basex instance is running from the command line to locate and provide documents. Each insert is done by a POST that includes the data and an updating query which adds (using db:replace) to the central index database and to a content database in one "transaction"; a sketch of such a query is shown below. This makes the import resilient to duplicate documents and to any problem that could prevent a single document from being added, and allows the process to continue if interrupted.
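To make this more concrete, here is a minimal sketch of what such a combined updating query might look like; the database names, target path, and metadata element are hypothetical placeholders rather than the actual names used in the project:

(: updating query sent with each REST POST request; names are placeholders :)
declare variable $doc  external;   (: the document, bound by the caller :)
declare variable $path external;   (: target path inside the content database :)
declare variable $content-db := 'content-dtd1-2016';   (: hypothetical content database :)
declare variable $index-db   := 'central-index';       (: hypothetical index database :)

(: store (or overwrite) the document itself :)
db:replace($content-db, $path, $doc),

(: record where the document is stored, plus metadata used to identify it :)
db:replace($index-db, $path,
  <entry>
    <database>{ $content-db }</database>
    <path>{ $path }</path>
  </entry>
)

Because both db:replace calls belong to the same pending update list, either both databases are updated or neither is, which gives the per-document "transaction" behaviour described above.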
I will probably need to re-organize the content databases so that each database is only a few gigabytes in size. Does anyone have advice on what would be a good maximum size for each database?
Thanks, Vincent
Vincent M. Lizzi - Electronic Production Manager Taylor & Francis Group 530 Walnut St., Suite 850, Philadelphia, PA 19106 E-Mail: vincent.lizzi@taylorandfrancis.com Phone: 215-606-4221 Fax: 215-207-0047 Web: http://www.tandfonline.com/
Taylor & Francis is a trading name of Informa UK Limited, registered in England under no. 1072954
"Everything should be made as simple as possible, but not simpler."
Hi Vincent,
My question is: what is an optimal maximum database size (in gigabytes)? I am hoping for some general advice, and I appreciate that the answer can vary depending on various factors.
Thanks for the precise details on your input data. Indeed it’s difficult to give general advice on the best size of a database, because it largely depends on the structure and contents of your documents (see e.g. [1]). In our own applications, single databases are rarely larger than 10 GB, but I know that some people are storing more than 100 GB in single instances.
- Running OPTIMIZE or OPTIMIZE ALL on the larger databases results in an out of memory error. I have switched to running CREATE INDEX to separately create the text, attribute, token, and full-text indexes, and found that creating these indexes separately produces fewer out of memory errors.
I would be interested in hearing which index structure causes most problems in your scenario. Do you use the full-text index? The value index structures should usually work fine with much less main memory assigned.
- The command scripts that run CREATE INDEX or OPTIMIZE (ALL) seem to tie up the machine for a long time, maybe due to heavy disk access.
This is probably something that cannot be avoided. Did you try incremental indexing (via the UPDINDEX option)?
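Something along these lines should enable it for an existing database (a sketch; 'content-db' is a placeholder name, and note that UPDINDEX only maintains the value indexes, not the full-text index):

(: enable incremental maintenance of the value indexes for an existing database;
   passing true() as the second argument triggers a full optimize :)
db:optimize('content-db', true(), map { 'updindex': true() })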
- As the database grows in size, the rate at which documents are added slows down. I have been measuring the number of documents imported: I have observed rates over 100 documents per minute, and typical rates are around 30 to 60 documents per minute.
I saw that you use REPLACE instead of ADD. If you know that the added documents will be distinct, ADD will be much faster. You could additionally set AUTOFLUSH to false and do explicit flushes at regular intervals (or avoid flushing completely if robustness of your data is not the first priority).
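A sketch of that import pattern, assuming AUTOFLUSH has already been set to false for the database and using placeholder names and an arbitrary batch size:

(: add documents without the bookkeeping overhead of REPLACE, and flush
   explicitly every 1000 documents instead of after every insert :)
declare variable $doc   external;   (: the document, bound by the caller :)
declare variable $path  external;   (: target path inside the database :)
declare variable $count external;   (: running document counter kept by the caller :)

db:add('content-db', $doc, $path),
if ($count mod 1000 = 0) then db:flush('content-db') else ()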
Hope this helps, Christian

[1] http://docs.basex.org/wiki/Statistics
Hi Christian,
Thanks for the information!
When using OPTIMIZE it is not clear what caused the out of memory error. With individual CREATE INDEX statements I ran into the out of memory error on the FULLTEXT index.
I have not tried incremental indexing with UPDINDEX or AUTOINDEX. My understanding from the documentation is that UPDINDEX does not update the full-text index, and that incremental indexing should be turned off to improve the speed of bulk imports. My insert script uses AUTOFLUSH false and calls db:flush for each document, because each insert involves two actions on the index database (inserting item metadata and updating the current state of the import process). This also ensures that each insert is atomic, so that one problematic document does not affect other documents and the import process can be interrupted and restarted.
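For illustration, the index-database part of each insert looks roughly like this (a sketch with placeholder names and paths; the document itself is stored in a content database as in the earlier sketch):

declare variable $path     external;   (: path of the document being imported :)
declare variable $metadata external;   (: item metadata element, bound by the caller :)
declare variable $state    external;   (: current state of the import process :)

db:replace('central-index', $path, $metadata),
db:replace('central-index', 'import-state.xml', $state),
(: AUTOFLUSH is false, so persist the updates explicitly :)
db:flush('central-index')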
I should clarify that the rates I mentioned include locating and retrieving each document in addition to inserting it into BaseX. Today I benchmarked ADD vs REPLACE and have not seen much difference in speed.
Today I found the section on Index Performance in the documentation (http://docs.basex.org/wiki/Indexes#Performance). This section mentions “If main memory runs out while creating a value index, the current index structures will be partially written to disk and eventually merged.” Does this mean that if running OPTIMIZE ALL ends with an out of memory error, running OPTIMIZE as many times as needed will eventually update all of the indexes?
Thank you, Vincent
Hi Vincent,
Thanks for the information!
And thanks back for the details.
When using OPTIMIZE it is not clear what caused the out of memory error. With individual CREATE INDEX statements I ran into the out of memory error on the FULLTEXT index.
I see; in both cases, it must be the full-text index. As you have discovered in the documentation, we write partial index structures to disk once main memory is exhausted. This works very well for the text and attribute indexes (you can usually index 10 GB of data with 100 MB of main memory), but it is becoming increasingly clear that the corresponding merge algorithms need to be improved and better adapted to the full-text index.
Did you try selective indexing (i.e., limiting the indexed full-text nodes to the ones that will eventually be queried)?
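A sketch of what that could look like via db:optimize, with placeholder element and database names (the ftinclude option mirrors the FTINCLUDE database option and restricts the full-text index to the listed elements):

(: rebuild the database with a full-text index that only covers the elements
   that are actually queried :)
db:optimize('content-db', true(), map {
  'ftindex'  : true(),
  'ftinclude': 'article-title abstract p'
})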
I have not tried incremental indexing with UPDINDEX or AUTOINDEX. My understanding from the documentation is that UPDINDEX does not update the full-text index, and that incremental indexing should be turned off to improve the speed of bulk imports.
Completely true; in your case, it doesn’t really help.
Today I benchmarked ADD vs REPLACE and have not seen much difference in speed.
Once REPLACE has been called, additional metadata structures are created that need to be maintained, so it could be that you will need to start from scratch with a new database.
Today I found the section on Index Performance in the documentation (http://docs.basex.org/wiki/Indexes#Performance). This section mentions “If main memory runs out while creating a value index, the current index structures will be partially written to disk and eventually merged.” Does this mean that if running OPTIMIZE ALL ends with an out of memory error, running OPTIMIZE as many times as needed will eventually update all of the indexes?
Once you run out of memory, indexing is interrupted and needs to be done again.
Hope this helps, Christian