Dear Judith,
thanks for the e-mail. Yes, BaseX does have some software-based upper limits: a single database cannot contain more than 2^31 (~2 billion) XML nodes. Depending on the structure of the input documents, this can correspond to input sizes of up to 500 GB; for typical MedLine data (as you have already experienced), the upper limit is around 48 GB.
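If you want to check how many nodes an existing instance already contains, a rough count can be run directly in XQuery. This is just a sketch (it traverses the whole database, so it may take a while, and counting attributes separately only approximates the internal node count), assuming a database named "medline1" as in the examples below:

count(doc("medline1")//node()) + count(doc("medline1")//@*)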
As a straightforward solution, I would indeed recommend creating several database instances (at least two). These instances can be addressed by a single query, e.g. via…
for $d in 1 to 2 return doc(concat("medline", $d))//.....
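To create those instances in the first place, you could split your input directory in two and reuse the command syntax from your mail; a minimal sketch (the directory names are placeholders for wherever you split the files):

create db path/dir1 medline1
create db path/dir2 medline2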
Note, however, that if you plan to benefit from the database indexes, for example, such queries might not be optimized the same way as for a single database instance. In this case, you need to address all database instances explicitly, as shown below:
(doc("medline1")//*[text() = 'A'], doc("medline2")//*[text() = 'A'])
As only a small number of our users work on such large XML files (but we're glad to see that some people like you do…), we haven't increased the limit yet, in favor of better execution times. To give some more technical background: one of the reasons is that Java arrays can't have more than 2^31 entries, as array indexes are signed 32-bit integers. Still, we're aware of solutions, and these will be realized as soon as we get more requests for XML instances of your order of magnitude.
Hope this helps,
Christian
___________________________
Christian Gruen
Universitaet Konstanz
Department of Computer & Information Science
D-78457 Konstanz, Germany
Tel: +49 (0)7531/88-4449, Fax: +49 (0)7531/88-3577
http://www.inf.uni-konstanz.de/~gruen
On Wed, Jun 9, 2010 at 4:54 PM, judith.risse@wur.nl wrote:
Hello,
I'm trying to load a large dataset (Medline, 16.7 million records, 66 GB on disk) into BaseX. The input data is divided into 563 files of 30,000 records each. When I try to load the data into a database with the following command:

create db path/dir medline_full

I get the following error message:

Error: "medline08n0507.xml" (Line 2620386): Document is too large for being processed.
The document itself can be loaded into a separate database without problems, and I also managed to load a merged file containing 1 million records, so the file size itself does not seem to be the problem.
I used the GUI implementation running on a quad-core server with SuSE 10.2 64-bit and 8 GB of memory, although the Java version is 32-bit (1.6.0_18).
Hence my question: does BaseX have a data limit, either on file, record or node level? Or are there hardware limitations that could result in such an error message?
With kind regards,
Judith