Hello,
I'm trying to load a large dataset (Medline, 16.7 million records, 66 GB on disk) into BaseX. The input data is divided into 563 files of 30,000 records each. When I try to load the data into a database with the following command:

create db path/dir medline_full

I get the following error message:

Error: "medline08n0507.xml" (Line 2620386): Document is too large for being processed.
The document itself can be loaded into a separate database without problems, and I also managed to load a merged file containing 1 million records, so the file size itself does not seem to be the problem.
I used the GUI implementation running on a quad-core server with SuSE 10.2 64-bit and 8 GB of memory, although the Java version is 32-bit (1.6.0_18).
Hence my question: does BaseX have a data limit, either at the file, record, or node level? Or are there hardware limitations that could result in such an error message?
With kind regards,
Judith
Dear Judith,
Thanks for the e-mail. Yes, BaseX does have some software-based upper limits: a single database must not contain more than 2^31 (~2 billion) XML nodes. Depending on the structure of the input documents, this can correspond to as much as 500 GB of input; for typical Medline data, whose markup is very element-dense (as you have already experienced), the practical upper limit is around 48 GB.
As a straightforward solution, I would indeed recommend creating several database instances (at least two). These instances can then be addressed by a single query, e.g. via:
for $d in 1 to 2 return doc(concat("medline", $d))//.....
Note, however, that if you plan to benefit from the database indexes, your queries might not be optimized the same way as for a single database instance. In this case, you need to address all database instances explicitly, as shown below:
(doc("medline1")//*[text() = 'A'], doc("medline2")//*[text() = 'A'])
As only a small number of users work on such large XML files (but we're glad to see that some people like you do), we haven't raised the limit yet, in favor of better execution times. To give some more technical background: one of the reasons is that Java arrays cannot have more than 2^31 entries, as array indexes are signed 32-bit integers. Still, we're aware of solutions, which will be realized as soon as we get more requests for XML instances of your order of magnitude.
Hope this helps,
Christian
___________________________
Christian Gruen
Universitaet Konstanz
Department of Computer & Information Science
D-78457 Konstanz, Germany
Tel: +49 (0)7531/88-4449, Fax: +49 (0)7531/88-3577
http://www.inf.uni-konstanz.de/~gruen
Dear Christian,
I loaded all of Medline into 6 databases named medline_full_1 to medline_full_6 and created the full-text indexes. Now I want to query across all 6 databases using the Perl module, but I can't get it to work properly. I want to address each database explicitly to get optimal performance, as you mentioned. This is the query I came up with:
my $cmd = "xquery for $i in (basex:db('medline_full_1')/MedlineCitationSet/MedlineCitation, basex:db('medline_full_2')/MedlineCitationSet/MedlineCitation, basex:db('medline_full_3')/MedlineCitationSet/MedlineCitation, basex:db('medline_full_4')/MedlineCitationSet/MedlineCitation, basex:db('medline_full_5')/MedlineCitationSet/MedlineCitation, basex:db('medline_full_6')/MedlineCitationSet/MedlineCitation) for $p in $i/Article/AuthorList/Author/LastName where $p contains text 'Smith' return ($i/PMID, $i/Article/Abstract/AbstractText)";
It seems to work but only evaluates the first database. How do I make it evaluate all 6?
A slightly different question: would using basex:index for the full-text index speed up the query, and how would I go about implementing this?
Thanks in advance,
Judith
Dear Judith,
> How do I make it evaluate all 6?
I just tried out your query with 6 small databases, and it returned results from all 6. Did you get any errors? You could also try replacing all basex:db() functions with doc() functions, like this:
my $cmd = "xquery for $i in (doc('medline_full_1')/MedlineCitationSet/MedlineCitation, doc('medline_full_2')/MedlineCitationSet/MedlineCitation, doc('medline_full_3')/MedlineCitationSet/MedlineCitation, doc('medline_full_4')/MedlineCitationSet/MedlineCitation, doc('medline_full_5')/MedlineCitationSet/MedlineCitation, doc('medline_full_6')/MedlineCitationSet/MedlineCitation) for $p in $i/Article/AuthorList/Author/LastName where $p contains text 'Smith' return ($i/PMID, $i/Article/Abstract/AbstractText)";
Hope this helps,
Andreas