Hi Mathias,
thanks for your inquiry. In BaseX, adding documents and collections to an existing database is slower than creating the database in one bulk operation, so I would generally recommend using the Database -> New command, or "create db" on the command line, and specifying the directory to be parsed. This way, you should be able to build your database completely. If the bulk import still runs out of memory, you could try deactivating text and attribute indexing (Database -> New -> Indexes) and building the index structures afterwards via Database -> Properties.
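For example, a command sequence on the command line could look roughly like this (the database name "pubs" and the input path are just placeholders; adjust them to your setup):

  SET TEXTINDEX false
  SET ATTRINDEX false
  CREATE DB pubs /path/to/your/xml/
  CREATE INDEX TEXT
  CREATE INDEX ATTRIBUTE

The first two commands skip index construction during the initial build; the last two create the index structures once all documents are in the database.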
If database creation still fails, please provide us with more details:
– does the problem occur during the initial build step (which should take quasi-constant memory) or during the indexing of texts, attributes, or full text?
– which version of BaseX are you working with?
– does the problem persist with the latest snapshot [1]?
Hope this helps, Christian
[1] http://files.basex.org/releases/latest/

___________________________
BaseX Team
Christian Grün
Uni KN, Box 188
78457 Konstanz, Germany
http://www.basex.org
On Sat, Jun 25, 2011 at 4:50 PM, Mathias K <mathias.kahl@googlemail.com> wrote:
Hello everyone! My name is Mathias. I'm using BaseX for a university project in which we are creating a publication database. Right now we have 25 GB of XML data spread over 180k documents. Ultimately I want to be able to perform XQuery searches on this data, possibly even full-text queries. I'd like to know whether you think BaseX is suitable for this amount of data at all. If yes, how would I add these files to the database optimally?

If I use the BaseX GUI to add the folder, an OutOfMemoryException is thrown shortly after the process starts. Even providing more RAM (~7 GB via -Xmx7000M) only delays this. I haven't looked at the code, but it appears as though all file contents are kept in RAM and are only written to disk at the end, which would at least explain the huge amounts of memory BaseX consumes.

Since the GUI can't handle the files, I wrote an importer myself which adds single files consecutively via the "Add" command. This seems to work without excessive memory use. However, it is taking ages to add all 180,000 files this way (several hours; it hasn't completed yet). Maybe it's just delaying the overflow further because it's so slow. Also, this might just be my subjective impression, but adding files seems to get slower as the database grows. Is there some kind of duplicate check going on that could be in the way? If yes, is there a way to just bulk-insert all the data I have without checks?
I'd be grateful for any thoughts on this!
Thanks in advance,
Mathias

_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk