Hello everyone!
My name is Mathias. I'm using BaseX for a university project in which we are creating a publication database. Right now we have 25 GB of XML data spread over 180k documents. Ultimately I want to be able to perform XQuery searches on this data, possibly even full-text ones.
I'd like to know whether you think that BaseX is at all suitable for this amount of data. If yes, how would I add these files to the database optimally? If I use the BaseX GUI to add the folder, an OutOfMemoryException is thrown shortly after the process starts. Even providing more RAM (~7 GB via -Xmx7000M) only delays this. I haven't looked at the code, but it appears as though all file contents are kept in RAM and are only written to disk at the end, which would at least explain the huge amount of memory BaseX consumes.
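For reference, this is roughly how I launch the GUI with the increased heap (the jar name is a placeholder for whatever your distribution provides):

  java -Xmx7000M -cp basex.jar org.basex.BaseXGUI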
Since the GUI can't handle the files, I wrote an importer myself which consecutively adds single files via the "Add" command (see the sketch below). This seems to work without excessive memory use. However, it is taking ages to add all 180,000 files this way (several hours, and it hasn't completed yet). Maybe it's just delaying the overflow further since it's so slow. Also, this might just be my subjective impression, but adding files seems to get slower as the database grows. Is there some kind of duplicate check going on that could be in the way? If yes, is there a way to just bulk-insert all the data I have without checks?
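In shell terms, the importer does the equivalent of the following loop (a simplified sketch; the database name "oai", the data path, and the credentials are placeholders):

  # create an empty database once
  basexclient -U admin -P admin -c "CREATE DB oai"
  # then add the documents one by one
  find /data/xml -name "*.xml" | while read -r f; do
    basexclient -U admin -P admin -c "OPEN oai; ADD $f"
  done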
I'd be grateful for any thoughts on this!
Thanks in advance,
Mathias
Hi Mathias,
thanks for your inquiry. In BaseX, adding documents and collections to an existing database will be slower than creating them in bulk, so I would generally recommend using the Database -> New command, or "create db" on the command line, and specifying the directory to be parsed. This way, you should be able to build your database completely. If that doesn't work, try deactivating text and attribute indexing (Database -> New -> Indexes) and build the index structures afterwards via Database -> Properties.
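On the command line, this could look something like the following command script (the database name "oai" and the input path are placeholders):

  SET TEXTINDEX OFF
  SET ATTRINDEX OFF
  SET FTINDEX OFF
  CREATE DB oai /path/to/xml

Once the database has been built, the index structures can be added afterwards:

  OPEN oai
  CREATE INDEX TEXT
  CREATE INDEX ATTRIBUTE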
If database creation still fails, please provide us with more details:
– does the problem occur during the initial build step (which should take quasi-constant memory) or during the indexing of texts, attributes, or full text?
– which version of BaseX are you working with?
– does the problem persist with the latest snapshot [1]?
Hope this helps, Christian
[1] http://files.basex.org/releases/latest/

___________________________
BaseX Team
Christian Grün
Uni KN, Box 188
78457 Konstanz, Germany
http://www.basex.org
Hello Christian!
Thanks for your quick answer!
As you suggested, I tried using the "new" command. I haven't been successful so far, because I encountered a number of other problems during the process. Since the overall database creation takes several hours with this amount of data, it took equally long until some of the errors surfaced (invalid file names or invalid contents in some files).
Nevertheless, today I got it running without OutOfMemoryExceptions or other printed errors. Unfortunately, though, when I executed the "create db OAI [folder]" command in the BaseXClient (over ssh on my server), it never finished. Even though there were no error messages, I suppose something went wrong at some point, since after a while the process didn't show any activity any more (no CPU usage). When I tried to access the database through a second BaseXClient instance, the commands were not executed, or at least the client didn't show any results; the database was probably locked. In the end I just killed the processes, and nothing was created.
Next I will just try to get it running locally on my computer via the BaseX GUI and follow your advice to disable index creation etc. and see if it works.
To answer your questions:
- It seems to occur during the build step, since I never saw anything mentioned about indexes.
- I'm working with BaseX 6.6 via Maven.
- I have yet to try the latest snapshot.
I hope I'll get it working ^_^ My fallback solution is adding the files individually, even though that will probably take a whole day.
~ Mathias
Hi Mathias,
> As you suggested, I tried using the "new" command. I haven't been successful so far, because I encountered a number of other problems during the process. Since the overall database creation takes several hours with this amount of data, it took equally long until some of the errors surfaced (invalid file names or invalid contents in some files).
True; you need to ensure that all XML documents are well-formed. You could use xmllint or similar tools to find and remove those files in advance.
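For example, a loop along these lines (the path is a placeholder) lists all files that are not well-formed; xmllint exits with a non-zero status on parse errors, and --noout suppresses output of the parsed document:

  find /path/to/xml -name "*.xml" | while read -r f; do
    xmllint --noout "$f" || echo "not well-formed: $f"
  done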
> Nevertheless, today I got it running without OutOfMemoryExceptions or other printed errors. Unfortunately, though, when I executed the "create db OAI [folder]" command in the BaseXClient (over ssh on my server), it never finished.
That's quite unusual behavior; my guess is that too many URLs are resolved again and again, which can take a lot of time. I'd advise setting the intparse flag to true (set intparse on; create db ..., or Database -> New -> Parsing -> Use Internal Parser), or deactivating DTD parsing. If you need DTD handling, e.g. to resolve entities, you can specify a Catalog Resolver (http://docs.basex.org/wiki/Catalog_Resolver) instead. Once more, I recommend using the latest snapshot; this will make it easier to track down the cause of the problem.
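As a command script, the first variant could look like this (database name and path are, again, placeholders):

  SET INTPARSE ON
  CREATE DB oai /path/to/xml

And to deactivate DTD parsing instead:

  SET DTD OFF
  CREATE DB oai /path/to/xml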
Feel free to ask for more, Christian
Hello again!
Creating a database without indexes from 28 GB of XML files using the BaseX GUI worked on my computer and took only three and a half hours. To be safe, I provided it with 8 GB of RAM. Right now BaseX consumes 6 GB with the GUI, so I suppose it would never have worked on my V-Server (only 4 GB of RAM) anyway.
Thanks for your kind help!
~ Mathias