Responding to the last question about querying over collections: I had the same issue and Lukas Kircher provided the answer - http://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg02100.html
---- Original Message ----
From: kqfhjjgrn@yahoo.es
To: christian.gruen@gmail.com, fetanchaud@questel.com
Subject: Re: [basex-talk] Adding millions of XML files
Date: Mon, 15 Apr 2013 13:12:55 +0100 (BST)
Worked! :-)
I uninstalled 7.6 and installed the 7.7 beta. Then I created the empty db, added the 3 files, ran the "set addcache true" command, and added the 17,828 files... and no "out of memory" error, just the processing info:
Path "everything" added in 462943.7 ms.
That is ~8 minutes (on my development machine, not on our server).
Now I'm going to run some more tests (both for adding and for querying), and I'm going to try the "basex" command line tool in order to add XML files to the db automatically.
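Something like this is what I have in mind (just a rough sketch; "testdb" and the directory path are placeholders, and I am assuming the -c flag accepts several semicolon-separated commands):

    basex -c "OPEN testdb; SET ADDCACHE true; ADD /data/xml/2013-04; OPTIMIZE"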
Anyway, I would like to ask some more questions:
- Is 7.7 beta sufficiently stable to be used on our production server? Should I wait for the final 7.7 release?
- Is the "addcache" property value permanently saved to the db?
Should I run the "set addcache true" command everytime I add files?
- Should I keep the Text & Attribute indexes disabled? Is the "addcache=on" option sufficient to allow the addition of XML files, so that I can enable those indexes? Will my queries be slow with those indexes disabled?
- Should I run Optimize after every mass insertion (even with "addcache=on")?
Thank you for the information on limits, it is very useful. In particular, the following limits:
FileSize: 512 GiB
#Files: 536,870,912
imply an average of exactly 1 KiB per file (512 GiB = 2^39 bytes, and 2^39 / 536,870,912 = 2^39 / 2^29 = 1,024 bytes). Since my files are bigger than 1 KiB on average, the size limit (512 GiB) will be reached first. So my Perl scripts will have to detect the size of the db, and if it is bigger than ~500 GiB, create a new db and add the new XML files to it.
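For that check, an XQuery sketch could replace part of the Perl logic (assumptions on my side: the database directory sits under the default data path, and file:list() marks subdirectories with a trailing slash; the path is a placeholder):

    let $db-dir := "/home/user/BaseXData/testdb"  (: placeholder: data path + database name :)
    let $files := file:list($db-dir, true())[not(ends-with(., "/"))]  (: files only :)
    let $bytes := sum($files ! file:size($db-dir || "/" || .))
    return $bytes div (1024 * 1024 * 1024)  (: approximate on-disk size in GiB :)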
Please show me an easy example of how to use several databases in the same query. Perhaps something like:

for $doc in (collection("db1"), collection("db2"))
for $node in $doc/$a_node_path
etc...
Well, thank you very much for your help. And excuse me for the huge amount of questions from a newbie like me :-)
freesoft
From: Christian Grün christian.gruen@gmail.com
To: Fabrice Etanchaud fetanchaud@questel.com
CC: freesoft kqfhjjgrn@yahoo.es; "basex-talk@mailman.uni-konstanz.de" basex-talk@mailman.uni-konstanz.de
Sent: Monday, 15 April 2013 12:12
Subject: Re: [basex-talk] Adding millions of XML files
Hi kqfhjjgrn,
I believe that Fabrice already mentioned all details that should help you to build larger databases. The ADDCACHE option [1] (included in the latest stable snapshot [2]) may already be sufficient to add your documents via the GUI: simply run the "set addcache true" command via the input bar of the main window before opening the Properties dialog.
Note that you can access multiple databases with a single XQuery call, so if you know that you'll exceed the limits of a single database at some point (see [3]), simply create new databases at certain intervals.
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Options#ADDCACHE
[2] http://files.basex.org/releases/latest/
[3] http://docs.basex.org/wiki/Statistics
_________________________________________
The size of your test should not cause any problem for BaseX (18,000 files of 1 to 5 KB).
1. Did you try setting the ADDCACHE option?
2. You should OPTIMIZE your collection after each batch of ADD commands, even if no index is set.
3. Did you try unsetting the AUTOFLUSH option and explicitly FLUSHing the updates at the batch's end? (See the sketch below.)
4. The GUI may not be the best place to run updates; did you try the basex command line tools?
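For instance, a command script along these lines would cover points 1 to 3 (a sketch only; the database name and directory path are placeholders):

    OPEN testdb
    SET ADDCACHE true
    SET AUTOFLUSH false
    ADD /data/xml/batch1
    FLUSH
    OPTIMIZE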
In my experience, opening a collection containing a huge number of documents may take a long time. It seems to be related to the kind of in-memory data structure used to store the document names.
A workaround could be to insert your documents under a common root XML element with XQuery Update.
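A rough sketch of that idea, assuming a database "testdb" that was created from a single <wrapper/> document (the names and the file path are placeholders):

    insert node doc("file:///data/xml/doc1.xml")/*
      into db:open("testdb")/wrapper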
Best,
Fabrice Etanchaud
Questel-Orbit
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of freesoft
Sent: Monday, 15 April 2013 10:19
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Adding millions of XML files
Hi, I'm new to BaseX and to XQuery; I already knew XPath. I'm evaluating BaseX to store our XML files and run queries on them. We have to store about 1 million XML files per month. The XML files are small (~1 KB to 5 KB). So our case is: a high number of files, each of small size.
I've read that BaseX is scalable and has high performance, so it is probably a good tool for us. But in the tests I'm doing, I'm getting an "Out of Main Memory" error when loading a high number of XML files.
For example, if I create a new database ("testdb") and add 3 XML files, no problem occurs. The files are stored correctly, and I can run queries on them.
Then, if I try to add 18,000 XML files to the same database ("testdb") (using GUI > Database > Properties > Add Resources), I see the coloured memory bar grow and grow... until an error appears:
Out of Main Memory. You can try to:
- increase Java's heap size with the flag -Xmx<size>
- deactivate the text and attribute indexes.
The text and attribute indexes are disabled, so that is not the cause. And I increased the Java heap size with the -Xmx<size> flag (by editing the basexgui.bat script), and the same error happens.
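For reference, the java invocation I edited in basexgui.bat looks roughly like this (quoted from memory; the classpath variable differs between versions):

    java -Xmx2g -cp "%CP%" org.basex.BaseXGUI %*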
Probably BaseX loads all files into main memory first, and then dumps them to the database files. It shouldn't work that way. Each XML file should be loaded into main memory, processed, and then dumped to the db files, each file independently of the rest.
So I have two questions:
- Do I have to use a special method to add a high number of XML files?
- Is BaseX sufficiently stable to store and manage our data (about 1 million files will be added per month)?
Thank you for your help and for your great software, and excuse me if I am asking recurrent questions.
Thank you. The db:list() function looks useful for automatically looping over all the database instances:
for $db in db:list()
for $doc in collection($db)
for $node in $doc/$a_node_path
...
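A runnable variant of that sketch, for example counting hits per database (//record is a hypothetical stand-in for the real node path):

    for $db in db:list()
    let $hits := collection($db)//record  (: placeholder path :)
    return <db name="{$db}" count="{count($hits)}"/>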
freesoft
________________________________ De: "pw@themail.co.uk" pw@themail.co.uk Para: kqfhjjgrn@yahoo.es; christian.gruen@gmail.com; fetanchaud@questel.com CC: basex-talk@mailman.uni-konstanz.de Enviado: Lunes 15 de abril de 2013 23:55 Asunto: Re: [basex-talk] Adding millions of XML files