Responding to the last question about querying over collections: I
had the same issue and Lukas Kircher provided the answer -
http://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg02100.html
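
The gist (also sketched later in this thread) is simply to pass several
collections to a single query. A minimal sketch, where the database
names and the element name are placeholders:

   for $doc in (collection("db1"), collection("db2"))
   return $doc//record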
>
>---- Original Message ----
>From: kqfhjjgrn@yahoo.es
>To: christian.gruen@gmail.com, fetanchaud@questel.com
>Subject: Re: [basex-talk] Adding millions of XML files
>Date: Mon, 15 Apr 2013 13:12:55 +0100 (BST)
>
>>Worked! :-)
>>
>>I have uninstalled 7.6 and installed the 7.7 beta. Then I created the
>>empty db, added the 3 files, ran the "set addcache true" command, and
>>added the 17828 files... and no "out of memory" error, just the
>>processing info:
>>
>> Path "everything" added in 462943.7 ms.
>>
>>that is ~8 minutes (on my development machine, not on our server).
>>
>>Now I'm going to do some more tests (both for adding and for
>>querying), and I'm going to try the "basex" command, in order to add
>>XML files to the db automatically.
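>>
>>Something like this is what I have in mind (an untested sketch; the
>>database name and the path are placeholders), run as a BaseX command
>>script via the "basex" command:
>>
>>   CREATE DB testdb
>>   SET ADDCACHE true
>>   ADD /path/to/xml-files/
>>   OPTIMIZE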
>>
>>Anyway, I would like to ask some more questions:
>>
>>1. Is the 7.7 beta sufficiently stable to be used on our production
>>server? Should I wait for the final 7.7 release?
>>
>>2. Is the "addcache" property value permanently saved to the db?
>>Should I run the "set addcache true" command every time I add files?
>>
>>3. Should I keep the Text & Attribute indexes disabled? Is the
>>"addcache=on" option sufficient to allow the addition of XML files, so
>>that I can enable those indexes? Will my queries be slow with those
>>indexes disabled?
>>
>>4. Should I run Optimize after every massive insertion (even with
>>"addcache=on")?
>>
>>Thank you for the information on limits; it is very useful. In
>>particular, the following limits:
>>
>>FileSize: 512 GiB
>>#Files: 536,870,912
>>
>>imply an average of exactly 1 KiB per file. Since my files are bigger
>>than 1 KiB on average, the size limit (512 GiB) will be reached first.
>>So my Perl scripts will have to detect the size of the db, and if it
>>is bigger than ~500 GB, they will create a new db and add new XML
>>files to it.
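>>
>>I suppose something like the following will report the size (just a
>>sketch; I still have to check the exact output):
>>
>>   OPEN testdb
>>   INFO DB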
>>
>>Please show me an easy example of how to use several databases in
>>the same query. Perhaps something like:
>>
>>  for $doc in (collection("db1"), collection("db2"))
>>  for $node in $doc//some-element
>>  return $node
>>
>>Well, thank you very much for your help. And excuse me for the huge
>>amount of questions from a newbie like me :-)
>>
>>
>>freesoft
>>
>>________________________________
>>From: Christian Grün <christian.gruen@gmail.com>
>>To: Fabrice Etanchaud <fetanchaud@questel.com>
>>CC: freesoft <kqfhjjgrn@yahoo.es>;
>>"basex-talk@mailman.uni-konstanz.de" <basex-talk@mailman.uni-konstanz.de>
>>Sent: Monday, 15 April 2013 12:12
>>Subject: Re: [basex-talk] Adding millions of XML files
>>
>>
>>Hi kqfhjjgrn,
>>
>>I believe that Fabrice already mentioned all the details that should
>>help you to build larger databases. The ADDCACHE option [1] (included
>>in the latest stable snapshot [2]) may already be sufficient to add
>>your documents via the GUI: simply run the "set addcache true" command
>>via the input bar of the main window before opening the Properties
>>dialog.
>>
>>Note that you can access multiple databases with a single XQuery call,
>>so if you know that you'll exceed the limits of a single database at
>>some point (see [3]), simply create new databases at certain
>>intervals.
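>>
>>For instance, if you give the databases a common name prefix, a single
>>query can address all of them at once. A quick sketch (the prefix and
>>the element name are placeholders):
>>
>>  for $db in db:list()[starts-with(., "xmldb-")]
>>  return collection($db)//record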
>>
>>Hope this helps,
>>Christian
>>
>>[1] http://docs.basex.org/wiki/Options#ADDCACHE
>>[2] http://files.basex.org/releases/latest/
>>[3] http://docs.basex.org/wiki/Statistics
>>_________________________________________
>>
>>> The size of your test should not cause any problem for BaseX
>>> (18,000 files from 1 up to 5 KB).
>>>
>>> 1. Did you try to set the ADDCACHE option?
>>>
>>> 2. You should OPTIMIZE your collection after each batch of ADD
>>> commands, even if no index is set.
>>>
>>> 3. Did you try to unset the AUTOFLUSH option, and explicitly FLUSH
>>> the updates at the batch's end (see the sketch after this list)?
>>>
>>> 4. The GUI may not be the best place to run updates; did you try the
>>> basex command line tools?
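>>>
>>> A batch run along those lines might look like this (a minimal
>>> sketch; the database name and the path are placeholders):
>>>
>>>   OPEN testdb
>>>   SET AUTOFLUSH false
>>>   ADD /data/batch-2013-04/
>>>   FLUSH
>>>   OPTIMIZE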
>>>
>>> In my experience, opening a collection containing a huge number of
>>> documents may take a long time.
>>>
>>> It seems to be related to the kind of in-memory data structure used
>>> to store the document names.
>>>
>>> A workaround could be to insert your documents under a common root
>>> XML element with XQuery Update.
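>>>
>>> For instance, something like this (a rough sketch; the database
>>> name, file path, and root element name are all placeholders):
>>>
>>>   insert node doc("/data/xml/new-file.xml") into db:open("testdb")/docs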
>>>
>>> Best,
>>>
>>> Fabrice Etanchaud
>>>
>>> Questel-Orbit
>>>
>>> From: basex-talk-bounces@mailman.uni-konstanz.de
>>> [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of freesoft
>>> Sent: Monday, 15 April 2013 10:19
>>> To: basex-talk@mailman.uni-konstanz.de
>>> Subject: [basex-talk] Adding millions of XML files
>>>
>>> Hi, I'm new to BaseX and to XQuery; I already knew XPath. I'm
>>> evaluating BaseX to store our XML files and run queries on them. We
>>> have to store about 1 million XML files per month. The XML files are
>>> small (~1 KB to 5 KB). So our case is: a large number of small files.
>>>
>>> I've read that BaseX is scalable and has high performance, so it is
>>> probably a good tool for us. But in the tests I'm doing, I'm getting
>>> an "Out of Main Memory" error when loading a large number of XML
>>> files.
>>>
>>> For example, if I create a new database ("testdb") and add 3 XML
>>> files, no problem occurs. The files are stored correctly, and I can
>>> run queries on them. Then, if I try to add 18000 XML files to the
>>> same database ("testdb") (by using GUI > Database > Properties > Add
>>> Resources), I see how the coloured memory bar grows and grows...
>>> until an error appears:
>>>
>>> Out of Main Memory.
>>> You can try to:
>>> - increase Java's heap size with the flag -Xmx<size>
>>> - deactivate the text and attribute indexes.
>>>
>>> The text and attribute indexes are disabled, so that is not the
>>> cause. And I increased the Java heap size with the -Xmx<size> flag
>>> (by editing the basexgui.bat script), and the same error happens.
>>>
>>> Probably BaseX loads all files into main memory first, and then
>>> dumps them to the database files. It shouldn't be done that way:
>>> each XML file should be loaded into main memory, processed, and
>>> dumped to the db files independently of the rest.
>>>
>>> So I have two questions:
>>> 1. Do I have to use a special way to add a large number of XML
>>> files?
>>> 2. Is BaseX sufficiently stable to store and manage our data (about
>>> 1 million files will be added per month)?
>>>
>>> Thank you for your help and for your great software, and excuse me
>>> if I am asking recurring questions.
>>>
>>>
>>> _______________________________________________
>>> BaseX-Talk mailing list
>>> BaseX-Talk@mailman.uni-konstanz.de
>>> https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk