Re: [basex-talk] Adding millions of XML files

15 Apr 2013

Hi Freesoft,
...
I have uninstalled 7.6 and installed 7.7 beta. Then, created the empty db,
added the 3 files, run the "set addcache true" command, added the 17828
files... and no "out of memory" error, just the processing info:
Good to hear. Please note that it’s always faster to specify initial
documents along with the CREATE DB command instead of adding them in a
second step (but I’m aware that you’re mainly interested in the time
required to incrementally add new documents).
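
As a rough sketch of the difference, the two variants below build the same database in one step and in two steps. The database name "docs" and the directory /data/xml/initial are placeholders, not names taken from this thread. Run one variant or the other; if both run in sequence, the second CREATE DB replaces the first database.

```
# Variant 1: create the database and import the initial documents together
CREATE DB docs /data/xml/initial

# Variant 2: create an empty database, then add the documents in a second step
CREATE DB docs
ADD /data/xml/initial
```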
...

Is 7.7 beta sufficiently stable to be used in our production server?
Should I wait for the final 7.7 release?
The current snapshot should be a safe bet, as there will be no
critical updates until the official release.
...

Is the "addcache" property value permanently saved to the db? Should I
run the "set addcache true" command every time I add files?
The value of ADDCACHE is bound to the current BaseX instance and won't
be stored in the database. This means that you’ll have to set it to
true whenever you run a new BaseX instance.
But... as you stumbled upon an issue that has also been discussed
before, I had yet another look at the ADD command, and I added some
heuristics for directory inputs. If the documents to be added are
expected to blow up main memory, they will be cached even if ADDCACHE
is set to false. You are invited to check out the latest version [1]
and give us some more feedback.
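
A minimal sketch of a session that sets the option explicitly, assuming a placeholder database "docs" and a placeholder input directory /data/xml/incoming. Because the value is not stored in the database, the SET line has to be repeated in every new BaseX instance.

```
# ADDCACHE belongs to the running instance, so set it before adding
SET ADDCACHE true
# Open the existing database and add the new directory
OPEN docs
ADD /data/xml/incoming
```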
...

Should I keep the Text & Attribute indexes disabled? Is the "addcache=on"
option sufficient to allow the addition of XML files, so I can enable those
indexes? Will my queries be slow with those indexes disabled?
If text and attribute indexes are enabled, they will be invalidated
with an update and restored with the next OPTIMIZE call, so it’s a
good choice to keep the defaults. Not all queries will get slower
without indexes. You can have a look at the query info (shown e.g. in
the GUI’s InfoView) to see if the query plans with and without index
structures differ.
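
Outside the GUI, the same information can be requested with the QUERYINFO option. The sketch below assumes a placeholder database "docs" and a made-up query on a "title" element; the compilation part of the printed info should indicate whether an index rewrite was applied.

```
# Print compilation and evaluation details along with the query result
SET QUERYINFO true
OPEN docs
# Hypothetical query; run it with and without indexes and compare the info
XQUERY //title[text() = 'BaseX']
```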
...

Should I run Optimize after every massive insertion (even with
"addcache=on")?
It’s generally advisable to run OPTIMIZE whenever you want to perform
queries on your new data.
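
One possible way to arrange this, sketched with a placeholder database "docs" and placeholder batch directories, is to add several batches first and call OPTIMIZE once before querying:

```
OPEN docs
ADD /data/xml/batch1
ADD /data/xml/batch2
# Rebuild the index structures and statistics once, before running queries
OPTIMIZE
```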
...
mean an average of exactly 1 KB/file. Since my files are bigger than 1
KB (on average), the size limit (512 GiB) will be reached first.
My assumption is that you will first hit the node id limit (#Nodes),
but simply try and see what happens.
...
Please show me an easy example of how to use several databases in the same
query. Perhaps something like:
for $doc in (collection("db1"), collection("db2"))
for $node in $doc/$a_node_path

Looks fine. This is one more alternative:
for $i in 1 to 100
   let $db := "db" || $i
   return db:open($db)/your/path
...
Well, thank you very much for your help. And excuse me for the huge amount
of questions from a newbie like me :-)
Your questions are welcome. If you have some free time, you are invited
to read our documentation [2]; many of its contents have been inspired by
earlier discussions on this list.
Christian
[1] http://files.basex.org/releases/latest/
[2] http://docs.basex.org/wiki/Main_Page
