Greetings,
It's a good idea... perhaps expand the concept to recommendations on how to deal with large datasets generally.
I just had to write this fun script to unzip sets of XML files and split them into directories of 100 files each (could have been more), because the insertion would otherwise run out of memory on my cheap 4 GB DigitalOcean server using -Xmx3512m. By splitting up the data I could run with -Xmx1512m. It seems quite fast split up as well.
Could include tips like this...
echo -e "\n\n 1. Insert 13f /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA"
cd /mnt/DATASTORE/SEC/13F/ZIPXML/
for j in *.zip; do
  echo -e "\n\n Process ${j}"
  # start each archive with an empty XMLMETA directory
  rm -r /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA/*
  unzip -q ${j} -d /mnt/DATASTORE/SEC/13F/ZIPXML/
  cd /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA/
  # move the files into numbered directories of 100 files each
  # (quoted so that both commands are passed to parallel)
  ls | parallel -n100 'mkdir {#}; mv {} {#}'
  cd /mnt/appuappu-mexxon/COMPANY_XML_DB/13F/SCRIPTS/
  # insert one directory at a time to keep memory usage low
  for D in `find /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA/*/ -type d`
  do
    echo -e "\n\n Process ${D}"
    java -Xmx1512m -Xss8096k -cp ../../../SEC_SERVER/LIB/saxon9ee.jar:../../../SEC_SERVER/LIB/BaseX85.jar org.basex.BaseX -bpath=$D ../XQUERY/add13FDirectory.xq
  done
  cd /mnt/DATASTORE/SEC/13F/ZIPXML/
done
./optimizedb.sh
hmm I suppose I shouldn't be using ../../../SEC_SERVER/LIB/saxon9ee.jar in this case. Just copied that from my other scripts..
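Where GNU parallel is not installed, the batching step can be sketched in plain POSIX shell. This is only a minimal sketch, not the script used above: the function name split_into_batches and its two arguments (directory, batch size) are made up for illustration, and it assumes the files sit directly in the given directory.

```shell
# Hypothetical helper (not part of BaseX or of the script above):
# move the regular files found directly in directory $1 into numbered
# subdirectories 1, 2, 3, ... holding at most $2 files each (default 100).
# The body runs in a subshell so its variables do not leak to the caller.
split_into_batches() (
  dir=$1
  size=${2:-100}
  count=0
  batch=0
  for f in "$dir"/*; do
    [ -f "$f" ] || continue            # skip directories and an unmatched glob
    if [ $((count % size)) -eq 0 ]; then
      batch=$((batch + 1))             # start the next numbered directory
      mkdir -p "$dir/$batch"
    fi
    mv -- "$f" "$dir/$batch/"
    count=$((count + 1))
  done
)
```

Calling it as split_into_batches /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA 100 would leave numbered directories that can be fed to the java loop one at a time. A smaller batch size lowers the memory needed per insertion run at the cost of more JVM start-ups.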
Regards,
Alex
tech.jahtoe.com
bafila.jahtoe.com
On Tue, Jul 12, 2016 at 2:42 PM, Dirk Kirsten dk@basex.org wrote:
Hi Max,
I totally agree. By chance, I also ran into some issues with indexes yesterday and found the current documentation, especially on index configuration, not very exhaustive. Especially for inexperienced BaseX users, I find it rather inconvenient to figure out how to properly create and maintain indexes.
Best regards from the other side of the lake, Dirk
On 07/12/2016 04:33 PM, Maximilian Gärber wrote:
Hi,
I will try to add the information I found most helpful, but I am sure it will not be exhaustive...
Regards,
Max
2016-07-12 9:36 GMT+02:00 Christian Grün christian.gruen@gmail.com:
Hi Max,
Good Idea. I think it would fit into the "Advanced User's Guide". Personally, I would keep XQuery Update in the XQuery section, but "Indexes" (c|sh)ould surely be moved. And "Index Configuration" still needs to be written. Does it mean that you’d be interested in writing such an article? :)
Cheers Christian
I was just thinking there could be other users that stumble over the details of index updates and optimization.
Maybe this deserves a top-level page in the wiki? Then we would have:
- Indexes: Explains what is there
- XQuery Update: How to use
- Index Configuration: How to configure and good practice
Regards,
Max
--
Dirk Kirsten, BaseX GmbH, http://basexgmbh.de
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 91 68 276, Fax: 0049 7531 20 05 22