Hi Sateesh,
Thanks for the data you sent us.

TL;DR: You are querying 10,000 files ad hoc (i.e. opening, parsing and querying each file in memory). Solution: create a collection (which contains the files pre-parsed) and query that database instance instead.
1) General remarks: You are comparing node names like this:

let $cn := $R/*[xs:string(node-name(.)) = $nn]

Here node-name(.) constructs an xs:QName, which is then cast to an xs:string and compared. This can be achieved more simply with name(), which returns a string directly:

let $cn := $R/*[name() = $nn]
You also have a lot of data($f) calls where you actually only want $f/text(), or $f/string() for attributes [0].
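As a small sketch of what I mean (the element and attribute names here are just illustrations, not taken from your data):

let $f := <file path="c:/data/abc.xml">some text</file>
return (
  $f/text(),         (: the text node directly below $f :)
  $f/@path/string()  (: the string value of an attribute :)
)

Both are more explicit about what you want than a generic data() call.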
2) And probably the biggest lever for performance: You are creating in-memory document instances on the fly. For each file you open while iterating through $fpnode//filepaths/file, you 1. parse it, 2. represent it as an in-memory tree, and 3. query it.
It would be much more efficient to create a collection [1] (BaseX adds all XML files from your data directory to the collection once) and query the files located inside that collection.
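Creating the collection is a one-time step. A minimal sketch, assuming your files live in c:/data/ and you name the database "collection-sateesh" (both are assumptions, adjust to your setup):

In the BaseX client or GUI command line:

CREATE DB collection-sateesh c:/data/

Or, in recent BaseX versions, from XQuery via the database module:

db:create("collection-sateesh", "c:/data/")

After that, all documents are stored pre-parsed and indexed, so each query no longer pays the parsing cost.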
I made a small example with 100 copies of your file: the query takes about 4 seconds when each XML document is parsed and queried ad hoc. When I create a collection with the same 100 copies and run the query against it, it takes only ~500 milliseconds.
Once you have created the collection, change the line that opens the documents to:
let $x := doc("collection-sateesh/" || tokenize($f,"/")[last()] )
which does the following:

tokenize($f, "/")[last()]

takes your path attributes like "c:/data/abc.xml" and returns the file name (the part after the last slash). The `||` operator concatenates the strings, so we open each document in your collection that is referenced by the file names and run the rest of your query unchanged.
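For example (the path is just an illustration):

let $f := "c:/data/abc.xml"
return doc("collection-sateesh/" || tokenize($f, "/")[last()])

This resolves to doc("collection-sateesh/abc.xml"), i.e. the pre-parsed document inside the database rather than a file on disk.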
I'll send the updated XQuery file privately so you can have a look.
Kind regards Michael
[0] https://gist.github.com/faecd677274ac6ac7770
[1] http://docs.basex.org/wiki/Databases

On 10 Aug 2012, at 09:24, Michael Seiferle ms@basex.org wrote:
Hi Sateesh,
I have a requirement of querying a large number of XML files, somewhere around 10,000. I have written the query, and executing it takes a huge amount of memory and time: around 700 MB of memory and around 4-5 minutes. Is there a way to execute the query with less memory and in a shorter time?
Probably yes, but this depends on your query. Could you provide some example code and maybe one of your 10k XML files? In case you do not want to send them to the list, use support@basex.org for the attachments.
Kind regards Michael