Hi BaseX,
I need to query a large number of XML files, around 10,000. I have written the query, but executing it takes a huge amount of memory and time: around 700 MB of memory and 4-5 minutes. Is there a way to run the query with less memory and in less time?
Thanks & Regards
Sateesh.A
Hi Sateesh,
Probably yes, but this depends on your query. Could you provide some example code and maybe one of your 10k XML files? If you do not want to send them to the list, use support@basex.org for the attachments.
Kind regards Michael
Hi Sateesh,
thanks for the data you sent us.
TL;DR: You are querying 10,000 files ad hoc (i.e. each file is opened, parsed and queried in memory). Solution: create a collection (which contains the files pre-parsed) and query that database instance.
1) General remarks: You are comparing node names like so:
let $cn := $R/*[xs:string(node-name(.)) = $nn]
Here node-name(.) constructs a QName, which is then cast to xs:string and compared. This can be achieved more easily with name(), which returns a string directly:
let $cn := $R/*[name(.) = $nn]
You also have a lot of data($f) calls where you actually only want $f/text() or, for attributes, $f/string() [0].
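To illustrate the two forms, here is a minimal, self-contained sketch. The <record> element and its children are made-up sample data, not taken from the files discussed here; only $R, $nn and the two predicates come from the query above.

```xquery
(: Hypothetical sample data; $R and $nn mirror the variable names used above. :)
let $R := <record><title>first</title><author>someone</author><title>second</title></record>
let $nn := "title"
(: Original form: build a QName, cast it to a string, then compare. :)
let $by-node-name := $R/*[xs:string(node-name(.)) = $nn]
(: Simpler form: name() already returns a string. :)
let $by-name := $R/*[name(.) = $nn]
return (
  (: number of children matched by the simpler predicate :)
  count($by-name),
  (: true if both predicates selected the same elements :)
  deep-equal($by-node-name, $by-name),
  (: string value of each matching element :)
  $by-name/string()
)
```

Both predicates select the same two <title> elements, so the simpler form can replace the original one without changing the result.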
2) Probably the best solution for better performance: You are creating in-memory document instances on the fly. For each file you open while iterating through $fpnode//filepaths/file, you:
.1 parse it
.2 represent it as an in-memory tree
.3 query it
It would be much more efficient to create a collection [1] (BaseX adds all XML files from your data directory to the collection once) and query the files located inside the collection.
I made a small example with 100 copies of your file: the query takes 4 seconds when each XML document is parsed and queried ad hoc. When I create a collection with 100 copies of your file and run the query, it takes only ~500 milliseconds.
Once you have created the collection, change the line that opens the documents to:
let $x := doc("collection-sateesh/" || tokenize($f,"/")[last()] )
which does the following: The expression
tokenize($f,"/")[last()]
takes your path attributes like "c:/data/abc.xml" and returns the filename (the part after the last slash). The `||` operator concatenates it with the collection name, so we open each document of your collection that is referenced in the filenames and run your remaining query unchanged.
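The path handling can be tried in isolation. This is a small sketch that assumes the collection is named collection-sateesh, as in the line above, and uses a literal path in place of the real file attributes.

```xquery
(: Example path taken from the explanation above; in the real query $f comes from the file list. :)
let $f := "c:/data/abc.xml"
(: Split at every slash and keep the last piece, i.e. the filename. :)
let $filename := tokenize($f, "/")[last()]
(: Prepend the collection name to get the document path inside the database. :)
return "collection-sateesh/" || $filename
```

This evaluates to the string "collection-sateesh/abc.xml", which is what doc() receives. If the paths used backslashes instead of slashes, the separator pattern would need adjusting.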
I'll send the updated XQuery file privately so you can have a look.
Kind regards Michael
[0] https://gist.github.com/faecd677274ac6ac7770
[1] http://docs.basex.org/wiki/Databases
Hi Michael,
Thanks for the quick reply; I am really amazed to get a response in such a short time.
I will get back to you after making the suggested changes.
Thanks & Regards Sateesh.A
Hi Michael,
I have tried to implement your suggested changes, but I got stuck because the 10k XML files I have to query come from different folders. One more question: how do I create collections programmatically before running the query?
Thanks & Regards Sateesh.A
Hi Michael,
Waiting for your response.
Thanks & Regards Sateesh.A
Sateesh,
sorry, I totally overlooked your last email. I'll reply inline.
On 18.08.2012 at 08:58, "sateesh" sateesh@intense.in wrote:
Hi Michael,
I have tried to implement your suggested changes, but I got stuck because the 10k XML files I have to query come from different folders. One more question: how do I create collections programmatically before running the query?
XQuery currently has no way to create a collection on the fly, so you would have to use our Java API [1] or command-line API [2].
To create a collection from different folders, you would do the following:
CREATE DB myDB "path/to/files"
• creates the database myDB with all documents found in the input directory.
ADD TO target/ xmldir
• adds all files from the xmldir directory to the database in the target path.
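Once the database exists, a quick check from XQuery can confirm that the files from all folders arrived. This is only a sketch: it assumes the database is called myDB as in the command above, and it uses the BaseX-specific db:list function next to the standard collection() function.

```xquery
(: Assumes a database named "myDB" was created with the commands above. :)
let $db := "myDB"
return (
  (: number of documents stored in the database :)
  count(collection($db)),
  (: database paths of the stored resources, e.g. to spot a missing folder :)
  db:list($db)
)
```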
I hope this helps :-)
Kind Regards Michael
[1] https://github.com/BaseXdb/basex-examples/blob/master/src/main/java/org/basex/examples/query/CreateCollection.java
[2] http://docs.basex.org/wiki/Commands
Hi Michael,
I created a collection of 2k XML files as described in your previous mail and ran the query. Even with the collection, memory consumption is still high (700 MB of heap memory), and processing takes 3 minutes.
Thanks & Regards Sateesh.A
Hi Sateesh,
I saw that you sent Dirk an XQuery file; is it the same one that takes that much memory? If so, we will see if we can help with that :)
Kind regards Michael
Hi Michael,
What I sent Dirk was a separate issue (grouping of records); there I am also facing a memory issue. In our case of querying 10k XML files, the query still takes a huge amount of memory even after creating collections, as mentioned in my previous mail.
I am waiting for your suggestions. They would really help me close the issue, as I am at a crucial stage of the project.
Thanks & Regards Sateesh.A
basex-talk@mailman.uni-konstanz.de