Hello,
I have a 1 GB XML file which contains 94,50,001 (about 9.45 million) records. When I fire a query to retrieve all the records, it takes approximately 30 seconds to execute.
I fired the query below using the Java API (QueryProcessor):
"transaction/* except (/transaction/traInfo)"
QueryProcessor proc = new QueryProcessor(query, context);
Result result = proc.execute();
System.out.println(result.size());
Iter itr = proc.iter();
Executing the code above took 40 secs. Is there scope for improvement without changing the query?
My XML file looks like this:
<transaction>
<traInfo id="ti1"> <date>01-01-2014</date> <source>bank1</source> </traInfo>
<traInfo id="ti2"> <date>01-01-2014</date> <source>bank2</source> </traInfo>
<traInfo id="ti3"> <date>01-01-2014</date> <source>bank3</source> </traInfo>
<income transInfoRef="ti1">1000</income> <assets transInfoRef="ti1">1000</assets> <liablity transInfoRef="ti1">1000</liablity> <grossprofit transInfoRef="ti1">1000</grossprofit>
<income transInfoRef="ti2">1000</income> <assets transInfoRef="ti2">1000</assets> <liablity transInfoRef="ti2">1000</liablity> <grossprofit transInfoRef="ti2">1000</grossprofit>
<income transInfoRef="ti3">1000</income> <assets transInfoRef="ti3">1000</assets> <liablity transInfoRef="ti3">1000</liablity> <grossprofit transInfoRef="ti3">1000</grossprofit>
...
</transaction>
Hi Kunal,
we need more information to help you here. What does your query look like? Does it benefit from index structures (please check the output in the QueryInfo panel)? Are your index structures up-to-date?
When I fire a query to retrieve all the records, it takes approximately 30 seconds to execute.
System.out.println(result.size());
Just in case: printing something to stdout often takes much more time than evaluating the query itself.
Christian
Hi Christian,
how do I print QueryInfo through the Java API?
I checked the query in the BaseX GUI, and there it shows Total Time: 21 secs. What else do I need to check in the QueryInfo panel?
Apart from this, the following is my DB info.
Indexes:
- Up-to-date: true
- TEXTINDEX: true
- ATTRINDEX: true
- FTINDEX: true
- LANGUAGE: English
- STEMMING: false
- CASESENS: false
- DIACRITICS: false
- STOPWORDS:
- UPDINDEX: false
- MAXCATS: 100
- MAXLEN: 96
I also recorded several timestamps in my code, which I am sharing with you:
QueryProcessor proc = new QueryProcessor(query, context);
Result result = proc.execute();   // took 21259 ms
System.out.println(result.size());
Iter itr = proc.iter();
while ((item = itr.next()) != null) {
    if (count >= start) System.out.println(/* item.serialize() */);
    count++;
    if (count > end) break;
}   // the loop took 18 more secs to complete
The total time taken is 40 secs for the following query:
"transaction/* except (/transaction/traInfo)"
- Kunal
how do I print QueryInfo through the Java API?
The -V flag or the QUERYINFO option are alternatives; please find more information in the documentation.
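For the Java API, a minimal sketch could look like the following; I am writing it from memory, so treat MainOptions.QUERYINFO and QueryProcessor.info() as assumptions and double-check them against your BaseX version:

import org.basex.core.Context;
import org.basex.core.MainOptions;
import org.basex.query.QueryProcessor;

// Sketch (names are assumptions, see above): enable query info, run the query,
// then print the collected parsing/compiling/evaluating timings.
Context context = new Context();
context.options.set(MainOptions.QUERYINFO, true);

QueryProcessor proc = new QueryProcessor("transaction/* except (/transaction/traInfo)", context);
proc.execute();
System.out.println(proc.info());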
"transaction/* except (/transaction/traInfo)"
You could try to rewrite this to...
transaction/* except traInfo
...or...
let $traInfo := /transaction/traInfo return transaction/* except $traInfo
...or...
transaction/*[name() ne 'traInfo']
Hi Christian,
Thanks for your suggestion. It gave me a considerable improvement.
Now I get the query result in 13 secs, and the loop completes within half a second. That is definitely a remarkable improvement.
Still, is there any way to optimize this further?
It would be great if you could explain why the query and loop took so much time earlier and why they now complete quickly.
Apart from this, I have one concern: in my application the XQueries will be provided by end users, so I will not always be able to change or optimize the query.
Does BaseX use a query optimizer, or can you suggest any external tool/library for this?
Regards, Kunal Chauhan
On Thu, Aug 21, 2014 at 9:11 PM, Christian Grün christian.gruen@gmail.com wrote:
transaction/* except traInfo
Sorry, this one was nonsense.
It would be great if you could explain why the query and loop took so much time earlier and why they now complete quickly.
In your original query...
transaction/* except (/transaction/traInfo)
...the second path expression was evaluated for each result of the first expression. While it would theoretically be possible for the query processor to cache the results of the second expression, it is difficult in practice to decide when this is reasonable. Besides that, you have a potentially large number of sets that need to be compared every time, resulting in at most n*n comparisons (i.e. O(n²)). The following query (which is probably the one you chose?) will always be linear:
transaction/*[name() ne 'traInfo']
Apart from this, I have one concern: in my application the XQueries will be provided by end users, so I will not always be able to change or optimize the query.
It is hardly possible to restrict end users to queries that are fast enough to be processed in a given time. As an example, the following query will take hours or even days to compute, even though it looks simple:
(1 to 10000000000)[. = 0]
However, you can limit evaluation time and memory consumption, as described here:
http://docs.basex.org/wiki/XQuery_Module#xquery:eval
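If the user queries are executed through the Java API as in your code, you could wrap them in xquery:eval before handing them to the QueryProcessor. The following is only a sketch written from memory; the 'timeout' option (in seconds) and the bind() call should be double-checked against the documentation:

// Sketch: run an untrusted user query with a time limit via xquery:eval.
// 'timeout' is in seconds; verify the option names against your BaseX version.
String userQuery = "(1 to 10000000000)[. = 0]";
String wrapper =
    "declare variable $query external; " +
    "xquery:eval($query, map { }, map { 'timeout': 10 })";

QueryProcessor proc = new QueryProcessor(wrapper, context);
proc.bind("query", userQuery);
Iter iter = proc.iter();   // evaluation should be stopped with an error once the limit is reached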
Does BaseX use a query optimizer, or can you suggest any external tool/library for this?
BaseX would be pretty much worthless without a query optimizer, so I don't quite understand what you mean by that.
Christian
Thanks Christian,
I changed the query from "transaction/* except (/transaction/traInfo)" to "transaction/*[name() ne 'traInfo']" as you suggested. The latter takes 13 secs to complete, while the former was taking 21 secs.
Loop:
while ((item = itr.next()) != null) {
    if (count >= start) System.out.println(/* item.serialize() */);
    count++;
    if (count > end) break;
}
Former query: "transaction/* except (/transaction/traInfo)"
Loop (executed with the Java API) takes 18-19 seconds.
QueryInfo:
Timing:
- Parsing: 1.02 ms
- Compiling: 3.53 ms
- Evaluating: 21157.98 ms
- Printing: 144.41 ms
- Total Time: 21306.94 ms
Later query (suggested by you): "transaction/*[name() ne 'traInfo']"
Loop (executed with the Java API) takes 0.5 secs.
QueryInfo:
Timing:
- Parsing: 1.0 ms
- Compiling: 3.46 ms
- Evaluating: 15469.8 ms
- Printing: 56.87 ms
- Total Time: 15531.14 ms
The query returns around 94 lakhs (roughly 9.4 million) items.
So I was wondering: apart from the query change, are there any BaseX tuning or configuration changes I should make to further improve the time beyond 13 secs?
Meanwhile, I tried to run the same query on a high-end machine (64 GB RAM, 8-core Linux machine), with BaseX started with -Xmx32g. I did not see any improvement in execution time.
I am also attaching the -Xrunhprof:cpu output: former query = java.hprof_former_query.txt, later query = java.hprof_later_query.txt.
Hope the above information suffices. Thanks
Hi Kunal,
thanks for giving more details.
The latter takes 13 secs to complete, while the former was taking 21 secs.
Former query "transaction/* except (/transaction/traInfo)": loop (executed with the Java API) takes 18-19 seconds.
Later query (suggested by you) "transaction/*[name() ne 'traInfo']": loop (executed with the Java API) takes 0.5 secs.
This is still something I don't quite get: does the second query take 0.5 or 13 seconds?
The query returns around 94 lakhs (roughly 9.4 million) items.
So there are 94 lakhs of child elements that are not named "traInfo"? Two more questions: what does 'lakhs' mean here, and what's the total number of child nodes (try count(/transaction/*))?
So I was wondering: apart from the query change, are there any BaseX tuning or configuration changes I should make to further improve the time beyond 13 secs?
The query profiling results suggest that some additional time is spent on checking namespaces. Maybe the following query is slightly faster:
transaction/*[local-name() ne 'traInfo']
The remaining information indicates that most time is spent on sequentially parsing the document. If you work with the client/server architecture, you may benefit from caching effects.
Hope this helps, Christian
Hi Christian,
This is still something I don't quite get: does the second query take 0.5 or 13 seconds?
Actually, we are doing two things: 1) we fire the query through the Java API (QueryProcessor), and 2) we iterate over the results to serialize the items.
Firing the query ("transaction/*[local-name() ne 'traInfo']") took 13 seconds, while iterating over the result set takes 0.5 secs.
The query returns around 94 lakhs (roughly 9.4 million) items.
So there are 94 lakhs of child elements that are not named "traInfo"? Two more questions: what does 'lakhs' mean here, and what's the total number of child nodes (try count(/transaction/*))?
There are 94 lakhs (roughly 9.4 million) child elements that are not named "traInfo" in the result of the "transaction/*[local-name() ne 'traInfo']" query.
The total number of child elements is around 1.01 crores, the result of count(/transaction/*).
So I was wondering: apart from the query change, are there any BaseX tuning or configuration changes I should make to further improve the time beyond 13 secs?
The query profiling results suggest that some additional time is spent on checking namespaces. Maybe the following query is slightly faster:
transaction/*[local-name() ne 'traInfo']
I tried the query suggested above; it takes the same time to execute, and the loop iteration is unchanged as well.
Regards,
Hi Kunal,
Actually, we are doing two things:
- We fire the query through the Java API (QueryProcessor).
- We iterate over the results to serialize the items.
Thanks, that was helpful. I had another look at your original Java code: there is no need to call proc.execute() (which computes the full result); it is sufficient to call proc.iter().
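In other words, something along these lines should be enough (a sketch based on the code you posted earlier, with error handling and the start/end bookkeeping omitted):

// Sketch: iterate lazily over the results instead of materializing them with execute().
QueryProcessor proc = new QueryProcessor("transaction/*[name() ne 'traInfo']", context);
Iter iter = proc.iter();
for (Item item; (item = iter.next()) != null; ) {
    // process each item as it is produced, e.g. via item.serialize()
}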
The total number of child elements is around 1.01 crores, the result of count(/transaction/*).
Does 1.01 mean 1.01 million elements?
Hope this helps, Christian
Hi Christian,
Thanks for your great help !!!
In a previous mail in this thread you asked whether 1.01 means 1.01 million elements.
It means 10.01 million elements (the direct children of the transaction root element, i.e. the result of count(/transaction/*)).
Thanks & Regards, Kunal Chauhan