Hello,
I have a 1 GB XML file which contains 94,50,001 (about 9.45 million) records. When I fire a query to retrieve all the records, it takes approximately 30 seconds to execute.
I fired the query below using the Java API (QueryProcessor):
"transaction/* except (/transaction/traInfo)"
QueryProcessor proc = new QueryProcessor(query, context);
Result result = proc.execute();
System.out.println(result.size());
Iter itr = proc.iter();
Executing the code above took 40 secs. Is there scope for improvement without changing the query?
My XML file looks like this:
<transaction>
<traInfo id="ti1"> <date>01-01-2014</date> <source>bank1</source> </traInfo>
<traInfo id="ti2"> <date>01-01-2014</date> <source>bank2</source> </traInfo>
<traInfo id="ti3"> <date>01-01-2014</date> <source>bank3</source> </traInfo>
<income transInfoRef="ti1">1000</income> <assets transInfoRef="ti1">1000</assets> <liablity transInfoRef="ti1">1000</liablity> <grossprofit transInfoRef="ti1">1000</grossprofit>
<income transInfoRef="ti2">1000</income> <assets transInfoRef="ti2">1000</assets> <liablity transInfoRef="ti2">1000</liablity> <grossprofit transInfoRef="ti2">1000</grossprofit>
<income transInfoRef="ti3">1000</income> <assets transInfoRef="ti3">1000</assets> <liablity transInfoRef="ti3">1000</liablity> <grossprofit transInfoRef="ti3">1000</grossprofit>
...
</transaction>
Hi Kunal,
we need more information to help you here. What does your query look like? Does it benefit from index structures (please check the output in the QueryInfo panel)? Are your index structures up-to-date?
When I fire a query to retrieve all the records, it takes approximately 30 seconds to execute.
System.out.println(result.size());
Just in case: printing something to stdout often takes much more time than evaluating the query itself.
Christian
Hi Christian,
how do I print QueryInfo through the Java API?
I checked the query in the BaseX GUI, and there it shows Total Time: 21 secs. What else do I need to check in the QueryInfo panel?
Apart from this, the following is my DB info.
Indexes:
- Up-to-date: true
- TEXTINDEX: true
- ATTRINDEX: true
- FTINDEX: true
- LANGUAGE: English
- STEMMING: false
- CASESENS: false
- DIACRITICS: false
- STOPWORDS:
- UPDINDEX: false
- MAXCATS: 100
- MAXLEN: 96
I also recorded several timestamps in my code, which I am sharing with you:
QueryProcessor proc = new QueryProcessor(query, context);
Result result = proc.execute();   // took 21259 ms
System.out.println(result.size());
Iter itr = proc.iter();
while ((item = itr.next()) != null) {
    if (count >= start) System.out.println(/* item.serialize() */);
    count++;
    if (count > end) break;
}   // the loop took 18 more secs to complete
The total time taken is 40 secs for the following query:
"transaction/* except (/transaction/traInfo)"
- Kunal
how do I print QueryInfo through the Java API?
The -V flag or the QUERYINFO option are alternatives; please find more information in the documentation.
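For the Java API, a minimal sketch could look like the following; I am writing it from memory, so treat MainOptions.QUERYINFO and QueryProcessor.info() as assumptions and double-check them against your BaseX version:

import org.basex.core.Context;
import org.basex.core.MainOptions;
import org.basex.query.QueryProcessor;

// Sketch (names are assumptions, see above): enable query info, run the query,
// then print the collected parsing/compiling/evaluating timings.
Context context = new Context();
context.options.set(MainOptions.QUERYINFO, true);

QueryProcessor proc = new QueryProcessor("transaction/* except (/transaction/traInfo)", context);
proc.execute();
System.out.println(proc.info());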
"transaction/* except (/transaction/traInfo)"
You could try to rewrite this to...
transaction/* except traInfo
...or...
let $traInfo := /transaction/traInfo return transaction/* except $traInfo
...or...
transaction/*[name() ne 'traInfo']
Hi Christian,
Thanks for your suggestion. It gave me a considerable improvement.
Now I get the query result in 13 secs, and the loop completes within half a second. That is definitely a remarkable improvement.
Still, is there any way to optimize this further?
It would be great if you could explain why the query and loop took so much time earlier and why they now complete quickly.
Apart from this, I have one concern: in my application the XQueries will be provided by end users, so I will not always be able to change or optimize the query.
Does BaseX use a query optimizer, or can you suggest any external tool/library for this?
Regards, Kunal Chauhan
On Thu, Aug 21, 2014 at 9:11 PM, Christian Grün christian.gruen@gmail.com wrote:
transaction/* except traInfo
Sorry, this one was nonsense.
It would be great if you could explain why the query and loop took so much time earlier and why they now complete quickly.
In your original query...
transaction/* except (/transaction/traInfo)
...the second path expression was evaluated for each result of the first expression. While it would theoretically be possible for the query processor to cache the results of the second expression, it is difficult in practice to decide when this is reasonable. Besides that, you have a potentially large number of sets that need to be compared every time, resulting in at most n*n comparisons (i.e. O(n²)). The following query (which is probably the one you chose?) will always be linear:
transaction/*[name() ne 'traInfo']
Apart from this, I have one concern: in my application the XQueries will be provided by end users, so I will not always be able to change or optimize the query.
It is hardly possible to restrict end users to queries that are fast enough to be processed in a given time. As an example, the following query will take hours or even days to compute, even though it looks simple:
(1 to 10000000000)[. = 0]
However, you can limit evaluation time and memory consumption, as described here:
http://docs.basex.org/wiki/XQuery_Module#xquery:eval
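If the user queries are executed through the Java API as in your code, you could wrap them in xquery:eval before handing them to the QueryProcessor. The following is only a sketch written from memory; the 'timeout' option (in seconds) and the bind() call should be double-checked against the documentation:

// Sketch: run an untrusted user query with a time limit via xquery:eval.
// 'timeout' is in seconds; verify the option names against your BaseX version.
String userQuery = "(1 to 10000000000)[. = 0]";
String wrapper =
    "declare variable $query external; " +
    "xquery:eval($query, map { }, map { 'timeout': 10 })";

QueryProcessor proc = new QueryProcessor(wrapper, context);
proc.bind("query", userQuery);
Iter iter = proc.iter();   // evaluation should be stopped with an error once the limit is reached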
Does BaseX use a query optimizer, or can you suggest any external tool/library for this?
BaseX would be pretty much worthless without a query optimizer, so I don't quite understand what you mean by that.
Christian
Thanks Christian,
I changed the query from "transaction/* except (/transaction/traInfo)" to "transaction/*[name() ne 'traInfo']" as you suggested. The latter takes 13 secs to complete, while the former was taking 21 secs.
Loop:
while ((item = itr.next()) != null) {
    if (count >= start) System.out.println(/* item.serialize() */);
    count++;
    if (count > end) break;
}
Former query: "transaction/* except (/transaction/traInfo)"
Loop (executed with the Java API) takes 18-19 seconds.
QueryInfo:
Timing:
- Parsing: 1.02 ms
- Compiling: 3.53 ms
- Evaluating: 21157.98 ms
- Printing: 144.41 ms
- Total Time: 21306.94 ms
Later query (suggested by you): "transaction/*[name() ne 'traInfo']"
Loop (executed with the Java API) takes 0.5 secs.
QueryInfo:
Timing:
- Parsing: 1.0 ms
- Compiling: 3.46 ms
- Evaluating: 15469.8 ms
- Printing: 56.87 ms
- Total Time: 15531.14 ms
The query returns around 94 lakhs (roughly 9.4 million) items.
So I was wondering: apart from the query change, are there any BaseX tuning or configuration changes I should make to further improve the time beyond 13 secs?
Meanwhile, I tried to run the same query on a high-end machine (64 GB RAM, 8-core Linux machine), with BaseX started with -Xmx32g. I did not see any improvement in execution time.
I am also attaching the -Xrunhprof:cpu output: former query = java.hprof_former_query.txt, later query = java.hprof_later_query.txt.
Hope the above information suffices. Thanks
Hi Kunal,
thanks for giving more details.
The latter takes 13 secs to complete, while the former was taking 21 secs.
Former query "transaction/* except (/transaction/traInfo)": loop (executed with the Java API) takes 18-19 seconds.
Later query (suggested by you) "transaction/*[name() ne 'traInfo']": loop (executed with the Java API) takes 0.5 secs.
This is still something I don't quite get: does the second query take 0.5 or 13 seconds?
The query returns around 94 lakhs (roughly 9.4 million) items.
So there are 94 lakhs of child elements that are not named "traInfo"? Two more questions: what does 'lakhs' mean here, and what's the total number of child nodes (try count(/transaction/*))?
So I was wondering: apart from the query change, are there any BaseX tuning or configuration changes I should make to further improve the time beyond 13 secs?
The query profiling results suggest that some additional time is spent on checking namespaces. Maybe the following query is slightly faster:
transaction/*[local-name() ne 'traInfo']
The remaining information indicates that most time is spent on sequentially parsing the document. If you work with the client/server architecture, you may benefit from caching effects.
Hope this helps, Christian
Hi Christian,
This is still something I don't quite get: does the second query take 0.5 or 13 seconds?
Actually, we are doing two things: 1) we fire the query through the Java API (QueryProcessor), and 2) we iterate over the results to serialize the items.
Firing the query ("transaction/*[local-name() ne 'traInfo']") took 13 seconds, while iterating over the result set takes 0.5 secs.
The query returns around 94 lakhs (roughly 9.4 million) items.
So there are 94 lakhs of child elements that are not named "traInfo"? Two more questions: what does 'lakhs' mean here, and what's the total number of child nodes (try count(/transaction/*))?
There are 94 lakhs (roughly 9.4 million) child elements that are not named "traInfo" in the result of the "transaction/*[local-name() ne 'traInfo']" query.
The total number of child elements is around 1.01 crores, the result of count(/transaction/*).
So I was wondering: apart from the query change, are there any BaseX tuning or configuration changes I should make to further improve the time beyond 13 secs?
The query profiling results suggest that some additional time is spent on checking namespaces. Maybe the following query is slightly faster:
transaction/*[local-name() ne 'traInfo']
I tried the query suggested above; it takes the same time to execute, and the loop iteration is unchanged as well.
Regards,
Hi Kunal,
Actually, we are doing two things:
- We fire the query through the Java API (QueryProcessor).
- We iterate over the results to serialize the items.
Thanks, that was helpful. I had another look at your original Java code: there is no need to call proc.execute() (which computes the full result); it is sufficient to call proc.iter().
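In other words, something along these lines should be enough (a sketch based on the code you posted earlier, with error handling and the start/end bookkeeping omitted):

// Sketch: iterate lazily over the results instead of materializing them with execute().
QueryProcessor proc = new QueryProcessor("transaction/*[name() ne 'traInfo']", context);
Iter iter = proc.iter();
for (Item item; (item = iter.next()) != null; ) {
    // process each item as it is produced, e.g. via item.serialize()
}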
The total number of child elements is around 1.01 crores, the result of count(/transaction/*).
Does 1.01 mean 1.01 million elements?
Hope this helps, Christian
Hi Christian,
Thanks for your great help !!!
In a previous mail in this thread you asked whether 1.01 means 1.01 million elements.
It means 10.01 million elements (the direct children of the transaction root element, i.e. the result of count(/transaction/*)).
Thanks & Regards, Kunal Chauhan