Hi there, I am running relatively simply XQuery over a Basex database read in from about 100 MB of XML.
While performance in general is fantastic, with most queries completing in a few seconds, I am getting worryingly slow performace on certain queries using XPath axes such as preceding-sibling.
For example, the following simple (if a bit silly) query takes over 30 minutes to execute:
/tns:pois/tns:poi[not(preceding-sibling::tns:poi)]
For what it's worth, there are only about 100,000 instances of the tns:poi element.
So the immediate question is whether this performance is to be expected. If not, what are the obvious things to look at in terms of performance tuning?
Thanks in advance, Constantine
Constantine Hondros | Senior Data Engineer | TomTom Places | constantine.hondros@tomtom.commailto:constantine.hondros@tomtom.com | +31 (0) 88 245 6239 office | +31 (0) 6 402 14838 mobile | places.tomtom.comhttp://places.tomtom.com/
Constantine,
thanks for the kudos.
/tns:pois/tns:poi[not(preceding-sibling::tns:poi)]
If I get it right, you're trying to select the first tns:poi/ element in your query. This can also be done via
/tns:pois/tns:poi[1]
After all, I still might have a deeper look at the performance of your query, as 30 minutes would of course be terribly slow.
Hope this helps, feel free to ask for more,
Christian BaseX Team ______________________________
So the immediate question is whether this performance is to be expected. If not, what are the obvious things to look at in terms of performance tuning?
Thanks in advance, Constantine
Constantine Hondros | Senior Data Engineer | TomTom Places | constantine.hondros@tomtom.com | +31 (0) 88 245 6239 office | +31 (0) 6 402 14838 mobile | places.tomtom.com
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Hi Christian, The XPath I showed was just a simple example of using the preceding-sibling axis. Of course, I am really interested in more useful queries such as:
//tns:poi[not(tns:id = preceding-sibling::tns:poi/tns:id)]
I guess what I am asking is ... while most XPath queries I run are quick to resolve, anything using preceding::, preceding-sibling:: or following::, following-sibling:: essentially runs too slowly to be workable - at least, on my sample dataset with about 100K sibling children of the root element.
So is this any sort of known issue? Should I be looking at any sort of configuration options? Is there any way to profile the execution of the Xpath?
Cheers,
C.
-----Original Message----- From: Christian Grün [mailto:christian.gruen@gmail.com] Sent: Wednesday, October 19, 2011 8:56 PM To: Constantine Hondros Cc: basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Performance of Basex 7.0 XQuery execution
Constantine,
thanks for the kudos.
/tns:pois/tns:poi[not(preceding-sibling::tns:poi)]
If I get it right, you're trying to select the first tns:poi/ element in your query. This can also be done via
/tns:pois/tns:poi[1]
After all, I still might have a deeper look at the performance of your query, as 30 minutes would of course be terribly slow.
Hope this helps, feel free to ask for more,
Christian BaseX Team ______________________________
So the immediate question is whether this performance is to be expected. If not, what are the obvious things to look at in terms of performance tuning?
Thanks in advance, Constantine
Constantine Hondros | Senior Data Engineer | TomTom Places | constantine.hondros@tomtom.com | +31 (0) 88 245 6239 office | +31 (0) 6 402 14838 mobile | places.tomtom.com
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Hi Constantine,
The XPath I showed was just a simple example of using the preceding-sibling axis.
I guessed so already.. Thanks for persisting ;)
I guess what I am asking is ... while most XPath queries I run are quick to resolve, anything using preceding::, preceding-sibling:: or following::, following-sibling:: essentially runs too slowly to be workable - at least, on my sample dataset with about 100K sibling children of the root element.
So is this any sort of known issue? Should I be looking at any sort of configuration options? Is there any way to profile the execution of the Xpath?
It's true that BaseX is mostly optimized for descendant and ancestor operations. In particular, the preceding/preceding-sibling axes will always be slower, as our storage does not contain backward pointers to preceding nodes. I'm pretty sure, though, that there are ways to either optimize relevant queries or the internal optimizer strategies, but we'll probably have to spend some more time on analyzing the relevant bottlenecks..
Feel free to ask for more, Christian
Hi Constantine Just my side note. As far as I remember, I found this sort of performance discussion with most the XQuery processors I have tested (Sedna, Zorba, Saxon, eXist...). Each time related to sibling, preceding etc. and each time explaining, that indices are not containing pointers for tracking this relationship. My feeling is, this could lead to much bigger index file and slower performance when adding documents and indexing. My conlcusion is - optimizing this sort of queries has its cost and mostly it is better (and somehow possible) to avoid these constructions.
With best regards
Jan
*Ing. Jan Vlčinský* CAD programy Slunečnicová 338/3, 734 01 Karviná Ráj, Czech Republic tel: +420-597 602 024; mob: +420-608 979 040 skype: janvlcinsky; GoogleTalk: jan.vlcinsky@gmail.com http://cz.linkedin.com/in/vlcinsky
On 20 October 2011 15:43, Christian Grün christian.gruen@gmail.com wrote:
Hi Constantine,
The XPath I showed was just a simple example of using the
preceding-sibling axis.
I guessed so already.. Thanks for persisting ;)
I guess what I am asking is ... while most XPath queries I run are quick
to resolve, anything using preceding::, preceding-sibling:: or following::, following-sibling:: essentially runs too slowly to be workable - at least, on my sample dataset with about 100K sibling children of the root element.
So is this any sort of known issue? Should I be looking at any sort of
configuration options? Is there any way to profile the execution of the Xpath?
It's true that BaseX is mostly optimized for descendant and ancestor operations. In particular, the preceding/preceding-sibling axes will always be slower, as our storage does not contain backward pointers to preceding nodes. I'm pretty sure, though, that there are ways to either optimize relevant queries or the internal optimizer strategies, but we'll probably have to spend some more time on analyzing the relevant bottlenecks..
Feel free to ask for more, Christian _______________________________________________ BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Dear Constantine,
Am 20.10.2011 09:44, schrieb Constantine Hondros:
Of course, I am really interested in more useful queries such as:
//tns:poi[not(tns:id = preceding-sibling::tns:poi/tns:id)]
this query basically compiles to two nested loops, the outer one iterating over //tns:poi and the inner one over all preceding-siblings. So it has an O(n^2) runtime.
I guess what I am asking is ... while most XPath queries I run are quick to resolve, anything using preceding::, preceding-sibling:: or following::, following-sibling:: essentially runs too slowly to be workable - at least, on my sample dataset with about 100K sibling children of the root element.
Well, that's an extreme example, 100'000^2 is a huge number of nodes to traverse. I performed some tests with a) 10'000 sibling elements inside the root: <root> <item id="1"/> <item id="2"/> [...] <item id="10000"/> </root>
and the query //item[not(preceding-sibling::*)]
b) 10'000 nested elements: <root> <item id="1"> <item id="2"> [...] <item id="10000"/> [...] </item> </item> </root>
and the query //item[not(ancestor::*)]
The timings I get are almost identical, replacing "preceding" with "following" or "ancestor" with "descendant" doesn't change much.
So is this any sort of known issue? Should I be looking at any sort of configuration options? Is there any way to profile the execution of the Xpath?
My guess is that the number of nodes on the *-sibling-axes is usually much bigger than on the vertical axes, so an expensive algorithm gets unusable earlier.
To return to your example query, you could drastically reduce its runtime by only traversing the database once, comparing each node only to the nodes with unique tns:ids found before. I'll use a left fold [1][2]:
fold-left( function($seq, $x) { if($seq/tns:id = $x/tns:id) then $seq else ($seq, $x) }, (), //tns:poi )
If you're OK with using not-yet-standard XQuery extensions, a map [3] instead of a sequence can potentially be even faster:
let $map := fold-left( function($map, $x) { let $key := $x/tns:id return if(map:contains($map, $key)) then $map else map:new(($map, map:entry($key, $x))) }, map:new(), //tns:poi ) for $key in map:keys($map) return $map($key)
But this won't preserve the relative ordering of the results.
Hope that helped, Cheers Leo __________
[1] http://www.w3.org/TR/xpath-functions-30/#func-fold-left [2] http://en.wikipedia.org/wiki/Fold_%28higher-order_function%29 [3] http://docs.basex.org/wiki/Map_Functions
basex-talk@mailman.uni-konstanz.de