Re: [basex-talk] following::* is 100x slower than preceding::*

5 Feb 2015


Hi Gioele,
Thanks again for your interesting query. During my absence, Lukas and
Dirk have provided you with many more valuable details on your query
than I gave in my initial mail. I even bluntly admit that the
information I gave you was simply wrong, because the following and
following-sibling axes are only slow in BaseX if you use them on main
memory XML fragments (which are usually much smaller anyway). If
database nodes are accessed, they even provide better performance than
the preceding axes.
Thanks for the XML file you sent to me in private. As has already been
indicated, the main reason for the bad performance is that the
following axis returns many (in total 595237) element nodes, which are
then filtered by the predicates. In your particular case, the first
part of your query…
//*[@xml:id = "lemma-aMSa"]
…returns a single result. Because of this, it would indeed be possible
to optimize your query to stop after the first three results. If it
returned two nodes, as detailed by Lukas, this would not be possible
anymore, because the following axes of both nodes could return
duplicate nodes.
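To illustrate the duplicate problem, here is a minimal sketch on a
made-up in-memory document (the element names a, b, c and d are
assumptions for illustration and are not taken from your file). It
lists the following axis separately for each of two context nodes:

```xquery
(: Hypothetical toy document with two <a/> context nodes. :)
let $doc := document {
  <root><a/><b/><a/><c/><d/></root>
}
(: Evaluate the following axis once per context node. :)
for $a in $doc//a
return string-join($a/following::* ! name(), ' ')
(: yields "b a c d" for the first <a/> and "c d" for the second :)
```

The elements c and d are reached from both context nodes, so a path
like $doc//a/following::* has to merge both sequences and remove the
duplicates before a positional filter can be applied to the overall
result.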
When diving into our optimization code, I found out that we could
actually optimize your query to be evaluated iteratively by evaluating
the database statistics: It would tell us that the attribute value
"lemma-aMSa" occurs only once in the index, and as a result, the path
will only return one result as well.
Your query was quite inspiring, though, which is why I have added a
new GitHub issue for it [1]. After the release of BaseX 8.0, we'll try
to find out if the optimizations that would speed up your query can be
generalized in a way that other queries will be optimized as well. I
have just added one specific optimization in the latest snapshot [2],
which ensures that the following rewriting of your query will be
evaluated faster than before:
declare namespace tei = 'http://www.tei-c.org/ns/1.0';
  ((//*[@xml:id = "lemma-aMSa"])[1]
    /following::*[self::tei:entry or self::tei:re]
  )[position() <= 3]
I also had a look at the warning message – and it turns out it's not
shown in 8.0 anymore.
Cheers,
Christian
[1] https://github.com/BaseXdb/basex/issues/1072
[2] http://files.basex.org/releases/latest/
On Thu, Feb 5, 2015 at 4:25 PM, Lukas Kircher lukaskircher1@gmail.com wrote:
...
...
In version 7.9, using parentheses like this does not help; I get the same ~250 milliseconds.
Suggest you try the latest snapshot (8.0), things change fast.
...
Actually I am surprised, as I was expecting this to be slower as it is more general and requires more data to be computed.
(might be wrong here) The results of the descendant step are streamed/pipelined to the following step. Meaning for one result of the descendant step, the following step is evaluated. If there are three results we’re done (w/ brackets). BaseX 8.0 might fix this if 7.9 doesn't.
...
In my mind this query is harder to optimize than mine, because "officially" the engine would have to: first, find all the nodes following the first node matching `//*[...]`, then all the nodes following the second node matching `//*[...]` and, only at the end, be able to sum them all and select only the first 3 nodes.
'Officially' is the magic word here (lots of stuff happens 'actually') - see above (streaming). Also, w/o the brackets each result node of '//*[@xml:id = "lemma-aMSa"]' has to be checked, as the predicate binds to the following step. You'd get the first 3 following nodes of each descendant step result node. If there's only one, it doesn't matter. Else you cannot optimize your query beyond a certain point.
You see, lots of if’s … but good questions!
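The binding difference described above can be sketched on a small made-up document (the element names and ids are assumptions for illustration only, not taken from the original file):

```xquery
(: Hypothetical toy document with two <a/> context nodes. :)
let $doc := document {
  <root><a id="x"/><b/><a id="y"/><c/><d/><e/></root>
}
return (
  (: Without brackets: the predicate binds to the following step and is
     applied once per context node; the results are then merged. :)
  string-join($doc//a/following::*[position() <= 3] ! name(), ' '),
  (: With brackets: the predicate is applied once to the merged,
     duplicate-free result of the whole path. :)
  string-join(($doc//a/following::*)[position() <= 3] ! name(), ' ')
)
(: yields "b a c d e" and "b a c" :)
```

With a single context node, as in the query discussed here, both variants select the same three nodes, so the brackets only change the result once the first step yields more than one node.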
