following::* is 100x slower than preceding::*

List overview All Threads
Download

newer

older

Regression? Unknown parameter:...

BaseX in maven

Gioele Barabucci

5 Feb 2015 5 Feb '15

8:12 a.m.

Hello,

I have noticed that this query using the "following" axes

//*[@xml:id = "lemma-aMSa"] /following::*[self::tei:entry or self::tei:re] [position() <= 3]

is much slower than the same query with the "preceding" axes

//*[@xml:id = "lemma-aMSa"] /preceding::*[self::tei:entry or self::tei:re] [position() <= 3]

The query that uses "preceding" takes about 2.5 ms to execute, while the one using "following" takes about 250 ms: it is 100 times slower.

Why this discrepancy between these two queries?

I can provide the base XML file (19MB) on request.

Also, I also get a warning about «'following::*[(self::tei:entry or self::tei:re)][(fn:position() <= 3)]' will never yield results.» but that is obviously false, as it yields exactly the 3 results I expect.

Regards,

-- Gioele Barabucci gioele@svario.it

Show replies by date

Christian Grün

5 Feb 5 Feb

8:23 a.m.

Hi Gioele,

I can confirm that the following axis is pretty expensive in BaseX, as we do not store explicit sibling references. The preceding axis is cheaper as we can stop search as soon as we traverse over the node we started from.

One way out is to first access the following nodes in your document and move the preceding node check in a predicate.

...

Also, I also get a warning about «'following::*[(self::tei:entry or self::tei:re)][(fn:position() <= 3)]' will never yield results.» but that is obviously false, as it yields exactly the 3 results I expect.

That's surprising indeed. Yes, feel free to send me your XML document in private.

Hope this helps, Christian

On Thu, Feb 5, 2015 at 2:12 PM, Gioele Barabucci gioele@svario.it wrote:

...

Hello,

I have noticed that this query using the "following" axes
//*[@xml:id = "lemma-aMSa"]
        /following::*[self::tei:entry or self::tei:re]
                     [position() <= 3]
is much slower than the same query with the "preceding" axes
//*[@xml:id = "lemma-aMSa"]
        /preceding::*[self::tei:entry or self::tei:re]
                     [position() <= 3]
The query that uses "preceding" takes about 2.5 ms to execute, while the one using "following" takes about 250 ms: it is 100 times slower.

Why this discrepancy between these two queries?

I can provide the base XML file (19MB) on request.

Also, I also get a warning about «'following::*[(self::tei:entry or self::tei:re)][(fn:position() <= 3)]' will never yield results.» but that is obviously false, as it yields exactly the 3 results I expect.

Regards,

-- Gioele Barabucci gioele@svario.it

Lukas Kircher

8:56 a.m.

Hi Gioele,

I just want to add a quick clarification as I feel like some stuff got mixed up here (Christian may correct me if his version is backed by some compiler-voodoo that I’m neglecting, as he definitely knows more about the matter).

In the end, the performance depends strongly on your document structure. The ‘following’ version is expensive if the resulting node(s) of '//*[@xml:id = "lemma-aMSa”]’ have a lot of following nodes (siblings included). If there are lots of preceding nodes, the ‘preceding’ query version could be more expensive.

Point being, evaluation of the following axis is not exactly expensive in BaseX, at least compared to the preceding axis (also mind, that we have a reference to the first following node of a node, but not to the preceding node). Arriving at a conclusion about axis evaluation performance cannot be based on the two given queries (as they are non-equivalent).

Hope this doesn’t add to the confusion, though I think it does - Lukas

...

On 05 Feb 2015, at 14:23, Christian Grün christian.gruen@gmail.com wrote:

Hi Gioele,

I can confirm that the following axis is pretty expensive in BaseX, as we do not store explicit sibling references. The preceding axis is cheaper as we can stop search as soon as we traverse over the node we started from.

One way out is to first access the following nodes in your document and move the preceding node check in a predicate.

...
Also, I also get a warning about «'following::*[(self::tei:entry or self::tei:re)][(fn:position() <= 3)]' will never yield results.» but that is obviously false, as it yields exactly the 3 results I expect.

That's surprising indeed. Yes, feel free to send me your XML document in private.

Hope this helps, Christian

On Thu, Feb 5, 2015 at 2:12 PM, Gioele Barabucci gioele@svario.it wrote:

...
Hello,

I have noticed that this query using the "following" axes

//*[@xml:id = "lemma-aMSa"] /following::*[self::tei:entry or self::tei:re] [position() <= 3]

is much slower than the same query with the "preceding" axes

//*[@xml:id = "lemma-aMSa"] /preceding::*[self::tei:entry or self::tei:re] [position() <= 3]

The query that uses "preceding" takes about 2.5 ms to execute, while the one using "following" takes about 250 ms: it is 100 times slower.

Why this discrepancy between these two queries?

I can provide the base XML file (19MB) on request.

Also, I also get a warning about «'following::*[(self::tei:entry or self::tei:re)][(fn:position() <= 3)]' will never yield results.» but that is obviously false, as it yields exactly the 3 results I expect.

Regards,

-- Gioele Barabucci gioele@svario.it

Gioele Barabucci

9:09 a.m.

Am 05.02.2015 um 14:56 schrieb Lukas Kircher:

...

In the end, the performance depends strongly on your document structure. The ‘following’ version is expensive if the resulting node(s) of '//*[@xml:id = "lemma-aMSa”]’ have a lot of following nodes (siblings included). If there are lots of preceding nodes, the ‘preceding’ query version could be more expensive.

Indeed in this case that node has a hundred preceding elements and a dozen thousand following nodes, i.e. 100x more following nodes.

Looking again at the query,

//*[@xml:id = "lemma-aMSa"] /following::*[self::tei:entry or self::tei:re] [position() <= 3]

my question is: once the engine has found the node(s) `//*[@xml:id = "lemma-aMSa”]`, couldn't it stop searching for nodes that match `following::*[(self::tei:entry or self::tei:re)]` once it has found 3 of them?

Or, in other words, couldn't a positional predicate like `[position() <= 3]` be used to short-circuit the preceding part of the query? I am pretty sure that Saxon, for example, does exactly that for simple positional predicates like `[1]` or `[position() <= X]`.

Regards,

-- Gioele Barabucci gioele@svario.it

Dirk Kirsten

9:24 a.m.

Hello Gioele,

just my two cents as well...

Your predicate [position() <= 3] applies to the following nodes, not the node with the given ID. You need to use parenthesis to do so:

(//*[@xml:id = "lemma-aMSa"]/following::*[self::tei:entry or self::tei:re])[position() <= 3]

Then I am quite sure that BaseX will also stop once it found the three matching elements.

Cheers Dirk

On 02/05/2015 03:09 PM, Gioele Barabucci wrote:

...

Am 05.02.2015 um 14:56 schrieb Lukas Kircher:

...
In the end, the performance depends strongly on your document structure. The ‘following’ version is expensive if the resulting node(s) of '//*[@xml:id = "lemma-aMSa”]’ have a lot of following nodes (siblings included). If there are lots of preceding nodes, the ‘preceding’ query version could be more expensive.

Indeed in this case that node has a hundred preceding elements and a dozen thousand following nodes, i.e. 100x more following nodes.

Looking again at the query,
//*[@xml:id = "lemma-aMSa"]
        /following::*[self::tei:entry or self::tei:re]
                     [position() <= 3]
my question is: once the engine has found the node(s) `//*[@xml:id = "lemma-aMSa”]`, couldn't it stop searching for nodes that match `following::*[(self::tei:entry or self::tei:re)]` once it has found 3 of them?

Or, in other words, couldn't a positional predicate like `[position() <= 3]` be used to short-circuit the preceding part of the query? I am pretty sure that Saxon, for example, does exactly that for simple positional predicates like `[1]` or `[position() <= X]`.

Regards,

-- Gioele Barabucci gioele@svario.it

-- Dirk Kirsten, BaseX GmbH, http://basexgmbh.de |-- Firmensitz: Blarerstrasse 56, 78462 Konstanz |-- Registergericht Freiburg, HRB: 708285, Geschäftsführer: | Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle `-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22

Lukas Kircher

9:25 a.m.

...

my question is: once the engine has found the node(s) `//*[@xml:id = "lemma-aMSa”]`, couldn't it stop searching for nodes that match `following::*[(self::tei:entry or self::tei:re)]` once it has found 3 of them?

Correct. Well, maybe.

...

Or, in other words, couldn't a positional predicate like `[position() <= 3]` be used to short-circuit the preceding part of the query? I am pretty sure that Saxon, for example, does exactly that for simple positional predicates like `[1]` or `[position() <= X]`.

Christian is actually doing a very good job enabling the compiler to catch exactely these cases. Unfortunately it is far from trivial to detect each possible scenario. With your given query the optimization might be triggered if you put the first part of the expression in brackets, like:

( //*[@xml:id = "lemma-aMSa"] /following::*[self::tei:entry or self::tei:re] ) [position() <= 3]

Note: this only works, if there’s only one node with the id "lemma-aMSa” in your document, otherwise the expressions are not equivalent. If this is not the case or if the brackets don’t fix it, I’m passing on to Christian for either an explanation or a faster-than-expected fix.

Cheers ...

Lukas Kircher

9:25 a.m.

Draw : ) - 15:25.

...

On 05 Feb 2015, at 15:25, Lukas Kircher lukaskircher1@gmail.com wrote:

...
my question is: once the engine has found the node(s) `//*[@xml:id = "lemma-aMSa”]`, couldn't it stop searching for nodes that match `following::*[(self::tei:entry or self::tei:re)]` once it has found 3 of them?

Correct. Well, maybe.

...
Or, in other words, couldn't a positional predicate like `[position() <= 3]` be used to short-circuit the preceding part of the query? I am pretty sure that Saxon, for example, does exactly that for simple positional predicates like `[1]` or `[position() <= X]`.

Christian is actually doing a very good job enabling the compiler to catch exactely these cases. Unfortunately it is far from trivial to detect each possible scenario. With your given query the optimization might be triggered if you put the first part of the expression in brackets, like:

( //*[@xml:id = "lemma-aMSa"] /following::*[self::tei:entry or self::tei:re] ) [position() <= 3]

Note: this only works, if there’s only one node with the id "lemma-aMSa” in your document, otherwise the expressions are not equivalent. If this is not the case or if the brackets don’t fix it, I’m passing on to Christian for either an explanation or a faster-than-expected fix.

Cheers ...

Gioele Barabucci

10 a.m.

Am 05.02.2015 um 15:25 schrieb Lukas Kircher:

...

...
Or, in other words, couldn't a positional predicate like `[position() <= 3]` be used to short-circuit the preceding part of the query? I am pretty sure that Saxon, for example, does exactly that for simple positional predicates like `[1]` or `[position() <= X]`.

Christian is actually doing a very good job enabling the compiler to catch exactely these cases. Unfortunately it is far from trivial to detect each possible scenario. With your given query the optimization might be triggered if you put the first part of the expression in brackets, like:

( //*[@xml:id = "lemma-aMSa"] /following::*[self::tei:entry or self::tei:re] ) [position() <= 3]

Hi Lukas,

in version 7.9 using the parenthesis like this does not help, I get the same ~250 milliseconds.

Actually I am surprised, as I was expecting this to be slower as it is more general and requires more data to be computed.

In my mind this query is harder to optimize than mine, because "officially" the engine would have to: first, find all the nodes following the first node matching `//*[...]`, then all the nodes following the second node matching `//*[...]` and, only at the end, be able to sum them all and select only the first 3 nodes.

Bye,

-- Gioele Barabucci gioele@svario.it

Lukas Kircher

10:25 a.m.

...

in version 7.9 using the parenthesis like this does not help, I get the same ~250 milliseconds.

Suggest you try the latest snapshot (8.0), things change fast.

...

Actually I am surprised, as I was expecting this to be slower as it is more general and requires more data to be computed.

(might be wrong here) The results of the descendant step are streamed/pipelined to the following step. Meaning for one result of the descendant step, the following step is evaluated. If there are three results we’re done (w/ brackets). BaseX 8.0 might fix this if 7.9 doesn't.

...

In my mind this query is harder to optimize than mine, because "officially" the engine would have to: first, find all the nodes following the first node matching `//*[...]`, then all the nodes following the second node matching `//*[...]` and, only at the end, be able to sum them all and select only the first 3 nodes.

‘Officially' is the magic word here (lots of stuff happens ‘actually’) - see above (streaming) + w/o the brackets each result node of '//*[@xml:id = "lemma-aMSa”]’ has to be checked, as the predicate binds to the following step. You’d get the first 3 following nodes of each descendant step result node. If there’s only one, it doesn’t matter. Else you cannot optimize your query beyond a certain point.

You see, lots of if’s … but good questions!

Christian Grün

2:31 p.m.

Hi Gioele,

Thanks again for your interesting query. During my absence, Lukas and Dirk have provided you with many other veritable details on your query than I gave in my iniitial mail. I even bluntly admit that the information I gave you was simply wrong, because the following and following-sibling axes are only slow in BaseX if you use them on main memory XML fragment (which are usually much smaller anyway). If database nodes are accessed, they even provide better performance than the preceding axes.

Thanks for the XML file you sent to me in private. As has already been indicated, the main reason for the bad performance is that the following axis returns many (in total 595237) element nodes, which are then filtered by the predicates. In your particular case, the first part of your query…

//*[@xml:id = "lemma-aMSa"]

…returns a single result. Because of this, it would indeed be possible to optimize your query to stop after the first three results. If if returned two nodes, as detailed by Lukas, this would not be possible anymore, because both nodes could return duplicate nodes.

When diving into our optimization code, I found out that we could actually optimize your query to be evaluated iteratively by evaluating the database statistics: It would tell us that the attribute value "lemma-aMSa" occurs only once in the index, and as a result, the path will only return one result as well.

Your query was quite inspiring, though, which is why I have added a new GitHub issue for it [1]. After the release of BaseX 8.0, we'll try to find out if the optimizations that would speed up your query can be generalized in a way that other queries will be optimized as well. I have just added one specific optimization in the latest snaphshot [2], which ensures that the following rewriting of your query will be evaluated faster than before:

declare namespace tei = 'http://www.tei-c.org/ns/1.0'; ((//*[@xml:id = "lemma-aMSa"])[1] /following::*[self::tei:entry or self::tei:re] )[position() <= 3]

I also had a look at the warning message – and it turns out it's not shown in 8.0 anymore.

Cheers, Christian

[1] https://github.com/BaseXdb/basex/issues/1072 [2] http://files.basex.org/releases/latest/

On Thu, Feb 5, 2015 at 4:25 PM, Lukas Kircher lukaskircher1@gmail.com wrote:

...

...
in version 7.9 using the parenthesis like this does not help, I get the same ~250 milliseconds.

Suggest you try the latest snapshot (8.0), things change fast.

...
Actually I am surprised, as I was expecting this to be slower as it is more general and requires more data to be computed.

(might be wrong here) The results of the descendant step are streamed/pipelined to the following step. Meaning for one result of the descendant step, the following step is evaluated. If there are three results we’re done (w/ brackets). BaseX 8.0 might fix this if 7.9 doesn't.

...
In my mind this query is harder to optimize than mine, because "officially" the engine would have to: first, find all the nodes following the first node matching `//*[...]`, then all the nodes following the second node matching `//*[...]` and, only at the end, be able to sum them all and select only the first 3 nodes.

‘Officially' is the magic word here (lots of stuff happens ‘actually’) - see above (streaming) + w/o the brackets each result node of '//*[@xml:id = "lemma-aMSa”]’ has to be checked, as the predicate binds to the following step. You’d get the first 3 following nodes of each descendant step result node. If there’s only one, it doesn’t matter. Else you cannot optimize your query beyond a certain point.

You see, lots of if’s … but good questions!

Christian Grün

11 Feb 11 Feb

5:02 p.m.

Hi Gioele,

It's not very loyal to already stop advertising BaseX 8.0 (I surely won't do so), but nonetheless I invite you to check out the latest 8.0.1 snapshot [1], which includes optimizations for the observations made in [2] and in this thread. Your original query will now be evaluated in a few milliseconds. Looking forward to your feedback.

S pozdravem, Christian

[1] http://files.basex.org/releases/latest/ [2] https://github.com/BaseXdb/basex/issues/1072

On Thu, Feb 5, 2015 at 8:31 PM, Christian Grün christian.gruen@gmail.com wrote:

...

Hi Gioele,

Thanks again for your interesting query. During my absence, Lukas and Dirk have provided you with many other veritable details on your query than I gave in my iniitial mail. I even bluntly admit that the information I gave you was simply wrong, because the following and following-sibling axes are only slow in BaseX if you use them on main memory XML fragment (which are usually much smaller anyway). If database nodes are accessed, they even provide better performance than the preceding axes.

Thanks for the XML file you sent to me in private. As has already been indicated, the main reason for the bad performance is that the following axis returns many (in total 595237) element nodes, which are then filtered by the predicates. In your particular case, the first part of your query…

//*[@xml:id = "lemma-aMSa"]

…returns a single result. Because of this, it would indeed be possible to optimize your query to stop after the first three results. If if returned two nodes, as detailed by Lukas, this would not be possible anymore, because both nodes could return duplicate nodes.

When diving into our optimization code, I found out that we could actually optimize your query to be evaluated iteratively by evaluating the database statistics: It would tell us that the attribute value "lemma-aMSa" occurs only once in the index, and as a result, the path will only return one result as well.

Your query was quite inspiring, though, which is why I have added a new GitHub issue for it [1]. After the release of BaseX 8.0, we'll try to find out if the optimizations that would speed up your query can be generalized in a way that other queries will be optimized as well. I have just added one specific optimization in the latest snaphshot [2], which ensures that the following rewriting of your query will be evaluated faster than before:

declare namespace tei = 'http://www.tei-c.org/ns/1.0'; ((//*[@xml:id = "lemma-aMSa"])[1] /following::*[self::tei:entry or self::tei:re] )[position() <= 3]

I also had a look at the warning message – and it turns out it's not shown in 8.0 anymore.

Cheers, Christian

[1] https://github.com/BaseXdb/basex/issues/1072 [2] http://files.basex.org/releases/latest/

On Thu, Feb 5, 2015 at 4:25 PM, Lukas Kircher lukaskircher1@gmail.com wrote:

...
...
in version 7.9 using the parenthesis like this does not help, I get the same ~250 milliseconds.

Suggest you try the latest snapshot (8.0), things change fast.

...
Actually I am surprised, as I was expecting this to be slower as it is more general and requires more data to be computed.

(might be wrong here) The results of the descendant step are streamed/pipelined to the following step. Meaning for one result of the descendant step, the following step is evaluated. If there are three results we’re done (w/ brackets). BaseX 8.0 might fix this if 7.9 doesn't.

...
In my mind this query is harder to optimize than mine, because "officially" the engine would have to: first, find all the nodes following the first node matching `//*[...]`, then all the nodes following the second node matching `//*[...]` and, only at the end, be able to sum them all and select only the first 3 nodes.

‘Officially' is the magic word here (lots of stuff happens ‘actually’) - see above (streaming) + w/o the brackets each result node of '//*[@xml:id = "lemma-aMSa”]’ has to be checked, as the predicate binds to the following step. You’d get the first 3 following nodes of each descendant step result node. If there’s only one, it doesn’t matter. Else you cannot optimize your query beyond a certain point.

You see, lots of if’s … but good questions!

Gioele Barabucci

18 Feb 18 Feb

4:24 a.m.

Am 11.02.2015 um 23:02 schrieb Christian Grün:

...

Hi Gioele,

It's not very loyal to already stop advertising BaseX 8.0 (I surely won't do so), but nonetheless I invite you to check out the latest 8.0.1 snapshot [1], which includes optimizations for the observations made in [2] and in this thread. Your original query will now be evaluated in a few milliseconds. Looking forward to your feedback.

S pozdravem, Christian

[1] http://files.basex.org/releases/latest/ [2] https://github.com/BaseXdb/basex/issues/1072

Thank you Christian,

the query is indeed much faster in the current 8.0.1 snapshot.

I will now go on and see if this "release" helps also with other performance problems we were experiencing.

Bye and thank you!

-- Gioele Barabucci gioele@svario.it

3803

Age (days ago)

3816

Last active (days ago)

basex-talk@mailman.uni-konstanz.de

11 comments

4 participants

tags (0)

participants (4)

Christian Grün
Dirk Kirsten
Gioele Barabucci
Lukas Kircher