Hi Gioele,
Thanks again for your interesting query. During my absence, Lukas and Dirk have provided you with much more reliable details on your query than I gave in my initial mail. I have to admit bluntly that the information I gave you was simply wrong: the following and following-sibling axes are only slow in BaseX if you use them on main-memory XML fragments (which are usually much smaller anyway). If database nodes are accessed, they even provide better performance than the preceding axes.
Thanks for the XML file you sent to me in private. As has already been indicated, the main reason for the bad performance is that the following axis returns many (in total 595237) element nodes, which are then filtered by the predicates. In your particular case, the first part of your query…
//*[@xml:id = "lemma-aMSa"]
…returns a single result. Because of this, it would indeed be possible to optimize your query to stop after the first three results. If it returned two nodes, as detailed by Lukas, this would not be possible anymore, because the following steps of both nodes could yield duplicate nodes, which have to be removed first.
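To make the duplicate issue concrete, here is a small sketch on a made-up document (not the TEI file from this thread; the element names a and e and the id value "x" are assumptions chosen for illustration). It uses position() <= 2 instead of 3 so that the difference already shows up with three following elements:

```xquery
(: Toy document for illustration only, not the original data. :)
let $doc := document {
  <root>
    <a id="x"/>
    <e n="1"/>
    <a id="x"/>
    <e n="2"/>
    <e n="3"/>
  </root>
}
return (
  (: Without parentheses, the predicate is applied per context node:
     the first <a> contributes e1 and e2, the second one e2 and e3.
     The path merges both results and removes the duplicate e2. :)
  count($doc//a[@id = "x"]/following::e[position() <= 2]),
  (: With parentheses, the predicate is applied once to the merged,
     duplicate-free result of the whole path. :)
  count(($doc//a[@id = "x"]/following::e)[position() <= 2])
)
```

This should evaluate to 3 and 2. With a single matching a element, both variants would return the same nodes, which is why stopping early is only safe when the first step is known to yield one result.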
When diving into our optimization code, I found out that we could actually optimize your query to be evaluated iteratively by consulting the database statistics: they would tell us that the attribute value "lemma-aMSa" occurs only once in the index and that, as a result, the path will return only one result as well.
Your query was quite inspiring, though, which is why I have added a new GitHub issue for it [1]. After the release of BaseX 8.0, we'll try to find out if the optimizations that would speed up your query can be generalized in a way that other queries will be optimized as well. I have just added one specific optimization in the latest snapshot [2], which ensures that the following rewriting of your query will be evaluated faster than before:
declare namespace tei = 'http://www.tei-c.org/ns/1.0';
(
  (//*[@xml:id = "lemma-aMSa"])[1]
  /following::*[self::tei:entry or self::tei:re]
)[position() <= 3]
I also had a look at the warning message – and it turns out it's not shown in 8.0 anymore.
Cheers, Christian
[1] https://github.com/BaseXdb/basex/issues/1072
[2] http://files.basex.org/releases/latest/
On Thu, Feb 5, 2015 at 4:25 PM, Lukas Kircher lukaskircher1@gmail.com wrote:
In version 7.9, using the parentheses like this does not help; I get the same ~250 milliseconds.
Suggest you try the latest snapshot (8.0), things change fast.
Actually I am surprised, as I was expecting this to be slower as it is more general and requires more data to be computed.
(might be wrong here) The results of the descendant step are streamed/pipelined to the following step. Meaning for one result of the descendant step, the following step is evaluated. If there are three results we’re done (w/ brackets). BaseX 8.0 might fix this if 7.9 doesn't.
In my mind this query is harder to optimize than mine, because "officially" the engine would have to: first, find all the nodes following the first node matching `//*[...]`, then all the nodes following the second node matching `//*[...]` and, only at the end, be able to sum them all and select only the first 3 nodes.
'Officially' is the magic word here (lots of stuff happens 'actually') - see above (streaming). Also, w/o the brackets, each result node of //*[@xml:id = "lemma-aMSa"] has to be checked, as the predicate binds to the following step. You'd get the first 3 following nodes of each descendant step result node. If there's only one, it doesn't matter. Else you cannot optimize your query beyond a certain point.
You see, lots of if’s … but good questions!