Hello,
I am working on an application that retrieves its data from a TEI XML file via BaseX. The following query lies at the core of this application but is too slow to be used in production: on a modern PC it requires about 600 ms to run over a 4MB file (1/10 of the complete dataset). Any suggestion on how to improve its performance (without changing the underlying TEI files) would be much appreciated.
Here is the query:
declare namespace tei='http://www.tei-c.org/ns/1.0';

/tei:TEI/tei:text/tei:body//
  *[self::tei:entry or self::tei:re]
   [./tei:form/tei:orth[. = "arci"]
      [ancestor-or-self::*
        [@xml:lang][1]
        [(starts-with(@xml:lang, "san"))]
      ]
   ]
In human terms it should return all the `tei:entry` or `tei:re` elements that
* have the word "arci" in their `/tei:form/tei:orth` element,
* have a nearest `xml:lang` attribute that starts with 'san'.
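To make the second condition concrete, here is a minimal sketch on a made-up document (the sample entries are assumptions, not taken from the real dataset). Because `ancestor-or-self::` is a reverse axis, `[1]` picks the closest element that carries an `xml:lang`, so an element without its own attribute inherits the language of its nearest enclosing element:

```xquery
declare namespace tei = 'http://www.tei-c.org/ns/1.0';

(: Hypothetical sample data: the second entry has no xml:lang of its own. :)
declare variable $doc := document {
  <tei:TEI>
    <tei:text>
      <tei:body xml:lang="it">
        <tei:entry xml:lang="san-Latn">
          <tei:form><tei:orth>arci</tei:orth></tei:form>
        </tei:entry>
        <tei:entry>
          <tei:form><tei:orth>arci</tei:orth></tei:form>
        </tei:entry>
      </tei:body>
    </tei:text>
  </tei:TEI>
};

(: For each tei:orth, report the language of the nearest element with xml:lang. :)
for $orth in $doc//tei:orth
return string($orth/ancestor-or-self::*[@xml:lang][1]/@xml:lang)
```

On this sample the first `tei:orth` resolves to "san-Latn" and the second falls back to the "it" declared on `tei:body`, so only the first entry satisfies the query.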
I made some tests and it turned out that the main culprit is the use of `//` in the first line. (_Main_ culprit, not the only one...)
I use the `//` axis because I do not know the structure of the underlying TEI file. I expected BaseX to keep track of all the `tei:entry` and `tei:re` elements and their parents, so selecting the correct ones should be quite fast anyway. But the measurements disagree with my assumptions...
What could I do to improve the performance of this query?
Now, some remarks based on some small tests I have done:
1. Removing the
[ancestor-or-self::*[....]]
predicate slashes the run time in half, but the query is still way too slow.
2. Changing
./tei:form/tei:orth[. = "arci"]
to
./tei:form[1]/tei:orth[1][. = "arci"]
makes the query even slower.
3. Changing `starts-with(@xml:lang, "san")` to `@xml:lang = 'san-xxx'` has a negligible effect.
4. Dropping the `[1]` from
[@xml:lang][1]
makes the whole query twice as fast.
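One caveat about remark 4, sketched on made-up data: without `[1]` the predicate no longer tests the nearest `xml:lang` but any enclosing one, so the faster variant can return different results. The sample entry below is an assumption for illustration only.

```xquery
declare namespace tei = 'http://www.tei-c.org/ns/1.0';

(: Hypothetical entry: a Sanskrit entry containing an Italian sub-entry. :)
declare variable $entry :=
  <tei:entry xml:lang="san-Latn">
    <tei:re xml:lang="it">
      <tei:form><tei:orth>arci</tei:orth></tei:form>
    </tei:re>
  </tei:entry>;

let $orth := $entry//tei:orth
return (
  (: with [1]: the nearest xml:lang is "it", so nothing matches -> false :)
  exists($orth[ancestor-or-self::*[@xml:lang][1][starts-with(@xml:lang, "san")]]),
  (: without [1]: some ancestor carries "san-Latn" -> true :)
  exists($orth[ancestor-or-self::*[@xml:lang][starts-with(@xml:lang, "san")]])
)
```

If the data never nests elements with different languages, the two variants should agree and the faster one may be acceptable.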
Regards,
-- Gioele Barabucci gioele@svario.it
Hello Gioele,
I seem to remember that the use of namespaces slowed down (or maybe invalidated) the structure index. Someone at BaseX will certainly correct me if I am wrong, but if your data uses a single namespace, what about reloading the data with the "skip namespaces" option enabled and testing whether performance improves?
Another solution could be to create an index collection, where the keys would be your search terms and the values the node-pre or node-id of your (sub-)documents.
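As a rough illustration of the lookup idea, the sketch below builds the index as an in-memory map rather than as a separate collection, and stores the nodes themselves instead of node-pre or node-id values. Both are simplifications, the sample document is made up, and the code assumes a processor with XQuery 3.1 maps.

```xquery
declare namespace tei = 'http://www.tei-c.org/ns/1.0';

(: Hypothetical sample data. :)
declare variable $doc := document {
  <tei:TEI><tei:text><tei:body>
    <tei:entry xml:lang="san-Latn">
      <tei:form><tei:orth>arci</tei:orth></tei:form>
    </tei:entry>
    <tei:entry xml:lang="san-Latn">
      <tei:form><tei:orth>deva</tei:orth></tei:form>
    </tei:entry>
  </tei:body></tei:text></tei:TEI>
};

(: Hypothetical index: orthographic form -> all tei:orth nodes with that text. :)
declare variable $index := map:merge(
  for $orth in $doc//tei:orth
  group by $key := string($orth)
  return map:entry($key, $orth)
);

(: Look the term up once, then walk upwards to the enclosing entry. :)
$index("arci")/parent::tei:form/parent::*[self::tei:entry or self::tei:re]
```

The map has to be rebuilt whenever the data changes, which is the main cost of maintaining such an index by hand.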
Best regards, Fabrice
-----Original Message-----
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of Gioele Barabucci
Sent: Friday, 12 June 2015 10:42
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Optimization of a slow query with `//`
I don't have any TEI documents at hand, but maybe something like:
/tei:TEI/tei:text/tei:body
  //*[starts-with(@xml:lang, "san")]
  //(tei:entry | tei:re)
  [./tei:form/tei:orth = "arci"]
That would select (I believe) every tei:entry or tei:re whose tei:form/tei:orth is "arci" and that is a descendant of an element with @xml:lang starting with "san".
I guess you could do it the other way around as well: first select everything whose tei:orth = "arci", then restrict it by your specified language. That might be faster, depending on whether there are fewer tei:orth = "arci" elements than there are elements with @xml:lang starting with "san".
I hope I am not wrong or assuming too much here.
Kristian K
On 12.06.2015 at 11:21, Kristian Kankainen wrote:
> I don't have any TEI documents at hand, but maybe something like:
> /tei:TEI/tei:text/tei:body
>   //*[starts-with(@xml:lang, "san")]
>   //(tei:entry | tei:re)
>   [./tei:form/tei:orth = "arci"]
> That would select (I believe) all elements with @xml:lang starting with "san" that have as a descendant either a tei:entry or tei:re whose tei:form/tei:orth is "arci".
Thank you for your suggestion. Sadly, this would incorrectly select the `tei:re` inside this entry:
<tei:entry xml:lang="san-Latn">
  <tei:re xml:lang="it">
    <tei:form><tei:orth>arci</tei:orth></tei:form>
  </tei:re>
</tei:entry>
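The difference can be checked with a small self-contained sketch. It wraps the entry from this message in a minimal TEI skeleton (the skeleton is an assumption) and runs both the suggested query and the original one against it:

```xquery
declare namespace tei = 'http://www.tei-c.org/ns/1.0';

(: The entry from the message, wrapped in a minimal (assumed) TEI skeleton. :)
declare variable $doc := document {
  <tei:TEI><tei:text><tei:body>
    <tei:entry xml:lang="san-Latn">
      <tei:re xml:lang="it">
        <tei:form><tei:orth>arci</tei:orth></tei:form>
      </tei:re>
    </tei:entry>
  </tei:body></tei:text></tei:TEI>
};

let $suggested :=
  $doc/tei:TEI/tei:text/tei:body
    //*[starts-with(@xml:lang, "san")]
    //(tei:entry | tei:re)[./tei:form/tei:orth = "arci"]
let $original :=
  $doc/tei:TEI/tei:text/tei:body//*[self::tei:entry or self::tei:re]
    [./tei:form/tei:orth[. = "arci"]
      [ancestor-or-self::*[@xml:lang][1][starts-with(@xml:lang, "san")]]]
return (
  $suggested ! name(.),  (: "tei:re": selected although its language is "it" :)
  count($original)       (: 0: the nearest xml:lang of the tei:orth is "it" :)
)
```

The suggested query only requires some enclosing element to be Sanskrit, while the original one requires the nearest language declaration to be Sanskrit.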
> I guess you could do it the other way around as well, to first select everything that has its tei:orth = "arci", and limit it with your specified language.
I think I already do that, don't I?
I read this as "of all the elements with tei:orth = 'arci', select those with @xml:lang..."
/tei:TEI/tei:text/tei:body//
  *[self::tei:entry or self::tei:re]
   [./tei:form/tei:orth[. = "arci"]
      [ancestor-or-self::*
        [@xml:lang][1]
        [(starts-with(@xml:lang, "san"))]
      ]
   ]
Regards,
-- Gioele gioele@svario.it
Gioele, did you check in the execution plan that your query actually uses an index?
One way to force the use of the text index could be to start your query with: db:text('your-collection-name', 'arci')/parent::tei:orth/ and so on.
Regards,
-----Original Message-----
From: Fabrice Etanchaud
Sent: Friday, 12 June 2015 11:13
To: basex-talk@mailman.uni-konstanz.de
Subject: RE: [basex-talk] Optimization of a slow query with `//`
On 12.06.2015 at 11:46, Fabrice Etanchaud wrote:
> Gioele, did you check in the execution plan that your query actually uses an index?
> One way to force the use of the text index could be to start your query with: db:text('your-collection-name', 'arci')/parent::tei:orth/ and so on.
Hi Fabrice,
first, let me thank you for your suggestion: using `db:text()` drops the query time from 600ms to 2ms!
declare namespace tei='http://www.tei-c.org/ns/1.0';
db:text('collection', 'arci')/
  parent::tei:orth
    [ancestor-or-self::*
      [@xml:lang][1]
      [(starts-with(@xml:lang, "san"))]
    ]
  /parent::tei:form/parent::*[self::tei:entry or self::tei:re]
Sadly this will not work for other similar queries that use `[contains(., "text")]` instead of `[. = "text"]`, so I will keep researching more general solutions for the other cases.
Going back to your first question, how can I check that I am in fact using an index for a certain query? I know that the index is enabled, but I am not sure if the query engine is making any use of it.
Regards,
-- Gioele Barabucci gioele@svario.it
Hi Gioele,
It's usually a difficult task for the query compiler to rewrite nested predicates. The following query may be evaluated faster (as I don't have access to your data, I couldn't test it):
declare namespace tei='http://www.tei-c.org/ns/1.0';
/descendant::tei:orth
  [text() = "arci"]
  [ancestor-or-self::*
    [@xml:lang][1][starts-with(@xml:lang, "san")]
  ]
  /parent::tei:form
  /(parent::tei:entry | parent::tei:re)
  [parent::tei:body/parent::tei:text/parent::tei:TEI
    /parent::document-node()]
Depending on the structure of your data, it may be possible to simplify some of the predicates. As Fabrice suggested, you should check the query info output in order to see if the text index is utilized.
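One structural point is worth checking before adopting the rewrite: the query as posted anchors each hit with `parent::tei:body`, so it only matches entries that are direct children of `tei:body`, whereas the original `//` also reaches nested ones. The sketch below uses made-up sample data, writes `parent::tei:TEI` with the prefix (assuming the unprefixed `TEI` in the posted query is a typo), and compares the posted anchor with an assumed `ancestor::tei:body` variant:

```xquery
declare namespace tei = 'http://www.tei-c.org/ns/1.0';

(: Hypothetical sample: a matching tei:re nested inside a matching tei:entry. :)
declare variable $doc := document {
  <tei:TEI><tei:text><tei:body>
    <tei:entry xml:lang="san-Latn">
      <tei:form><tei:orth>arci</tei:orth></tei:form>
      <tei:re>
        <tei:form><tei:orth>arci</tei:orth></tei:form>
      </tei:re>
    </tei:entry>
  </tei:body></tei:text></tei:TEI>
};

(: Shared bottom-up start: matching tei:orth -> tei:form -> tei:entry or tei:re. :)
declare variable $hits :=
  $doc/descendant::tei:orth[text() = "arci"]
    [ancestor-or-self::*[@xml:lang][1][starts-with(@xml:lang, "san")]]
    /parent::tei:form/(parent::tei:entry | parent::tei:re);

(
  (: as posted: the hit must be a direct child of tei:body -> 1 :)
  count($hits[parent::tei:body/parent::tei:text/parent::tei:TEI
                /parent::document-node()]),
  (: assumed variant: the hit may sit anywhere below tei:body -> 2 :)
  count($hits[ancestor::tei:body/parent::tei:text/parent::tei:TEI
                /parent::document-node()])
)
```

If every `tei:entry` and `tei:re` in the data is a direct child of `tei:body`, both anchors select the same elements.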
Christian
On 12.06.2015 at 18:31, Christian Grün wrote:
> Hi Gioele,
> It's usually a difficult task for the query compiler to rewrite nested predicates. The following query may be evaluated faster (as I don't have access to your data, I couldn't test it):
> declare namespace tei='http://www.tei-c.org/ns/1.0';
> /descendant::tei:orth
>   [text() = "arci"]
>   [ancestor-or-self::*
>     [@xml:lang][1][starts-with(@xml:lang, "san")]
>   ]
>   /parent::tei:form
>   /(parent::tei:entry | parent::tei:re)
>   [parent::tei:body/parent::tei:text/parent::tei:TEI
>     /parent::document-node()]
Hi Christian,
your query does indeed execute much faster than mine: ~130 ms vs 600 ms.
My question is, would it be hard to detect a `[text() = X]` predicate and turn it into a `db:text()` query as suggested by Fabrice?
In my case that optimization would turn
/descendant::tei:orth
  [text() = "arci"]
  [ancestor-or-self::*
    [@xml:lang][1][starts-with(@xml:lang, "san")]
  ]
  /parent::tei:form/parent::*[self::tei:entry | self::tei:re]
into
db:text('collection', 'arci')/
  parent::tei:orth
    [ancestor-or-self::*
      [@xml:lang][1][(starts-with(@xml:lang, "san"))]
    ]
  /parent::tei:form/parent::*[self::tei:entry or self::tei:re]
by hoisting the `text()` comparison and inverting the direction of the axis, from `descendant::` to `parent::`.
It would be nice to be able to write two queries that look almost the same, one using `text() = X` and the other `contains(text(), X)`, and have BaseX optimize them in different ways (db:text vs. full text search). :)
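For reference, the three kinds of match are not interchangeable, which is part of why they call for different indexes. The sketch below uses made-up `orth` values and assumes a processor that supports XQuery Full Text (for the `contains text` expression), as BaseX does:

```xquery
(: Hypothetical sample values. :)
let $orths := (
  <orth>arci</orth>,
  <orth>arcidiacono</orth>,
  <orth>gli arci</orth>
)
return (
  (: exact comparison: only the first element matches -> 1 :)
  count($orths[text() = "arci"]),
  (: substring search: all three contain "arci" -> 3 :)
  count($orths[contains(text(), "arci")]),
  (: full text: matches whole tokens, so "arcidiacono" is not a hit :)
  count($orths[text() contains text "arci"])
)
```

A full-text index therefore cannot simply stand in for `contains()`: it answers token queries, not arbitrary substring queries.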
If you want I can send you the data privately.
Regards,
-- Gioele Barabucci gioele@svario.it
> My question is, would it be hard to detect a `[text() = X]` predicate and turn it into a `db:text()` query as suggested by Fabrice?
This should already happen. Did you have a look at the compiled query (GUI: Info View; Command Line: -V; Option: QUERYINFO)?
Feel free to send me the XML data you are working with.
Christian