Hi, I'm trying to use BaseX for linguistic queries on a TEI document containing annotated tokens (i.e. tei:w elements with attributes). I'm specifically interested in distance queries that search for combinations of token features within a given window (e.g. all nouns that have an adjective ending in 'lein' within a distance of 3). In theory, this is easy to formulate as an XPath or XQuery expression, but performance is poor once the dataset gets larger (my test document has a total of 190,000 tokens, with attribute and text indexes created).
This is essentially what I am trying to do, as a simple XPath expression:
declare default element namespace "http://www.tei-c.org/ns/1.0";
//w[@type = "NN"][(subsequence(preceding::w, 1, 3), subsequence(following::w, 1, 3))/@type = "ADJA"]
Since tokens may be interwoven with markup, I have to use the preceding:: and following:: axes rather than the sibling axes.
A simple XQuery returning all matches including their context would look like this:
declare default element namespace "http://www.tei-c.org/ns/1.0";

let $window := 3
let $matches := //w[@type = "NN"]
return
  for $m in $matches
  let $pre := subsequence($m/preceding-sibling::w, 1, $window)
  let $next := subsequence($m/following-sibling::w, 1, $window)
  return
    if (($pre, $next)[@type = "ADJA"]) then
      <conc>
        <pre>{$pre}</pre>
        <match>{$m}</match>
        <next>{$next}</next>
      </conc>
    else ()
With $matches being a sequence of ca. 50,000 elements, a FLWOR expression is a bit too costly, I fear. Limiting $matches to about 1,000 items runs in 5600 ms (returning 154 items), but performance degrades rapidly beyond that (not to mention with a larger distance). So, my question is: is there a way to improve performance on operations like these without changing the input document?
I'd be glad to provide my dataset off list, if this helps.
Thanks, Daniel
Hello Daniel,
I don't have much time right now, but here are a few pointers to get you started. I didn't test any of this, so take it with a grain of salt.
I suspect your subsequence solution does not perform optimally, as I would guess that a new sequence really is created for each call. So for 50,000 matches you have to create 100,000 new sequences, which is rather costly. Instead, I would recommend using position() to compare the element positions and get your window that way. This can operate directly on your data.
Also, did you know that there is a window expression in XQuery 3 (see http://www.w3.org/TR/xquery-30/#id-windows for more)? Looks like an optimal use case here and should also perform much better than subsequences.
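For readers who have not used the window clause, here is a minimal sketch of its two flavours on plain integers. It is only an illustration (the input ranges and the variable names $tumbling and $sliding are made up for this example), not a statement about how the token query has to be written.

```xquery
(: Tumbling windows: non-overlapping chunks of three items. :)
let $tumbling := (
  for tumbling window $w in 1 to 6
    start at $s when true()
    end at $e when $e - $s + 1 = 3
  return string-join($w ! string(.), " ")
)

(: Sliding windows: one window per start position, so they overlap.
   "only end" discards windows that never reach three items. :)
let $sliding := (
  for sliding window $w in 1 to 5
    start at $s when true()
    only end at $e when $e - $s + 1 = 3
  return string-join($w ! string(.), " ")
)

(: Yields "1 2 3", "4 5 6", "--", "1 2 3", "2 3 4", "3 4 5" :)
return ($tumbling, "--", $sliding)
```

The start and end conditions see the positional variables ($s, $e), which is what makes a fixed window size easy to express.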
Hope this helps, Dirk
On 06/18/2015 12:39 PM, Schopper, Daniel wrote:
> [...]
Hi Daniel,
> //w[@type = "NN"][(subsequence(preceding::w, 1, 3), subsequence(following::w, 1, 3))/@type = "ADJA"]
The preceding axis can be quite costly. You could try to use preceding-sibling and following-sibling instead (if it makes sense in your scenario). Another option could be to replace the subsequence function with a predicate: [position() = 1 to 3].
> I’d be glad to provide my dataset off list, if this helps.
Feel free to do so.
Christian
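As a sketch of the predicate variant suggested above: a positional predicate placed directly on a reverse-axis step counts from the nearest node, whereas subsequence(preceding::w, 1, 3) takes the first three nodes in document order, i.e. the ones farthest away. The sample sentence and the $doc variable below are hypothetical and only there to make the query self-contained.

```xquery
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Hypothetical sample; real data would come from the database. :)
let $doc := document {
  <text>
    <s><w type="ADJA">kleines</w> <w type="NN">Haus</w></s>
    <s><w type="VVFIN">steht</w> <w type="APPR">am</w> <w type="NN">Wald</w></s>
  </text>
}
return $doc//w[@type = "NN"][
  (: the three nearest tokens on either side :)
  (preceding::w[position() <= 3], following::w[position() <= 3])/@type = "ADJA"
]
```

On this sample only "Haus" is selected. With the subsequence form, "Wald" would be selected as well, because its first three preceding tokens in document order include "kleines", which is four tokens away.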
Christian, Dirk, thank you so much for your quick replies! Admittedly, I had been totally unaware of the window expression in XQuery until now, so thanks for the excellent hint. Having rewritten my previous query with it, I have to say I'm stunned by the performance on the current release of BaseX, which is nothing less than amazing. E.g. looking for adjectives preceding nouns within a distance of 3 in my 185,000-token test set, the following query returns around 8000 items in 860 ms.
declare default element namespace "http://www.tei-c.org/ns/1.0";

let $window := 3
let $toks := //w
return
  for tumbling window $w in $toks
    start at $s when true()
    end at $e when $e - $s + 1 = $window
  let $t1 := $w[@type = "ADJA"]
  let $t2 := $w[@type = "NN"]
  where (some $x in $t1, $y in $t2 satisfies $x << $y)
  return <conc>{$w}</conc>
This looks very promising, or simply put: you made my day :) Best, Daniel
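One thing to keep in mind with the query above: tumbling windows do not overlap, so the tokens are cut into fixed chunks of three, and an adjective at the end of one chunk is never compared with a noun at the start of the next. A possible variant, sketched here under the assumption that every adjective should open its own window, uses a sliding window instead. The sample tokens are hypothetical, and nothing is claimed about how this performs on the full dataset.

```xquery
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Hypothetical sample; in practice $toks would be //w from the database. :)
let $toks := (
  <w type="ADV">Dort</w>,
  <w type="VVFIN">stehen</w>,
  <w type="ADJA">kleine</w>,
  <w type="NN">Häuser</w>,
  <w type="APPRART">am</w>,
  <w type="NN">Wald</w>
)
let $window := 3
for sliding window $w in $toks
  (: open a window at every adjective ... :)
  start $first at $s when $first/@type = "ADJA"
  (: ... and close it $window tokens later, or at the end of the input :)
  end at $e when $e - $s = $window
where tail($w)/@type = "NN"
return <conc>{$w}</conc>
```

On this sample, chunks of three would separate "kleine" from "Häuser" and report nothing, while the sliding variant returns one conc element starting at "kleine".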
-----Original Message-----
From: Christian Grün [mailto:christian.gruen@gmail.com]
Sent: Thursday, 18 June 2015 18:49
To: Schopper, Daniel
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] performance of preceding/following axis