Hi,
I’m trying to use BaseX for linguistic queries on a TEI document containing annotated tokens (i.e. tei:w elements with attributes). I’m specifically interested in distance queries that search for combinations
of token features within a given window (e.g. all nouns that have an adjective ending in ‘lein’ within a distance of 3). In theory, this is easy to formulate as an XPath or XQuery expression, but performance is poor once the dataset gets a bit
larger (in my case, a total of 190,000 tokens in my test document, with attribute and text indexes created).
This is essentially what I am trying to do, as a simple XPath expression:
declare default element namespace "http://www.tei-c.org/ns/1.0";
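(: the positional predicate on the reverse axis picks the 3 nearest preceding tokens; subsequence(preceding::w, 1, 3) would return the first 3 in document order :)
//w[@type = "NN"][(preceding::w[position() <= 3], subsequence(following::w, 1, 3))/@type = "ADJA"]
Since tokens may be interwoven with other markup, I have to use the preceding:: and following:: axes here.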
A simple XQuery returning all matches including their context would look like this:
declare default element namespace "http://www.tei-c.org/ns/1.0";
let $window := 3
let $matches := //w[@type = "NN"]
return
  for $m in $matches
  (: positional predicate on the reverse axis selects the $window nearest preceding tokens :)
  let $pre := $m/preceding-sibling::w[position() <= $window]
  let $next := subsequence($m/following-sibling::w, 1, $window)
  return
    if (($pre, $next)[@type = "ADJA"])
    then
      <conc>
        <pre>{$pre}</pre>
        <match>{$m}</match>
        <next>{$next}</next>
      </conc>
    else ()
With $matches being a sequence of about 50,000 elements, a FLWOR expression is a bit too costly, I fear; limiting $matches to ~1,000 items completes within 5600 ms (returning 154 items), but performance degrades rapidly beyond that
(not to mention when setting a larger distance).
So, my question is: Is there a way to improve performance on operations like these (without resorting to changing the input document)?
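One direction that might be worth testing, as a minimal and unbenchmarked sketch: materialize all tokens once and address the window by position, so that no axis has to be re-evaluated for every match. It assumes that w elements are never nested, so that document order equals token order; $tokens and $i are names introduced only for this illustration.

```xquery
declare default element namespace "http://www.tei-c.org/ns/1.0";

let $window := 3
(: all tokens in document order, collected once (assumes no nested w elements) :)
let $tokens := //w
for $m at $i in $tokens
where $m/@type = "NN"
(: subsequence ignores positions below 1, so the window is clipped at the start :)
let $pre  := subsequence($tokens, $i - $window, $window)
let $next := subsequence($tokens, $i + 1, $window)
where ($pre, $next)/@type = "ADJA"
return
  <conc>
    <pre>{$pre}</pre>
    <match>{$m}</match>
    <next>{$next}</next>
  </conc>
```

Unlike the sibling-axis query, this treats tokens separated by intervening markup as neighbours, which is closer to what the preceding::/following:: XPath is meant to express.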
I’d be glad to provide my dataset off-list, if this helps.
Thanks,
Daniel