Hi,

I’m trying to use BaseX for linguistic queries on a TEI document containing annotated tokens (i.e. tei:w elements with attributes). I’m specifically interested in distance queries that search for combinations of token features within a given window (e.g. all nouns that have an adjective ending in ‘lein’ within a distance of 3). In theory, this is easy to formulate as an XPath or XQuery expression, but performance is poor once the dataset gets a bit larger (in my case, a total of 190,000 tokens in my test document, with attribute and text indexes created).

 

This is essentially what I am trying to do as a simple XPath expression:

 

declare default element namespace "http://www.tei-c.org/ns/1.0";

(: positional predicates select the three nearest tokens on each side :)
//w[@type = "NN"][(preceding::w[position() <= 3], following::w[position() <= 3])/@type = "ADJA"]

 

Since tokens may be interwoven with other markup, I have to use the preceding:: and following:: axes.
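
As a side note on selecting a window with these axes: a positional predicate inside a reverse-axis step counts outwards from the context node, while the result of a complete path expression is delivered in document order. A minimal illustration on toy data (the <s> wrapper and the @n attributes are made up for this example, and no namespace is used):

```xquery
(: Toy data: six tokens numbered in document order (hypothetical). :)
let $doc :=
  <s>
    <w n="1"/><w n="2"/><w n="3"/><w n="4"/><w n="5"/><w n="6"/>
  </s>
let $ctx := $doc/w[@n = "6"]
return (
  (: predicate inside the step: the three nearest preceding tokens, "3 4 5" :)
  string-join($ctx/preceding::w[position() <= 3]/@n, " "),
  (: subsequence over the finished path result, which is in document order: "1 2 3" :)
  string-join(subsequence($ctx/preceding::w, 1, 3)/@n, " ")
)
```

So for a "nearest n tokens" window on the preceding side, the predicate form is the one that matches the intent.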

 

A simple XQuery returning all matches including their context would look like this:

 

declare default element namespace "http://www.tei-c.org/ns/1.0";

let $window := 3
let $matches := //w[@type = "NN"]
return
  for $m in $matches
  (: positional predicates pick the $window nearest siblings on each side :)
  let $pre := $m/preceding-sibling::w[position() <= $window]
  let $next := $m/following-sibling::w[position() <= $window]
  return
    if (($pre, $next)[@type = "ADJA"])
    then
      <conc>
        <pre>{$pre}</pre>
        <match>{$m}</match>
        <next>{$next}</next>
      </conc>
    else ()

 

With $matches being a sequence of ca. 50,000 elements, I fear a FLWOR expression is a bit too costly; limiting $matches to ~1,000 items runs in 5600 ms (returning 154 items), but performance degrades rapidly beyond that (not to mention with a larger distance).

So, my question is: Is there a way to improve performance on operations like these (without resorting to changing the input document)?
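
One direction that might be worth testing, sketched here only as an assumption and not measured: bind all tokens to a sequence once and address the window by position in that sequence, instead of walking an axis again for every match. The toy <p> input below is hypothetical and merely stands in for the real document; whether BaseX actually evaluates this faster on 190,000 tokens is untested.

```xquery
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Toy input standing in for the real TEI document (hypothetical data). :)
let $doc :=
  <p>
    <w type="ART">das</w>
    <w type="ADJA">kleine</w>
    <hi><w type="NN">Haus</w></hi>
    <w type="VVFIN">steht</w>
  </p>
let $window := 3
(: Bind all tokens once, in document order, regardless of nesting. :)
let $tokens := $doc//w
for $m at $i in $tokens
(: Address the window by position instead of walking an axis per match. :)
let $start := max((1, $i - $window))
let $pre := subsequence($tokens, $start, $i - $start)
let $next := subsequence($tokens, $i + 1, $window)
where $m/@type = "NN" and ($pre, $next)/@type = "ADJA"
return
  <conc>
    <pre>{$pre}</pre>
    <match>{$m}</match>
    <next>{$next}</next>
  </conc>
```

Because the window is taken from the flat token sequence, it also crosses intervening markup such as the <hi> wrapper, without changing the input document.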

 

I’d be glad to provide my dataset off-list if that helps.

 

Thanks,

Daniel