Dear BaseX people,

I would be grateful if you could check my understanding of XQuery Full Text in BaseX.

(1) Tokenization ignores element borders.

If this is correct, I suggest documenting it, because the specification states the opposite as the default; see

https://www.w3.org/TR/xpath-full-text-10/#TokenizationSec

"In the absence of an implementation-defined way to differentiate, element markup (start tags, end tags, and empty-element tags) creates token boundaries."

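One way to check this is a query along the following lines. It is only a sketch: the test document is made up for illustration and splits a single word across inline elements. If element markup created token boundaries, the first comparison would be expected to fail and the second to succeed; if markup is ignored, the reverse.

```xquery
(: Hypothetical test document: one word split across inline elements. :)
let $doc := <doc><b>B</b>ase<i>X</i></doc>
return (
  (: One token, which can only match if markup does not split the word. :)
  $doc contains text 'basex',
  (: Three-token phrase, which can only match if each tag ends a token. :)
  $doc contains text 'b ase x'
)
```
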
(2) The function ft:search can only find individual text nodes; it is not possible to apply the scope "phrase" or "all words" beyond the boundaries of an individual text node. So, for example, given a document

<doc>
  <t1>Stand </t1>
  <t2>der Information. Siehe unten.</t2>
</doc>

there is no way of searching for "Stand der Information" *and* obtaining information about the location of the match (in other words, of searching via ft:search).
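
To make the comparison concrete, here is a sketch of the two queries I have in mind. It rests on assumptions: the database name 'ftdemo' is hypothetical, the database is assumed to contain the document above, and it is assumed to have been created with the full-text index enabled (FTINDEX). In BaseX versions before 10, db:get is called db:open.

```xquery
(: Assumption: 'ftdemo' is a hypothetical database holding the <doc> above,
   created with the full-text index enabled (FTINDEX). :)

(: Phrase test against the string value of <doc>, which spans both text nodes. :)
db:get('ftdemo')/doc contains text 'Stand der Information',

(: Index lookup in phrase mode; results are individual text nodes. :)
ft:search('ftdemo', 'Stand der Information', map { 'mode': 'phrase' })
```

If my understanding in (2) is right, the first expression can match, while the second one cannot return a node for this phrase, because no single text node contains all three words.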

(3) The unit "sentence" (as used, for example, in the qualifier "same sentence") is defined exclusively by the occurrences of "." (dot) characters. In particular, it is unrelated to text node boundaries. For example:

$doc contains text "Stand der Information siehe" same sentence yields false.

$doc contains text "Stand der Information" same sentence yields true.
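
For reference, a self-contained version of these two tests, as a sketch that assumes $doc is bound to the example document from (2), built here in memory:

```xquery
(: Assumption: $doc is the example document from (2), constructed in memory. :)
let $doc :=
  <doc>
    <t1>Stand </t1>
    <t2>der Information. Siehe unten.</t2>
  </doc>
return (
  (: Phrase crosses the dot after "Information". :)
  $doc contains text 'Stand der Information siehe' same sentence,
  (: Phrase crosses the t1/t2 boundary but stays before the dot. :)
  $doc contains text 'Stand der Information' same sentence
)
```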

(4) The unit "paragraph" (as used, for example, in the qualifier "same paragraph") is not delimited at all, so "same paragraph" always applies.

A check would be highly appreciated!

Kind regards,

Hans-Jürgen

PS: I think there is a bug concerning "different sentence":

basex "'base.x' contains text 'base x' same sentence"
false

basex "'base.x' contains text 'base x' different sentence"
false

basex "'base x' contains text 'base x' same sentence"
true

basex "'base x' contains text 'base x' different sentence"
false

PPS: Thank you very much for the excellent implementation of Full Text. For several years, it has been in productive use in a mission-critical service that maps format markup to semantic markup.