Hi Hans-Jürgen,
it would be kind if you could check my understanding of Free Text @ BaseX.
You mean Full Text?
(1) Tokenization ignores element borders.
You could say so. A node is atomized before it is tokenized, much as if fn:data had been applied to it, so element boundaries are already gone by the time the tokenizer runs. For example, the following function call returns 'hi' and 'there':
ft:tokenize(<div><b>H</b>i there</div>)
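To make the two steps visible, here is a small sketch that returns the atomized value next to the tokens. It assumes the Full-Text Module is available under the usual ft prefix; the variable name is only for illustration.

```xquery
let $node := <div><b>H</b>i there</div>
return (
  (: atomization joins all descendant text nodes into one value: "Hi there" :)
  data($node),
  (: tokenization works on that value; tokens are case-normalized by default :)
  ft:tokenize($node)
)
```

The element boundary between "H" and "i" never reaches the tokenizer, which is why the first token is 'hi' rather than 'h' and 'i'.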
(2) Function ft:search can only find individual text nodes; it is not possible to apply scope "phrase" or "all words" beyond the boundaries of an individual text node.
Exactly. This could possibly change in a future version. Maybe you’ve seen the issue that I have mentioned in a previous mailing list thread [1]. I haven’t got any feedback on the proposal yet.
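The boundary effect can be approximated without a database. The sketch below does not call ft:search; it uses plain "contains text" expressions, once on an element and once on its individual text nodes, which is roughly the view an index-based search has. The sample element is made up for the example.

```xquery
let $p := <p>hello <b>world</b></p>
return (
  (: the element is atomized to "hello world", so the phrase can match :)
  $p contains text 'hello world',
  (: each text node ("hello " and "world") is checked on its own,
     so the phrase cannot match across the node boundary :)
  $p//text() contains text 'hello world'
)
```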
(3) The unit "sentence" (as for example used in the qualifier "same sentence") is exclusively defined by the occurrences of "." (dot) characters.
(4) The unit "paragraph" (as for example used in the qualifier "same paragraph") is not delimited - "same paragraph" always applies.
Unit detection is very basic. For Western languages, sentences (3) are currently delimited only by dots, exclamation marks and question marks, and paragraphs (4) only by newlines [2].
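A way to try this out is sketched below. The sample string is an assumption for the example: two sentences separated by a dot and a space, then a newline (written as the character reference &#10;) followed by a third sentence. If unit detection works as described, each of the three expressions should return true, but it is worth running them against your own BaseX version.

```xquery
let $text := 'Hello there. Goodbye now.' || '&#10;' || 'Next paragraph.'
return (
  (: 'hello' and 'goodbye' are separated by a dot :)
  $text contains text 'hello' ftand 'goodbye' different sentence,
  (: 'hello' and 'there' are not separated by any delimiter :)
  $text contains text 'hello' ftand 'there' same sentence,
  (: 'hello' and 'next' are separated by a newline :)
  $text contains text 'hello' ftand 'next' different paragraph
)
```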
basex "'base.x' contains text 'base x' different sentence"
false
Surprising indeed; I will look at that [3].
Thanks and cheers,
Christian

[1] https://github.com/BaseXdb/basex/issues/2079
[2] https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa...
[3] https://github.com/BaseXdb/basex/issues/2088