Re: [basex-talk] Free Text - understanding

14 Apr 2022


Hi Hans-Jürgen,
...
> it would be kind if you could check my understanding of Free Text @ BaseX.
You mean Full Text?
...
> (1) Tokenization ignores element borders.
You could say so. Before tokenization, a node that’s to be tokenized
will be atomized, similar to when you apply fn:data to it. For
example, the following function call returns 'hi' and 'there':
ft:tokenize(<div><b>H</b>i there</div>)
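
To make the atomization step visible, here is a small sketch that shows the atomized value next to the tokens. It only uses fn:data and the ft:tokenize call from above; the variable name $node is just for illustration.

```xquery
let $node := <div><b>H</b>i there</div>
return (
  (: atomized value: the element borders are gone, one string "Hi there" :)
  data($node),
  (: tokens of the atomized value; as stated above: 'hi' and 'there' :)
  ft:tokenize($node)
)
```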
...
> (2) Function ft:search can only find individual text nodes; it is not possible to apply scope "phrase" or "all words" beyond the boundaries of an individual text node.
Exactly. This could possibly change in a future version. Maybe you’ve
seen the issue that I have mentioned in a previous mailing list thread
[1]. I haven’t got any feedback on the proposal yet.
...
> (3) The unit "sentence" (as for example used in the qualifier "same sentence") is exclusively defined by the occurrences of "." (dot) characters.
> (4) The unit "paragraph" (as for example used in the qualifier "same paragraph") is not delimited - "same paragraph" always applies.
Unit detection is very basic. For Western languages, sentence
detection (3) is currently limited to dots, exclamation marks and
question marks, and paragraph detection (4) to newlines [2].
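
If you want to probe the unit detection yourself, queries along these lines may help. This is only a sketch: the input strings are made up, &#10; is the character reference for a newline, and I am deliberately not stating the results, since they depend on the version you run.

```xquery
(: sentence units: the two search terms are separated by a dot :)
'One cat. Two dogs.' contains text 'cat' ftand 'dogs' same sentence,
'One cat. Two dogs.' contains text 'cat' ftand 'dogs' different sentence,
(: paragraph units: the two search terms are separated by a newline :)
'One cat.&#10;Two dogs.' contains text 'cat' ftand 'dogs' same paragraph,
'One cat.&#10;Two dogs.' contains text 'cat' ftand 'dogs' different paragraph
```

Each line yields one boolean, so the output is a sequence of four values in the order of the queries.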
...
> basex "'base.x' contains text 'base x' different sentence"
> false
Surprising indeed; I will look at that [3].
Thanks and cheers,
Christian
[1] https://github.com/BaseXdb/basex/issues/2079
[2] https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa...
[3] https://github.com/BaseXdb/basex/issues/2088
