Re: [basex-talk] Question about ft:normalize

23 Nov 2021


It’s US-ASCII (7 bit).
Tim Thompson timathom@gmail.com wrote on Tue, 23 Nov 2021, 17:07:
...
Thanks, Christian. What is the effective character set used when
diacritics are removed? Latin-1?
Tim
--
Tim A. Thompson (*he, him*)
Librarian for Applied Metadata Research
Yale University Library
www.linkedin.com/in/timathompson
timothy.thompson@yale.edu
On Mon, Nov 22, 2021 at 2:53 PM Christian Grün christian.gruen@gmail.com
wrote:
...
Hi Tim,
...
I have a question about the BaseX ft:normalize function. What kind of
Unicode normalization is performed by this function, and how might it be
implemented using standard XPath functions?
The function is based on a custom BaseX tokenization, which includes
normalization of case, removal of diacritics and (if enabled)
language-based stemming. It would be rather challenging to implement
the behavior with standard XPath (that’s mostly why we introduced
ft:tokenize and ft:normalize). If you are looking for a starting
point, you could begin with the FtTokenize Java class [1].
Hope this helps,
Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
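
As a rough illustration of the case normalization and diacritic removal described above, here is a minimal Python sketch. It is neither the BaseX implementation nor XPath: the function name approx_normalize is hypothetical, and it relies on Unicode NFD decomposition instead of the BaseX tokenizer, so its results may differ from ft:normalize. Stemming is not covered at all.

```python
import unicodedata


def approx_normalize(text: str) -> str:
    """Approximate case folding and diacritic removal (assumption: NFD-based)."""
    # Lower-case first, then decompose so that base letters and
    # combining diacritics become separate code points.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop nonspacing combining marks (Unicode general category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    print(approx_normalize("Crème Brûlée"))  # creme brulee
```

Letters without a canonical decomposition, such as "ø" or "ß", pass through this sketch unchanged, so its output is not guaranteed to be 7-bit ASCII.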