Re: [basex-talk] More Diacritic Questions

23 Nov 2014


What does "without diacritics" mean?
If it's equivalent to running
normalize-unicode(replace(normalize-unicode($token,'NFKD'),'\p{Mn}',''),'NFKC')
on the tokens, we shouldn't expect all the diacritics to go away: a
character like U+00F8 ("latin small letter o with stroke"), despite the
descriptive name, doesn't have a decomposition (= it's a full letter).
(Though both Chris' dotted letters do decompose so their dots should
go away under that scheme.)
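For anyone who wants to check those claims outside BaseX, here is a minimal sketch of the same decompose / strip / recompose scheme in Python, using only the standard unicodedata module. The function name strip_diacritics is mine, purely for illustration, and this is not a statement about how BaseX implements its "diacritics insensitive" option.

```python
import unicodedata


def strip_diacritics(token: str) -> str:
    """Mirror the XQuery expression: NFKD, drop category-Mn marks, then NFKC.

    Hypothetical helper for illustration only; not BaseX's implementation.
    """
    decomposed = unicodedata.normalize("NFKD", token)
    # \p{Mn} in the XQuery regex corresponds to general category "Mn"
    # (nonspacing marks) here.
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFKC", without_marks)


if __name__ == "__main__":
    for word in ("athgabáil", "cúachṡnaidm", "\u1e1f", "\u00f8"):
        print(word, "->", strip_diacritics(word))
    # An empty string means the character has no decomposition mapping.
    print(repr(unicodedata.decomposition("\u00f8")))
```

Under this scheme the dotted letters U+1E61 and U+1E1F lose their dots, while U+00F8 comes back unchanged because it has no decomposition mapping.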
The documentation says "If diacritics is insensitive, characters with
and without diacritics (umlauts, characters with accents) are declared
as identical." without defining diacritics. So far as I can tell from
a quick check, the XQFT spec doesn't say what a diacritic is, either;
it just says there's a collation defined for the purpose and maybe it
can't always cope algorithmically.
I'd like to advocate for an equivalent to the "decomposed normal form,
strip the non-spacing modifier characters, recompose to composed
normal form" equivalence because at least that one is plausibly well
understood.  If that's not it, can we get the collation in the
documentation somewhere?
-- Graydon, who hopes that made sense
On Sun, Nov 23, 2014 at 1:29 PM, Chris Yocum cyocum@gmail.com wrote:
...
Hi Christian
Thanks for letting me know!  I also need ḟ U+1E1F.
All the best,
Chris
On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote:
...
Hi Chris,
Thanks for the observation. I can confirm that some characters like ṡ
(U+1E61) do not seem to be properly normalized yet. I have added an issue
for that [1], and I hope I will soon have it fixed.
If you encounter some other surprising behavior like this, feel free to tell us.
Best,
Christian
[1] https://github.com/BaseXdb/basex/issues/1029
...
Hi Everyone,
I am rather confused again about diacritic handling in BaseX.  For
instance, with Full Text turned on a word like athgabáil will match
both athgabail and athgabáil with "diacritics insensitive" which is
what I would expect.  However, if I try to match a word like
cúachṡnaidm (with s with a dot above), it will not match without
"diacritics sensitive" turned on in the query itself.  I am rather
confused why it would match in the case of athgabáil but not in the
case of cúachṡnaidm.  Does anyone know why this is happening and how
to make it match like the other match?
All the best,
Chris
