What does "without diacritics" mean?
If it's equivalent to running
normalize-unicode(replace(normalize-unicode($token,'NFKD'),'\p{Mn}',''),'NFKC')
on the tokens, we shouldn't expect all the diacritics to go away: characters like U+00F8 ("latin small letter o with stroke"), despite the descriptive name, don't have a decomposition (i.e. each is a full letter in its own right). (Though both of Chris's dotted letters, U+1E61 and U+1E1F, do decompose, so their dots should go away under that scheme.)
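To make that concrete, here is a rough Python sketch of the same three steps (NFKD, drop category Mn, NFKC). It is only an analogue of the XPath expression, not what BaseX does internally; the function name strip_marks is made up for illustration.

```python
import unicodedata


def strip_marks(token: str) -> str:
    """Hypothetical helper: NFKD-decompose, drop non-spacing marks, recompose to NFKC."""
    decomposed = unicodedata.normalize("NFKD", token)
    # Keep everything except characters in general category Mn (\p{Mn} in the regex).
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFKC", without_marks)


if __name__ == "__main__":
    # Escapes are used so the input is unambiguous regardless of file encoding.
    samples = [
        "athgab\u00e1il",            # a with acute: decomposes
        "c\u00faach\u1e61naidm",     # u with acute, s with dot above: both decompose
        "\u1e1f",                    # f with dot above: decomposes
        "\u00f8",                    # o with stroke: no decomposition, stays as-is
    ]
    for sample in samples:
        print(sample, "->", strip_marks(sample))
```

Under this scheme the acute and dotted letters fold to their base letters, while U+00F8 comes back unchanged because there is no combining mark to strip.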
The documentation says "If diacritics is insensitive, characters with and without diacritics (umlauts, characters with accents) are declared as identical." without defining diacritics. So far as I can tell from a quick check, the XQFT spec doesn't say what a diacritic is either; it just says there's a collation defined for the purpose and that maybe it can't always cope algorithmically.
I'd like to advocate for an equivalent to the "decomposed normal form, strip the non-spacing modifier characters, recompose to composed normal form" equivalence because at least that one is plausibly well understood. If that's not it, can we get the collation in the documentation somewhere?
-- Graydon, who hopes that made sense
On Sun, Nov 23, 2014 at 1:29 PM, Chris Yocum cyocum@gmail.com wrote:
Hi Christian,
Thanks for letting me know! I also need ḟ U+1E1F.
All the best, Chris
On Sun, Nov 23, 2014 at 06:22:39PM +0100, Christian Grün wrote:
Hi Chris,
Thanks for the observation. I can confirm that some characters like ṡ (U+1E61) do not seem to be properly normalized yet. I have added an issue for that [1], and I hope to have it fixed soon.
If you encounter some other surprising behavior like this, feel free to tell us.
Best, Christian
[1] https://github.com/BaseXdb/basex/issues/1029
Hi Everyone,
I am rather confused again about diacritic handling in BaseX. For instance, with Full Text turned on, a word like athgabáil will match both athgabail and athgabáil with "diacritics insensitive", which is what I would expect. However, if I try to match a word like cúachṡnaidm (with s with a dot above), it will not match without "diacritics sensitive" turned on in the query itself. I do not understand why it would match in the case of athgabáil but not in the case of cúachṡnaidm. Does anyone know why this is happening and how to make it match like the other word?
All the best, Chris