I just found a mapping table proposed by John Cowan [1]. It's already pretty old, so it doesn't cover newer Unicode versions, but it's surely better than our current solution.
[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html
On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Graydon,
I just had a look. In BaseX, "without diacritics" can be explained by a single, glorious mapping table [1].
It's quite obvious that there are just too many cases which are not covered by this mapping. We introduced this solution at the very beginning of our full-text implementation, and I am surprised that it has survived for such a long time, probably because it was sufficient for most use cases our users have come across so far.
However, I would like to extend the current solution with something more general that is still more efficient than full Unicode normalization (performance-wise, the current mapping is probably difficult to beat). As you already indicated, the XQFT spec left it to the implementers to decide what diacritics are.
I'd like to advocate for an equivalent to the "decomposed normal form, strip the non-spacing modifier characters, recompose to composed normal form" equivalence because at least that one is plausibly well understood.
Shame on me; could you give me some quick tutoring on what this would mean? Would accents and dots from German umlauts, and other characters in the range of \C380-\C3BF, be stripped as well by that recomposition? And just in case you know more about it: what happens with characters like the German "ß", which is typically rewritten to two characters ("ss")?
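To make the quoted proposal concrete, here is a minimal sketch of how I read it, written in Python only for brevity. It is not the BaseX code; the function name strip_diacritics is made up for illustration, and I am assuming that "non-spacing modifier characters" means the Unicode general category Mn.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: NFD, drop non-spacing marks, then NFC."""
    # Step 1: canonical decomposition, e.g. "ü" becomes "u" + U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop every character whose general category is Mn
    # (assumption: this is what "non-spacing modifier" refers to).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Step 3: recompose whatever is left into composed normal form.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("Grün"))    # Grun
print(strip_diacritics("Straße"))  # Straße
```

Under these assumptions, umlauts and accented letters lose their marks, whereas "ß" is left alone because it has no canonical decomposition; rewriting it to "ss" would need a separate step such as case folding.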
Thanks, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...