Hi Graydon,
I just had a look. In BaseX, "without diacritics" can be explained by a single, glorious mapping table [1].
It's quite obvious that there are just too many cases which are not covered by this mapping. We introduced this solution in the very early days of our full-text implementation, and I am surprised that it has survived for such a long time, probably because it has been sufficient for most use cases our users have come across so far.
However, I would like to extend the current solution with something more general and, still, more efficient than full Unicode normalizations (performance-wise, the current mapping is probably difficult to beat). As you already indicated, the XQFT spec left it to the implementers to decide what diacritics are.
> I'd like to advocate for an equivalent to the "decomposed normal form, strip the non-spacing modifier characters, recompose to composed normal form" equivalence because at least that one is plausibly well understood.
Shame on me; could you give me some quick tutoring on what this would mean? Would accents and dots from German umlauts, and other characters in the range of \C380-\C3BF, be stripped as well by that recomposition? And just in case you know more about it: what happens with characters like the German "ß", which is typically rewritten to two characters ("ss")?
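For reference, here is how I read the three quoted steps, as a minimal Java sketch based on java.text.Normalizer. This is not BaseX code: the class name DiacriticsSketch and the method strip are hypothetical, and taking "non-spacing modifier characters" to mean the Unicode category Mn (\p{Mn}) is my assumption.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of BaseX.
final class DiacriticsSketch {
    /** Decompose to NFD, drop non-spacing marks (category Mn), recompose to NFC. */
    static String strip(String input) {
        // Step 1: canonical decomposition, e.g. U+00F6 becomes 'o' + U+0308.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: remove all non-spacing marks (assumed reading of the quoted text).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // Step 3: canonical composition of whatever is left.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of the file encoding:
        // "Gr\u00F6\u00DFe" is the German word with o-umlaut and sharp s.
        System.out.println(strip("Gr\u00F6\u00DFe, caf\u00E9"));
    }
}
```

If I read the Unicode data correctly, the umlaut dots and accents would be stripped by this, because those characters have a canonical decomposition into a base letter plus a combining mark. Characters without such a decomposition, like "ß" (U+00DF) or "ø" (U+00F8), would pass through unchanged, so a rewrite of "ß" to "ss" would still need a separate rule.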
Thanks, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...