Hi Chris,
I am glad to report that the latest snapshot of BaseX [1] now provides much better support for diacritical characters.
Please find more details in my next mail to Graydon.
Hope this helps, Christian
[1] http://files.basex.org/releases/latest/
On Sun, Nov 23, 2014 at 11:56 PM, Graydon Saunders graydonish@gmail.com wrote:
Hi Christian --
That is indeed a glorious table! :)
Unicode defines whether or not a character has a decomposition; so e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a combining acute accent.) I think the presence of a decomposition is a recoverable character property in Java. (it is in Perl. :)
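As an illustration (a sketch of the idea, not BaseX code), Java exposes exactly this via java.text.Normalizer: normalizing to NFD makes the decomposition visible, and isNormalized tells you whether a string had one at all:

```java
import java.text.Normalizer;

public class DecompositionCheck {
    public static void main(String[] args) {
        String composed = "\u00E9"; // e with acute, U+00E9
        String nfd = Normalizer.normalize(composed, Normalizer.Form.NFD);

        // NFD splits it into the base letter and the combining acute accent
        System.out.println(nfd.length()); // 2
        System.out.printf("U+%04X U+%04X%n",
            (int) nfd.charAt(0), (int) nfd.charAt(1)); // U+0065 U+0301

        // A character "has a decomposition" iff NFD changes the string
        System.out.println(Normalizer.isNormalized(composed,
            Normalizer.Form.NFD)); // false
    }
}
```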
U+0386, "Greek Capital Letter Alpha With Tonos", has a decomposition, so the combining acute accent -- U+0301 again! -- would strip.
If one is going to go all strict-and-high-church Unicode, a "diacritic" is "anything that decomposes into a combining (that is, non-spacing) character code point when considering the decomposed normal form" (NFD or NFKD in the Unicode spec). This would NOT convert U+00DF, "latin small letter sharp s", into "ss", because per the Unicode Consortium, sharp s is a full letter rather than a modified s. (Same with thorn not decomposing into "th", and so on for other characters that are considered full letters, which can get surprising with the Scandinavian dotted A's and such.) The disadvantage is that users of BaseX might expect that comparison to work; the advantage is that the arbitrarily large number of arguments, headaches, and natural-language edge cases can be shifted off to the Unicode guys by saying "we're following the Unicode character category rules".
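In Java terms, that "decompose, strip the non-spacing marks, recompose" equivalence could be sketched like this (a sketch under the strict-Unicode reading, not a proposal for BaseX's actual implementation):

```java
import java.text.Normalizer;

public class StripDiacritics {
    // Decompose to NFD, drop combining marks (Unicode category Mn),
    // then recompose to NFC.
    static String stripDiacritics(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = nfd.replaceAll("\\p{Mn}+", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("\u00E9")); // e
        System.out.println(stripDiacritics("\u0386")); // U+0391, plain capital alpha
        System.out.println(stripDiacritics("\u00DF")); // unchanged: no decomposition
    }
}
```

Note that the sharp s falls straight through, as described above: it has no NFD decomposition, so there is nothing to strip.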
It also gives you something that can be pointed to as an explanation, and it works like the existing normalize-unicode function. That is not the same as saying it's easy to understand, but it's something.
How to do it efficiently, well, my knowledge of Java would probably fit on the bottom of your shoe. On the plus side, Java regular expressions support the \p{...} Unicode character category syntax, so the data has to be in there somewhere. I'd think there's an efficient way to load the huge horrible table once and then filter the characters by property: if a character has a decomposition, keep only the members of the decomposition for which something like Character.isUnicodeIdentifierStart() returns true. That comes to mind as something that might work.
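One way to avoid the regex machinery entirely (a sketch, and only one of several property-based filters one could choose) is to walk the NFD form code point by code point and drop everything whose Unicode general category is non-spacing mark, which reuses the JDK's built-in character table rather than loading one's own:

```java
import java.text.Normalizer;

public class CodepointFilter {
    // Single pass over the NFD form: keep every code point that is not
    // a non-spacing mark (general category Mn), then recompose to NFC.
    static String stripMarks(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(nfd.length());
        nfd.codePoints()
           .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
           .forEach(sb::appendCodePoint);
        return Normalizer.normalize(sb.toString(), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("na\u00EFve caf\u00E9")); // naive cafe
    }
}
```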
Did that make sense?
-- Graydon
On Sun, Nov 23, 2014 at 5:19 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Graydon,
I just had a look. In BaseX, "without diacritics" can be explained by this single, glorious mapping table [1].
It's quite obvious that there are just too many cases that are not covered by this mapping. We introduced this solution in the very beginning of our full-text implementation, and I am surprised that it has survived for such a long time, probably because it was sufficient for most of the use cases our users have come across so far.
However, I would like to extend the current solution with something more general and, still, more efficient than full Unicode normalizations (performance-wise, the current mapping is probably difficult to beat). As you already indicated, the XQFT spec left it to the implementers to decide what diacritics are.
I'd like to advocate for an equivalent to the "decomposed normal form, strip the non-spacing modifier characters, recompose to composed normal form" equivalence because at least that one is plausibly well understood.
Shame on me; could you give me some quick tutoring on what this would mean?… Would accents and dots from German umlauts, and other characters in the range of \C380-\C3BF, be stripped as well by that recomposition? And just in case you know more about it: What happens with characters like the German "ß", which is typically rewritten as two characters ("ss")?
Thanks, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...