On Mon, Nov 24, 2014 at 1:13 AM, Christian Grün <christian.gruen@gmail.com> wrote:

Hi Chris,

I am glad to report that the latest snapshot of BaseX [1] now provides
much better support for diacritical characters.

Please find more details in my next mail to Graydon.

Hope this helps,
Christian

[1] http://files.basex.org/releases/latest/
__________________________________________

On Sun, Nov 23, 2014 at 11:56 PM, Graydon Saunders <graydonish@gmail.com> wrote:
> Hi Christian --
>
> That is indeed a glorious table! :)
>
> Unicode defines whether or not a character has a decomposition; so
> e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a
> combining acute accent.) I think the presence of a decomposition is a
> recoverable character property in Java. (it is in Perl. :)
>
> U+0386, "Greek Capital Letter Alpha With Tonos", has a decomposition, so the
> combining acute accent -- U+0301 again! -- would strip.
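
The two decompositions mentioned above can be checked from Java with the
standard java.text.Normalizer class. This is only a small sketch; the class
and method names (DecompositionDemo, nfdCodePoints) are made up for
illustration and are not part of BaseX.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

class DecompositionDemo {
    /** Returns the NFD code points of s, formatted like "U+0065 U+0301". */
    static String nfdCodePoints(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(nfdCodePoints("\u00E9")); // e with acute: U+0065 U+0301
        System.out.println(nfdCodePoints("\u0386")); // alpha with tonos: U+0391 U+0301
        System.out.println(nfdCodePoints("\u00DF")); // sharp s, no decomposition: U+00DF
    }
}
```
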
>
> If one is going to go all strict-and-high-church Unicode, "diacritic"
> is "anything that decomposes into a combining (that is, non-spacing)
> character code point when considering the decomposed normal form" (NFD
> or NFKD in the Unicode spec). This would NOT convert U+00DF, "latin
> small letter sharp s", into ss, because per the Unicode Consortium,
> sharp s is a full letter, rather than a modified s. (Same with thorn
> not decomposing into th, and so on for other things that are
> considered full letters, which can get surprising in the Scandinavian
> dotted A's and such.) The disadvantage is that users of BaseX might
> expect the compare to work; the advantage is that the arbitrarily large
> number of arguments, headaches, and natural language edge cases can be
> shifted off to the Unicode guys by saying "we're following the Unicode
> character category rules".
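
As a rough illustration of that strict definition, the sketch below
decomposes to NFD, drops everything in the general category Mn (non-spacing
mark), and recomposes to NFC. It assumes plain java.text.Normalizer and
java.util.regex behaviour; the class name StripDiacritics and the method
strip are hypothetical, and this is not how BaseX does it.

```java
import java.text.Normalizer;

class StripDiacritics {
    /** Decompose (NFD), remove non-spacing marks, recompose (NFC). */
    static String strip(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{Mn} matches the Unicode general category "Mark, nonspacing".
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(strip("Gr\u00FCn"));            // Grun
        System.out.println(strip("\u00C5ngstr\u00F6m"));   // Angstrom
        System.out.println(strip("Stra\u00DFe"));          // sharp s is left alone
    }
}
```

Using NFKD instead of NFD in the first step would additionally fold
compatibility characters such as ligatures, which may or may not be wanted.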
>
> It also gives something that can be pointed to as an explanation and
> works like the existing normalized-unicode functions. This is not the
> same as saying it's easy to understand but it's something.
>
> How you do it efficiently, well, my knowledge of Java would probably
> fit on the bottom of your shoe. On the plus side, Java regular
> expressions support the \p{...} Unicode character category syntax so
> it's got to be in there somewhere. I'd think there's an efficient way
> to load the huge horrible table once, and then filter the characters
> by property: if this character has got a decomposition, you then
> want the members of the decomposition that are base characters rather
> than combining marks. Character.isUnicodeIdentifierStart() returning
> true comes to mind as something that might work.
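
One way the "load the table once, then look up" idea could look is sketched
below. It is an assumption-laden sketch, not the BaseX implementation: it
covers only the Basic Multilingual Plane, it tests
Character.getType(...) == Character.NON_SPACING_MARK rather than the
isUnicodeIdentifierStart() check suggested above, and the names
DiacriticTable, BASE and strip are invented for the example.

```java
import java.text.Normalizer;

class DiacriticTable {
    /** Lookup table for all BMP chars, computed once when the class loads. */
    private static final char[] BASE = buildTable();

    private static char[] buildTable() {
        char[] table = new char[Character.MAX_VALUE + 1];
        for (int c = 0; c <= Character.MAX_VALUE; c++) {
            table[c] = (char) c; // default: character maps to itself
            if (Character.isSurrogate((char) c)) continue;
            String nfd = Normalizer.normalize(String.valueOf((char) c), Normalizer.Form.NFD);
            // Keep every member of the decomposition that is not a non-spacing mark.
            StringBuilder kept = new StringBuilder();
            for (int i = 0; i < nfd.length(); i++) {
                char d = nfd.charAt(i);
                if (Character.getType(d) != Character.NON_SPACING_MARK) kept.append(d);
            }
            if (kept.length() == nfd.length()) continue; // nothing was stripped
            String nfc = Normalizer.normalize(kept, Normalizer.Form.NFC);
            // Only store results that fit into a single char.
            if (nfc.length() == 1) table[c] = nfc.charAt(0);
        }
        return table;
    }

    /** Replaces each precomposed character by its base character via the table. */
    static String strip(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) out[i] = BASE[out[i]];
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(strip("Gr\u00FCn"));   // Grun
        System.out.println(strip("Stra\u00DFe")); // unchanged
    }
}
```

The per-character cost at query time is a single array access, at the price
of a 128 KB table. Input that is already decomposed (a base letter followed
by a separate combining mark) is not handled by this lookup and would need
the marks removed in a separate step.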
>
> Did that make sense?
>
> -- Graydon
>
>
> On Sun, Nov 23, 2014 at 5:19 PM, Christian Grün
> <christian.gruen@gmail.com> wrote:
>> Hi Graydon,
>>
>> I just had a look. In BaseX, "without diacritics" can be explained by
>> a single, glorious mapping table [1].
>>
>> It's quite obvious that there are just too many cases which are not
>> covered by this mapping. We introduced this solution in the very
>> beginnings of our full-text implementation, and I am just surprised
>> that it survived for such a long time, probably because it was
>> sufficient for most use cases our users came across so far.
>>
>> However, I would like to extend the current solution with something
>> more general and, still, more efficient than full Unicode
>> normalizations (performance-wise, the current mapping is probably
>> difficult to beat). As you already indicated, the XQFT spec left it to
>> the implementers to decide what diacritics are.
>>
>>> I'd like to advocate for an equivalent to the "decomposed normal form,
>>> strip the non-spacing modifier characters, recompose to composed
>>> normal form" equivalence because at least that one is plausibly well
>>> understood.
>>
>> Shame on me; could you give me some quick tutoring on what this would
>> mean?… Would accents and dots from German umlauts, and other
>> characters in the range of \C380-\C3BF, be stripped as well by that
>> recomposition? And just in case you know more about it: What happens
>> with characters like the German "ß" that is typically rewritten to two
>> characters ("ss")?
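
Which characters in that range would be affected can be surveyed
mechanically. The sketch below assumes that \C380-\C3BF refers to the UTF-8
byte sequences C3 80 to C3 BF, i.e. the code points U+00C0 to U+00FF, and
uses Normalizer.isNormalized to test for a canonical decomposition. The
class and method names are hypothetical.

```java
import java.text.Normalizer;

class Latin1Survey {
    /** True if the code point is changed by canonical decomposition (NFD). */
    static boolean hasCanonicalDecomposition(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return !Normalizer.isNormalized(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // Assumed range: U+00C0..U+00FF (UTF-8 bytes C3 80 .. C3 BF).
        for (int cp = 0x00C0; cp <= 0x00FF; cp++) {
            System.out.printf("U+%04X %s %s%n", cp,
                    new String(Character.toChars(cp)),
                    hasCanonicalDecomposition(cp) ? "decomposes" : "no decomposition");
        }
    }
}
```

Under this test the umlauts (for example U+00FC) decompose and would lose
their dots, while U+00DF (sharp s), U+00C6 (AE) and U+00D8 (O with stroke)
have no decomposition and would be left as they are.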
>>
>> Thanks,
>> Christian
>>
>> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Token.java#L1420