Hi Chris,
I am glad to report that the latest snapshot of BaseX [1] now provides much better support for diacritical characters.
Please find more details in my next mail to Graydon.
Hope this helps, Christian
[1] http://files.basex.org/releases/latest/
On Sun, Nov 23, 2014 at 11:56 PM, Graydon Saunders graydonish@gmail.com wrote:
Hi Christian --
That is indeed a glorious table! :)
Unicode defines whether or not a character has a decomposition; so e-with-acute, U+00E9, decomposes into U+0065 + U+0301 (an e and a combining acute accent.) I think the presence of a decomposition is a recoverable character property in Java. (it is in Perl. :)
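As an illustration (a sketch of the idea, not BaseX code), Java exposes exactly this via java.text.Normalizer: normalizing to NFD makes the decomposition visible, and isNormalized tells you whether a string had one at all:

```java
import java.text.Normalizer;

public class DecompositionCheck {
    public static void main(String[] args) {
        String composed = "\u00E9"; // e with acute, U+00E9
        String nfd = Normalizer.normalize(composed, Normalizer.Form.NFD);

        // NFD splits it into the base letter and the combining acute accent
        System.out.println(nfd.length()); // 2
        System.out.printf("U+%04X U+%04X%n",
            (int) nfd.charAt(0), (int) nfd.charAt(1)); // U+0065 U+0301

        // A character "has a decomposition" iff NFD changes the string
        System.out.println(Normalizer.isNormalized(composed,
            Normalizer.Form.NFD)); // false
    }
}
```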
U+0386, "Greek Capital Letter Alpha With Tonos", has a decomposition, so the combining acute accent -- U+0301 again! -- would strip.
If one is going to go all strict-and-high-church Unicode, a "diacritic" is "anything that decomposes into a combining (that is, non-spacing) character code point when considering the decomposed normal form" (NFD or NFKD in the Unicode spec). This would NOT convert U+00DF, "latin small letter sharp s", into "ss", because per the Unicode Consortium, sharp s is a full letter rather than a modified s. (Same with thorn not decomposing into "th", and so on for other characters that are considered full letters, which can get surprising with the Scandinavian dotted A's and such.) The disadvantage is that users of BaseX might expect that comparison to work; the advantage is that the arbitrarily large number of arguments, headaches, and natural-language edge cases can be shifted off to the Unicode guys by saying "we're following the Unicode character category rules".
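In Java terms, that "decompose, strip the non-spacing marks, recompose" equivalence could be sketched like this (a sketch under the strict-Unicode reading, not a proposal for BaseX's actual implementation):

```java
import java.text.Normalizer;

public class StripDiacritics {
    // Decompose to NFD, drop combining marks (Unicode category Mn),
    // then recompose to NFC.
    static String stripDiacritics(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = nfd.replaceAll("\\p{Mn}+", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("\u00E9")); // e
        System.out.println(stripDiacritics("\u0386")); // U+0391, plain capital alpha
        System.out.println(stripDiacritics("\u00DF")); // unchanged: no decomposition
    }
}
```

Note that the sharp s falls straight through, as described above: it has no NFD decomposition, so there is nothing to strip.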
It also gives you something that can be pointed to as an explanation, and it works like the existing normalize-unicode function. That is not the same as saying it's easy to understand, but it's something.
How to do it efficiently, well, my knowledge of Java would probably fit on the bottom of your shoe. On the plus side, Java regular expressions support the \p{...} Unicode character category syntax, so the data has to be in there somewhere. I'd think there's an efficient way to load the huge horrible table once and then filter the characters by property: if a character has a decomposition, keep only the members of the decomposition for which something like Character.isUnicodeIdentifierStart() returns true. That comes to mind as something that might work.
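One way to avoid the regex machinery entirely (a sketch, and only one of several property-based filters one could choose) is to walk the NFD form code point by code point and drop everything whose Unicode general category is non-spacing mark, which reuses the JDK's built-in character table rather than loading one's own:

```java
import java.text.Normalizer;

public class CodepointFilter {
    // Single pass over the NFD form: keep every code point that is not
    // a non-spacing mark (general category Mn), then recompose to NFC.
    static String stripMarks(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(nfd.length());
        nfd.codePoints()
           .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
           .forEach(sb::appendCodePoint);
        return Normalizer.normalize(sb.toString(), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("na\u00EFve caf\u00E9")); // naive cafe
    }
}
```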
Did that make sense?
-- Graydon
On Sun, Nov 23, 2014 at 5:19 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Graydon,
I just had a look. In BaseX, "without diacritics" can be explained by this single, glorious mapping table [1].
It's quite obvious that there are just too many cases that are not covered by this mapping. We introduced this solution in the very beginning of our full-text implementation, and I am surprised that it has survived for such a long time, probably because it was sufficient for most of the use cases our users have come across so far.
However, I would like to extend the current solution with something more general and, still, more efficient than full Unicode normalizations (performance-wise, the current mapping is probably difficult to beat). As you already indicated, the XQFT spec left it to the implementers to decide what diacritics are.
I'd like to advocate for an equivalent to the "decomposed normal form, strip the non-spacing modifier characters, recompose to composed normal form" equivalence because at least that one is plausibly well understood.
Shame on me; could you give me some quick tutoring on what this would mean?… Would accents and dots from German umlauts, and other characters in the range of \C380-\C3BF, be stripped as well by that recomposition? And just in case you know more about it: What happens with characters like the German "ß", which is typically rewritten as two characters ("ss")?
Thanks, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...