Re: [basex-talk] More Diacritic Questions

23 Nov 2014


I just found a mapping table proposed by John Cowan [1]. It's already
pretty old, so it doesn't cover newer Unicode versions, but it's
surely better than our current solution.
[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m08/0047.html
On Sun, Nov 23, 2014 at 11:19 PM, Christian Grün
christian.gruen@gmail.com wrote:
...
Hi Graydon,
I just had a look. In BaseX, "without diacritics" can be explained by
a single, glorious mapping table [1].
It's quite obvious that there are just too many cases which are not
covered by this mapping. We introduced this solution at the very
beginning of our full-text implementation, and I am just surprised
that it survived for such a long time, probably because it was
sufficient for most use cases our users came across so far.
However, I would like to extend the current solution with something
more general and, still, more efficient than full Unicode
normalizations (performance-wise, the current mapping is probably
difficult to beat). As you already indicated, the XQFT spec left it to
the implementers to decide what diacritics are.
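
For readers unfamiliar with the table-based approach described above, here
is a minimal sketch in Python. It is not the BaseX table: the entries and
the names DIACRITIC_TABLE and strip_by_table are hypothetical and only
illustrate why a fixed table is fast but incomplete.

```python
# Hypothetical excerpt of a diacritics mapping table. The real BaseX table
# is larger and is not reproduced here.
DIACRITIC_TABLE = str.maketrans({
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U",
    "é": "e", "è": "e", "ñ": "n",
})


def strip_by_table(text: str) -> str:
    """Replace every character found in the table; leave the rest untouched."""
    return text.translate(DIACRITIC_TABLE)


covered = strip_by_table("Grün")    # "ü" is in the table
uncovered = strip_by_table("ǎ")     # U+01CE is not in the table, so it survives
```

The lookup is a single pass over the string, which is why such a table is
hard to beat on speed, but any character missing from the table passes
through unchanged.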
...
I'd like to advocate for an equivalent to the "decomposed normal form,
strip the non-spacing modifier characters, recompose to composed
normal form" equivalence because at least that one is plausibly well
understood.
Shame on me; could you give me some quick tutoring on what this would
mean? Would accents and dots from German umlauts, and other
characters in the range of \C380-\C3BF (UTF-8 bytes C3 80 to C3 BF,
i.e. U+00C0 to U+00FF), be stripped as well by that
recomposition? And just in case you know more about it: What happens
with characters like the German "ß" that is typically rewritten to two
characters ("ss")?
Thanks,
Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
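
The "decompose, strip the non-spacing marks, recompose" equivalence quoted
above can be sketched with Python's standard unicodedata module. This is an
illustration only, not BaseX code; the function name strip_diacritics is
made up for this example, and "non-spacing modifier characters" is taken to
mean Unicode general category Mn.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """NFD-decompose, drop non-spacing marks (category Mn), recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


umlaut = strip_diacritics("Grün")        # "ü" decomposes to "u" + U+0308
eszett = strip_diacritics("Straße")      # "ß" has no canonical decomposition
stroke = strip_diacritics("ø")           # neither does "ø" (U+00F8)
folded = "Straße".casefold()             # full case folding maps "ß" to "ss"
```

Under these assumptions the umlaut dots and accents are removed, because
those letters decompose into a base letter plus a combining mark. "ß" and
letters such as "ø" are left alone, since Unicode defines no decomposition
for them; rewriting "ß" to "ss" is a case-folding matter and would need a
separate step.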
