Hi Michael,
> Yes, this is reinventing the wheel, and my impression is that Token.java is also a reinvented wheel, and I'm sorry to say that it doesn't look very round to me.
I completely agree; the more languages we support, the more sensible it becomes to rely on Java's existing Unicode framework.
Performance was the main reason why we initially introduced our own mappings and algorithms. As we internally keep all strings as UTF-8 byte arrays (i.e., in the same representation in which they are stored on disk), we can often avoid converting bytes to Strings and vice versa, which can get quite costly if it's done millions of times. This is also one of the reasons why our existing full-text algorithms are more efficient than some competing implementations.
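To make the trade-off concrete, here is a minimal sketch (a hypothetical class, not the actual Token.java code) of what working directly on UTF-8 bytes buys you: equality checks never have to decode the bytes into a String.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    // Hypothetical example class; BaseX's real Token.java looks different.
    final class Utf8Token {
      private final byte[] bytes;

      Utf8Token(final String s) {
        // one-time conversion; afterwards everything stays on the byte level
        bytes = s.getBytes(StandardCharsets.UTF_8);
      }

      boolean sameAs(final Utf8Token other) {
        // byte-wise comparison: no decoding, no temporary Strings
        return Arrays.equals(bytes, other.bytes);
      }

      String string() {
        // decoding only happens when a String is really needed
        return new String(bytes, StandardCharsets.UTF_8);
      }
    }

The decoding step in string() is exactly the kind of work we try to keep out of the inner loops.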
However, due to the increasing demands BaseX is confronted with, we'll probably need to make some more compromises here, or – even better – find a hybrid solution (e.g., avoid expensive algorithms whenever a string is known to contain only standard characters). Our existing full-text architecture actually provides more than enough features for the majority of our existing projects, but it's rather poor from a linguistic perspective.
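Such a hybrid could look roughly like this (just a sketch, assuming a pure-ASCII test as the fast-path criterion): cheap byte-level lower-casing for ASCII tokens, and Java's Unicode-aware handling only for everything else.

    import java.nio.charset.StandardCharsets;

    // Sketch of a possible fast/slow path split; not part of BaseX.
    final class HybridCase {
      /** Lower-cases a UTF-8 token, with a cheap path for ASCII-only input. */
      static byte[] lower(final byte[] token) {
        if (ascii(token)) {
          // fast path: stay on the byte level
          final byte[] out = new byte[token.length];
          for (int i = 0; i < token.length; i++) {
            final byte b = token[i];
            out[i] = b >= 'A' && b <= 'Z' ? (byte) (b + 32) : b;
          }
          return out;
        }
        // slow path: fall back to Java's Unicode framework
        final String s = new String(token, StandardCharsets.UTF_8);
        return s.toLowerCase().getBytes(StandardCharsets.UTF_8);
      }

      /** True if all bytes are 7-bit ASCII (no byte has the high bit set). */
      static boolean ascii(final byte[] token) {
        for (final byte b : token) if (b < 0) return false;
        return true;
      }
    }

The ASCII check is a single pass over the bytes, so the expensive path only kicks in for tokens that actually need it.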
Thanks for pointing out the Mn property; I'll keep that in mind.

Christian
PS: If anyone reading this is interested in making our full-text architecture more flexible and international, just let us know. Once more, I'd like to thank Toshio HIRAI for integrating our Japanese tokenization and stemming code (http://docs.basex.org/wiki/Full-Text:_Japanese)!