Hi Alex,
thanks for your mail.
I am trying to use it with Greek texts, and the default collation assumes that ά != α.
As you correctly guessed, our tokenizer is not (yet) tailored for Greek text corpora. To speed things up, we use a simple static Unicode mapping for character normalization:
https://github.com/BaseXdb/basex/blob/master/src/main/java/org/basex/util/To...
If you can provide me with some appropriate tables for Greek characters, I'll be glad to extend this mapping.
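
One possible way to produce such a table is to derive it from the Unicode decomposition data that ships with the JDK, rather than typing it by hand. The following is only a rough sketch under a few assumptions: the class name GreekNormTable and its methods are made up for illustration and are not part of BaseX, the two code point ranges (Greek and Coptic, Greek Extended) are my guess at what is relevant, and the output format would still have to be adapted to whatever the static mapping expects. The result should be reviewed by hand, since a few punctuation-like characters in these blocks also decompose to non-Greek characters.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of BaseX): derives an accent-stripping table for Greek. */
final class GreekNormTable {
  /** Returns the character without combining marks, or the character itself if nothing applies. */
  public static char strip(final char ch) {
    // NFD splits a precomposed letter into its base letter plus combining marks,
    // e.g. U+03AC (alpha with tonos) becomes U+03B1 (alpha) + U+0301
    final String nfd = Normalizer.normalize(String.valueOf(ch), Normalizer.Form.NFD);
    // drop all characters of the Unicode category "Mark"
    final String base = nfd.replaceAll("\\p{M}+", "");
    return base.length() == 1 ? base.charAt(0) : ch;
  }

  /** Collects all characters of the assumed ranges whose stripped form differs. */
  public static Map<Character, Character> table() {
    final Map<Character, Character> map = new LinkedHashMap<>();
    // assumed ranges: Greek and Coptic (U+0370..U+03FF), Greek Extended (U+1F00..U+1FFF)
    final int[][] ranges = { { 0x0370, 0x03FF }, { 0x1F00, 0x1FFF } };
    for(final int[] range : ranges) {
      for(int cp = range[0]; cp <= range[1]; cp++) {
        if(!Character.isDefined(cp)) continue;
        final char ch = (char) cp;
        final char norm = strip(ch);
        if(norm != ch) map.put(ch, norm);
      }
    }
    return map;
  }

  /** Prints one "source -> target" pair per line. */
  public static void main(final String[] args) {
    for(final Map.Entry<Character, Character> e : table().entrySet()) {
      final char k = e.getKey(), v = e.getValue();
      System.out.printf("%c (U+%04X) -> %c (U+%04X)%n", k, (int) k, v, (int) v);
    }
  }
}
```

Characters without a decomposition, such as the final sigma (U+03C2), are left out of the table; whether they should be folded as well is a separate decision.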
As a side question: I noticed that the stemmers taken from Lucene are quite outdated. Lucene 3.6.0 also includes a Greek stemmer. I tried to include the 3.6.0 stemmers instead, but the language codes seem to be hardcoded in util/ft/Language.java.
Do you have a direct reference to your preferred Greek stemmer class? It will be easy for us to include it directly in our core package (the main advantage will be improved performance).
Hope this helps, Christian