Hi Christian,
On 2012-06-23, Christian Grün <christian.gruen@gmail.com> wrote:
>> Yes, this is reinventing the wheel, and my impression is that Token.java is also a reinvented wheel, and I'm sorry to say that it doesn't look very round to me.
> I completely agree; the more languages we support, the more reasonable it seems to resort to Java’s existing Unicode framework.
I'm glad you agree. The following is meant as constructive criticism, even if it may sound quite harsh.
> Performance was the main reason why we initially introduced our own mappings and algorithms. As we internally keep all strings as UTF8 byte arrays (i.e., in the same way as they are stored on disk), we can often avoid converting bytes to Strings and vice versa, as this can get quite costly if it's done millions of times. This is also one of the reasons why our existing full-text algorithms are more efficient than some competing implementations.
Well, one may also say you're cutting corners... As a computer scientist, I think BaseX is great, and that's also why I'm following the mailing list even though I'm not currently using BaseX. Why not? Because I'm also a digital humanities researcher. I work with narrative-oriented documents, but BaseX currently has a very strong bias towards record-oriented Latin-1 documents, so strong, in fact, that it is, IMHO, not suited for narrative-oriented documents.
AFAIK, many of the issues that make BaseX currently unsuited for narrative-oriented documents (and thus for the TEI community in general) are explicit design decisions, for example chopping whitespace by default, not supporting the full-text search Ignore option, and using a limited implementation of Unicode.
From my point of view, this is unfortunate: if the full-text search doesn't work for my documents, it doesn't help that it is fast. The same goes for Unicode support: if BaseX doesn't support combining characters (which are needed for more than just our historical texts), it doesn't help that it's fast for some restricted set of characters. It's nice that "α" now matches "ά" (U+03AC) as a precomposed character, but it still doesn't match "ά" (U+03B1 + U+0301) encoded with a combining accent, even though the two encodings are canonically equivalent according to the Unicode Standard. The same goes for "ä" and "ä", and for an infinite number of further combinations. When you create collections from different sources, you will probably end up with documents using precomposed characters *and* combining characters. The difference is only in the encoding; the text looks identical when displayed to users, who rightly expect to find all instances of "alpha with tonos" or "a with diaeresis", regardless of how they happen to be encoded.
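
To make the point concrete, here is a minimal sketch using the JDK's java.text.Normalizer (deliberately not Token.java): the two encodings of "ά" only compare equal after canonical normalization.

  import java.text.Normalizer;

  public class CanonicalEquivalence {
      public static void main(String[] args) {
          String precomposed = "\u03AC";        // "ά" as one precomposed code point
          String combining   = "\u03B1\u0301";  // "α" followed by a combining acute accent

          // A plain string comparison sees two different code point sequences.
          System.out.println(precomposed.equals(combining));   // false

          // After canonical normalization (NFC here; NFD works just as well),
          // the two forms compare equal.
          String a = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
          String b = Normalizer.normalize(combining,   Normalizer.Form.NFC);
          System.out.println(a.equals(b));                      // true
      }
  }

The same normalization step would let "a with diaeresis" match regardless of which form a given source document uses.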
> However, due to the increasing demands BaseX is confronted with, we'll probably need to make some more compromises here, or – even better – find a hybrid solution (e.g., avoid expensive algorithms whenever a string is known to have only standard characters). Our existing full-text architecture actually provides more than enough features when it comes to the majority of our existing projects, but it's rather poor from a linguistic perspective.
I'd say the problem is not so much the linguistics as the language-independent Unicode support, which would give you a basic level of functionality for all languages.
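
As for the hybrid solution you mention, one possible shape, purely as a sketch and not based on BaseX internals (the class and method names below are made up for illustration), would be an ASCII fast path over the stored UTF-8 bytes, with normalization only on the slow path:

  import java.nio.charset.StandardCharsets;
  import java.text.Normalizer;
  import java.util.Arrays;

  // Hypothetical helper, not part of BaseX: token equality that stays cheap
  // for plain ASCII and falls back to canonical normalization otherwise.
  public final class TokenEquality {

      // True if every byte is 7-bit ASCII, i.e. there are no multi-byte
      // UTF-8 sequences and hence no combining characters to worry about.
      static boolean isAscii(byte[] token) {
          for (byte b : token) {
              if (b < 0) return false;   // high bit set => non-ASCII byte
          }
          return true;
      }

      // Canonically equivalent comparison of two UTF-8 encoded tokens.
      static boolean equalTokens(byte[] t1, byte[] t2) {
          if (isAscii(t1) && isAscii(t2)) {
              // Fast path: byte-wise comparison, no String conversion at all.
              return Arrays.equals(t1, t2);
          }
          // Slow path: decode and normalize, so that precomposed and combining
          // forms of the same character compare equal.
          String s1 = new String(t1, StandardCharsets.UTF_8);
          String s2 = new String(t2, StandardCharsets.UTF_8);
          return Normalizer.normalize(s1, Normalizer.Form.NFC)
                  .equals(Normalizer.normalize(s2, Normalizer.Form.NFC));
      }
  }

Case folding and diacritic-insensitive matching would of course need more than this, but the idea is that tokens known to contain only ASCII would not pay for the Unicode handling that narrative-oriented documents need.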
Best regards