Hi Christian,
On 2012-06-23, Christian Grün <christian.gruen@gmail.com> wrote:
>> Yes, this is reinventing the wheel, and my impression is that Token.java is also a reinvented wheel, and I'm sorry to say that it doesn't look very round to me.
> I completely agree; the more languages we support, the more reasonable it seems to resort to Java’s existing Unicode framework.
I'm glad you agree. The following is meant as constructive criticism, even if it may sound quite harsh.
> Performance was the main reason why we initially introduced our own mappings and algorithms. As we internally keep all strings as UTF8 byte arrays (i.e., in the same way as they are stored on disk), we can often avoid converting bytes to Strings and vice versa, as this can get quite costly if it's done millions of times. This is also one of the reasons why our existing full-text algorithms are more efficient than some competing implementations.
Well, one may also say you're cutting corners... As a computer scientist, I think BaseX is great, and that's also why I'm following the mailing list even though I'm not currently using BaseX. Why not? Because I'm also a digital humanities researcher. I work with narrative-oriented documents, but BaseX currently has a very strong bias towards record-oriented Latin-1 documents, so strong, in fact, that it is, IMHO, not suited for narrative-oriented documents.
AFAIK, many of the issues that make BaseX currently unsuited for narrative-oriented documents (and thus for the TEI community in general) are explicit design decisions, for example chopping whitespace by default, not supporting the full-text search Ignore option, and using a limited implementation of Unicode.
From my point of view, this is unfortunate: if the full-text search doesn't work for my documents, it doesn't help that it is fast. The same goes for Unicode support: if BaseX doesn't support combining characters (which are needed for more than just our historical texts), it doesn't help that it's fast for some restricted set of characters. It's nice that "α" now matches "ά" (U+03AC) as a precomposed character, but it still doesn't match "ά" (U+03B1 + U+0301) encoded with a combining accent, even though the two encodings are canonically equivalent according to the Unicode Standard. The same goes for "ä" and "ä", and for an infinite number of further combinations. When you create collections from different sources, you will probably end up with documents using precomposed characters *and* combining characters. The difference is only in the encoding; the text looks identical when displayed to users, who rightly expect to find all instances of "alpha with tonos" or "a with diaeresis", regardless of how they happen to be encoded.
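
To make the point concrete, here is a minimal sketch using the JDK's java.text.Normalizer (deliberately not Token.java): the two encodings of "ά" only compare equal after canonical normalization.

  import java.text.Normalizer;

  public class CanonicalEquivalence {
      public static void main(String[] args) {
          String precomposed = "\u03AC";        // "ά" as one precomposed code point
          String combining   = "\u03B1\u0301";  // "α" followed by a combining acute accent

          // A plain string comparison sees two different code point sequences.
          System.out.println(precomposed.equals(combining));   // false

          // After canonical normalization (NFC here; NFD works just as well),
          // the two forms compare equal.
          String a = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
          String b = Normalizer.normalize(combining,   Normalizer.Form.NFC);
          System.out.println(a.equals(b));                      // true
      }
  }

The same normalization step would let "a with diaeresis" match regardless of which form a given source document uses.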
> However, due to the increasing demands BaseX is confronted with, we'll probably need to make some more compromises here, or – even better – find a hybrid solution (e.g., avoid expensive algorithms whenever a string is known to have only standard characters). Our existing full-text architecture actually provides more than enough features when it comes to the majority of our existing projects, but it's rather poor from a linguistic perspective.
I'd say the problem is not so much the linguistics as the language-independent Unicode support, which would give you a basic level of functionality for all languages.
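
As for the hybrid solution you mention, one possible shape, purely as a sketch and not based on BaseX internals (the class and method names below are made up for illustration), would be an ASCII fast path over the stored UTF-8 bytes, with normalization only on the slow path:

  import java.nio.charset.StandardCharsets;
  import java.text.Normalizer;
  import java.util.Arrays;

  // Hypothetical helper, not part of BaseX: token equality that stays cheap
  // for plain ASCII and falls back to canonical normalization otherwise.
  public final class TokenEquality {

      // True if every byte is 7-bit ASCII, i.e. there are no multi-byte
      // UTF-8 sequences and hence no combining characters to worry about.
      static boolean isAscii(byte[] token) {
          for (byte b : token) {
              if (b < 0) return false;   // high bit set => non-ASCII byte
          }
          return true;
      }

      // Canonically equivalent comparison of two UTF-8 encoded tokens.
      static boolean equalTokens(byte[] t1, byte[] t2) {
          if (isAscii(t1) && isAscii(t2)) {
              // Fast path: byte-wise comparison, no String conversion at all.
              return Arrays.equals(t1, t2);
          }
          // Slow path: decode and normalize, so that precomposed and combining
          // forms of the same character compare equal.
          String s1 = new String(t1, StandardCharsets.UTF_8);
          String s2 = new String(t2, StandardCharsets.UTF_8);
          return Normalizer.normalize(s1, Normalizer.Form.NFC)
                  .equals(Normalizer.normalize(s2, Normalizer.Form.NFC));
      }
  }

Case folding and diacritic-insensitive matching would of course need more than this, but the idea is that tokens known to contain only ASCII would not pay for the Unicode handling that narrative-oriented documents need.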
Best regards