Hi Michael,
> Yes, this is reinventing the wheel, and my impression is that Token.java is also a reinvented wheel, and I'm sorry to say that it doesn't look very round to me.
I completely agree; the more languages we support, the more sensible it becomes to rely on Java's existing Unicode framework.
Performance was the main reason why we initially introduced our own mappings and algorithms. As we internally keep all strings as UTF-8 byte arrays (i.e., in the same representation in which they are stored on disk), we can often avoid converting bytes to Strings and vice versa, which can get quite costly if it's done millions of times. This is also one of the reasons why our existing full-text algorithms are more efficient than some competing implementations.
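To make the trade-off concrete, here is a minimal sketch (a hypothetical class, not the actual Token.java code) of what working directly on UTF-8 bytes buys you: equality checks never have to decode the bytes into a String.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    // Hypothetical example class; BaseX's real Token.java looks different.
    final class Utf8Token {
      private final byte[] bytes;

      Utf8Token(final String s) {
        // one-time conversion; afterwards everything stays on the byte level
        bytes = s.getBytes(StandardCharsets.UTF_8);
      }

      boolean sameAs(final Utf8Token other) {
        // byte-wise comparison: no decoding, no temporary Strings
        return Arrays.equals(bytes, other.bytes);
      }

      String string() {
        // decoding only happens when a String is really needed
        return new String(bytes, StandardCharsets.UTF_8);
      }
    }

The decoding step in string() is exactly the kind of work we try to keep out of the inner loops.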
However, due to the increasing demands BaseX is confronted with, we'll probably need to make some more compromises here, or – even better – find a hybrid solution (e.g., avoid expensive algorithms whenever a string is known to contain only standard characters). Our existing full-text architecture actually provides more than enough features for the majority of our existing projects, but it's rather poor from a linguistic perspective.
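Such a hybrid could look roughly like this (just a sketch, assuming a pure-ASCII test as the fast-path criterion): cheap byte-level lower-casing for ASCII tokens, and Java's Unicode-aware handling only for everything else.

    import java.nio.charset.StandardCharsets;

    // Sketch of a possible fast/slow path split; not part of BaseX.
    final class HybridCase {
      /** Lower-cases a UTF-8 token, with a cheap path for ASCII-only input. */
      static byte[] lower(final byte[] token) {
        if (ascii(token)) {
          // fast path: stay on the byte level
          final byte[] out = new byte[token.length];
          for (int i = 0; i < token.length; i++) {
            final byte b = token[i];
            out[i] = b >= 'A' && b <= 'Z' ? (byte) (b + 32) : b;
          }
          return out;
        }
        // slow path: fall back to Java's Unicode framework
        final String s = new String(token, StandardCharsets.UTF_8);
        return s.toLowerCase().getBytes(StandardCharsets.UTF_8);
      }

      /** True if all bytes are 7-bit ASCII (no byte has the high bit set). */
      static boolean ascii(final byte[] token) {
        for (final byte b : token) if (b < 0) return false;
        return true;
      }
    }

The ASCII check is a single pass over the bytes, so the expensive path only kicks in for tokens that actually need it.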
Thanks for pointing out the Mn property; I'll keep that in mind.

Christian
PS: If anyone reading this is interested in making our full-text architecture more flexible and international, just let us know. Once more, I'd like to thank Toshio HIRAI for integrating our Japanese tokenization and stemming code (http://docs.basex.org/wiki/Full-Text:_Japanese)!