Hi, I have a question relating to full-text search. I am trying to use it with Greek texts, and the default collation assumes that ά != α. In short, accented vowels are treated as different letters. Is this something that has to do with the collation used not being Greek, or does it have something to do with the tokenizer?
As a side question, I noticed that the stemmers used from Lucene are quite outdated. 3.6.0 also includes a Greek stemmer. I tried to include the 3.6.0 stemmers instead, but language codes seem to be hardcoded in util/ft/Language.java. Any chance of that part of the code being updated to use the latest stemmers so more languages can be integrated? (I am not proficient in Java myself, unfortunately, so I can't directly help.)
thanks, Alex
Hi Alex,
thanks for your mail.
I am trying to use it with Greek texts and the default collation assumes that ά != α
As you correctly guessed, our tokenizer is not tailored (yet) to Greek text corpora. To speed things up, we are using a simple static Unicode mapping for character normalization.
https://github.com/BaseXdb/basex/blob/master/src/main/java/org/basex/util/To...
If you'd manage to provide me with some appropriate tables for Greek characters, I'll be glad to extend this mapping.
As a side question, I noticed that the stemmers used from Lucene are quite outdated. 3.6.0 also includes a Greek stemmer. I tried to include the 3.6.0 stemmers instead, but language codes seem to be hardcoded in util/ft/Language.java
Do you have a direct reference to your preferred Greek stemmer class? It will be easy for us to include it directly in our core package (the main advantage will be improved performance).
Hope this helps, Christian
Hi Christian,
If you'd manage to provide me with some appropriate tables for Greek characters, I'll be glad to extend this mapping.
If I understood correctly, I used the Unicode code points from
http://unicode.org/charts/PDF/U0370.pdf
to produce the following mapping:
{'\u0390', 'ι'},
{'\u03b0', 'υ'},
{'\u03d3', 'Υ'},
{'\u03d4', 'Υ'},
{'\u0386', 'Α'},
{'\u0388', 'Ε'},
{'\u0389', 'Η'},
{'\u038a', 'Ι'},
{'\u03aa', 'Ι'},
{'\u03ca', 'ι'},
{'\u03ab', 'Υ'},
{'\u03cb', 'υ'},
{'\u038c', 'Ο'},
{'\u03ac', 'α'},
{'\u03cc', 'ο'},
{'\u03ad', 'ε'},
{'\u03cd', 'υ'},
{'\u038e', 'Υ'},
{'\u03ae', 'η'},
{'\u03ce', 'ω'},
{'\u038f', 'Ω'},
{'\u03af', 'ι'},
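For illustration only, here is a rough sketch (hypothetical class and method names, not the actual BaseX code) of how a static table like the one above could be applied as a fold step during normalization:

  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical helper: folds Greek accented characters to their base
  // letters, using a static table like the one posted above.
  public final class GreekFold {
    private static final char[][] MAPPINGS = {
      { '\u03ac', 'α' }, { '\u03ad', 'ε' }, { '\u03ae', 'η' }, { '\u03af', 'ι' },
      { '\u03cc', 'ο' }, { '\u03cd', 'υ' }, { '\u03ce', 'ω' },
      // ... remaining entries from the table above ...
    };
    private static final Map<Character, Character> MAP = new HashMap<>();
    static {
      for(final char[] m : MAPPINGS) MAP.put(m[0], m[1]);
    }

    // Replaces every mapped character; all others pass through unchanged.
    public static String fold(final String in) {
      final StringBuilder sb = new StringBuilder(in.length());
      for(int i = 0; i < in.length(); i++) {
        final char c = in.charAt(i);
        final Character r = MAP.get(c);
        sb.append(r != null ? r : c);
      }
      return sb.toString();
    }
  }

Applied once per token, this keeps the cheap character-level approach, but it only ever covers the characters listed explicitly, which is exactly the limitation discussed next.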
I am concerned, though, because this is not always the desired behavior. Sometimes (i.e. in an academic context) I could see the need for accent-sensitive searches. The optimal scenario would be to have these mappings in an easily parsable text file (just like the stopword list behaves). OTOH this would be reinventing the collation wheel (an oversimplified version of it). For example, the mapping above only covers modern Greek text; Ancient/Polytonic Greek has many more mappings that are not needed for modern Greek. Also, I'm pretty sure other languages have such needs too.
I'd like to hear your thoughts on this.
Do you have a direct reference to your prefered Greek stemmer class?
The specific class I am referring to is:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/el/Gr...
It is included in the lucene-3.6.0.tgz tarball under contrib/analyzers/common/lucene-analyzers-3.6.0.jar. IIRC the Greek analyzer was introduced in the 3.5.0 release, hence it is not present in the 3.4.0 stemmers jar.
Thanks, alex
On 2012-06-22, Charles Kowalski alxarch@gmail.com wrote:
I am concerned, though, because this is not always the desired behavior. Sometimes (i.e. in an academic context) I could see the need for accent-sensitive searches. The optimal scenario would be to have these mappings in an easily parsable text file (just like the stopword list behaves). OTOH this would be reinventing the collation wheel (an oversimplified version of it). For example, the mapping above only covers modern Greek text; Ancient/Polytonic Greek has many more mappings that are not needed for modern Greek. Also, I'm pretty sure other languages have such needs too.
I'd like to hear your thoughts on this.
Yes, this is reinventing the wheel, and my impression is that Token.java is also a reinvented wheel, and I'm sorry to say that it doesn't look very round to me.
Unicode provides all the tools to make reinventing wheels unnecessary, in particular normalization forms, character properties, and the Unicode collation algorithm. These tools are already available in Java (and many other languages) and cover *all* of Unicode, not just small subsets. They implement sensible defaults for most cases in different scenarios. Of course it should be possible to override the defaults for applications with special needs, but for most applications it shouldn't be necessary.
For example, no tables are necessary for stripping accents. Instead, you apply NFD (or NFKD) normalization (decomposing all characters) and then remove all characters with the property Mn (the accents).
The NFKD normalization form also allows you to match "ſ" with "s", "ﬀ" with "ff", "²" with "2", etc.
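To make this concrete, here is a minimal sketch using only the standard library (java.text.Normalizer plus a \p{Mn} regex); the class name is just for this example and nothing here is BaseX-specific:

  import java.text.Normalizer;

  public final class AccentStrip {
    // Decompose (NFD), then drop all nonspacing marks (category Mn):
    // "ά" (U+03AC or U+03B1 + U+0301) both become "α".
    public static String stripAccents(final String s) {
      return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");
    }

    // NFKD additionally folds compatibility characters: "ﬀ" -> "ff", "²" -> "2".
    public static String compatibilityFold(final String s) {
      return Normalizer.normalize(s, Normalizer.Form.NFKD).replaceAll("\\p{Mn}", "");
    }

    public static void main(final String[] args) {
      System.out.println(stripAccents("άλφα"));     // αλφα
      System.out.println(compatibilityFold("ﬀ²"));  // ff2
    }
  }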
Please consult the Unicode Standard; it's all there, so *please* don't try to invent new wheels.
Best regards
Hi Michael,
Yes, this is reinventing the wheel, and my impression is that Token.java is also a reinvented wheel, and I'm sorry to say that it doesn't look very round to me.
I completely agree; the more languages we support, the more reasonable it seems to resort to Java’s existing Unicode framework.
Performance was the main reason why we initially introduced our own mappings and algorithms. As we internally keep all strings as UTF-8 byte arrays (i.e., in the same way as they are stored on disk), we can often avoid converting bytes to Strings and vice versa, which can get quite costly if it's done millions of times. This is also one of the reasons why our existing full-text algorithms are more efficient than some competing implementations.
However, due to the increasing demands BaseX is confronted with, we'll probably need to make some more compromises here, or – even better – find a hybrid solution (e.g., avoid expensive algorithms whenever a string is known to have only standard characters). Our existing full-text architecture actually provides more than enough features when it comes to the majority of our existing projects, but it's rather poor from a linguistic perspective.
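One possible shape for such a hybrid (purely a sketch, not the current BaseX code): inspect the UTF-8 bytes first, and only fall back to full normalization (here the NFD plus Mn-stripping approach mentioned above) when non-ASCII content is actually present.

  import java.nio.charset.StandardCharsets;
  import java.text.Normalizer;

  public final class HybridNormalize {
    // Fast path: pure ASCII tokens need no Unicode normalization at all.
    private static boolean isAscii(final byte[] token) {
      for(final byte b : token) if((b & 0x80) != 0) return false;
      return true;
    }

    public static byte[] normalize(final byte[] token) {
      if(isAscii(token)) return token;                     // cheap, common case
      final String s = new String(token, StandardCharsets.UTF_8);
      final String n = Normalizer.normalize(s, Normalizer.Form.NFD)
          .replaceAll("\\p{Mn}", "");                      // expensive, rare case
      return n.getBytes(StandardCharsets.UTF_8);
    }
  }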
Thanks for pointing out the Mn property; I'll keep that in mind. Christian
PS: If anyone reading this is interested in making our full-text architecture more flexible and international… Just tell us. Once more, I'd like to thank Toshio HIRAI for integrating our Japanese tokenization and stemming code (http://docs.basex.org/wiki/Full-Text:_Japanese)!
Hi Christian,
On 2012-06-23, Christian Grün christian.gruen@gmail.com wrote:
Yes, this is reinventing the wheel, and my impression is that Token.java is also a reinvented wheel, and I'm sorry to say that it doesn't look very round to me.
I completely agree; the more languages we support, the more reasonable it seems to resort to Java’s existing Unicode framework.
I'm glad you agree. The following is meant as constructive criticism, even if it may sound quite harsh.
Performance was the main reason why we initially introduced our own mappings and algorithms. As we internally keep all strings as UTF-8 byte arrays (i.e., in the same way as they are stored on disk), we can often avoid converting bytes to Strings and vice versa, which can get quite costly if it's done millions of times. This is also one of the reasons why our existing full-text algorithms are more efficient than some competing implementations.
Well, one may also say you're cutting corners... As a computer scientist, I think BaseX is great, and that's also why I'm following the mailing list even though I'm not currently using BaseX. Why not? Because I'm also a digital humanities researcher. I'm working with narrative-oriented documents, but BaseX currently has a very strong bias towards record-oriented Latin-1 documents, so strong in fact, that it is, IMHO, not suited for narrative-oriented documents.
AFAIK, many of the issues that make BaseX currently unsuited for narrative-oriented documents (and thus the TEI community in general) are explicit design decisions, for example, chopping whitespace by default, not supporting the full-text search Ignore option, and using a limited implementation of Unicode.
From my point of view, this is unfortunate: if the full-text search doesn't work for my documents, it doesn't help that it is fast. The same goes for Unicode support: if BaseX doesn't support combining characters (which are not only needed for our historical texts), it doesn't help that it's fast for some restricted set of characters. It's nice that "α" now matches "ά" (U+03AC) as a precomposed character, but it still doesn't match "ά" (U+03B1 + U+0301) using a combining accent, even though the two encodings are canonically equivalent as per the Unicode Standard. The same goes for "ä" and "ä" and an infinite number of further combinations. When you create collections from different sources, you will probably end up with documents using precomposed characters *and* combining characters. The difference is only in the encoding; the text will look identical when displayed to users, who expect (and rightly so) to find all instances of "alpha with tonos" or "a with diaeresis," regardless of the encoding.
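As a quick illustration of that canonical equivalence (plain Java, independent of BaseX), the two encodings differ as raw strings but become identical after normalization:

  import java.text.Normalizer;

  public final class CanonEquiv {
    public static void main(final String[] args) {
      final String precomposed = "\u03AC";        // ά as a single code point
      final String combining   = "\u03B1\u0301";  // α + combining acute accent
      System.out.println(precomposed.equals(combining));                  // false
      System.out.println(Normalizer.normalize(precomposed, Normalizer.Form.NFC)
          .equals(Normalizer.normalize(combining, Normalizer.Form.NFC))); // true
    }
  }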
However, due to the increasing demands BaseX is confronted with, we'll probably need to make some more compromises here, or – even better – find a hybrid solution (e.g., avoid expensive algorithms whenever a string is known to have only standard characters). Our existing full-text architecture actually provides more than enough features when it comes to the majority of our existing projects, but it's rather poor from a linguistic perspective.
I'd say, the problem is not so much the linguistics, but the language-independent Unicode support, which would give you a basic level of functionality for all languages.
Best regards
Thanks for your feedback. In a nutshell: yes, it's quite a challenge to satisfy the wild range of scenarios BaseX is used for, which is why we need to set priorities, and can't do justice to all users. As a matter of fact, the project is Open Source, and contributions are welcome and needed (as long as features are not financially sponsored).
Btw, what's your opinion on the Lucene tokenizers and stemmers? As you may know, they also focus on performance. They also bypass Java's Unicode normalization algorithms and do everything by themselves, which is why they may be more relevant to us than the standard Java libraries.
I haven't had time to try the latest snapshot yet. Will do tomorrow.
The algorithm of the Greek Lucene stemmer can be found as JavaScript here: http://people.dsv.su.se/~hercules/greek_stemmer.gr.html The logic seems quite simple: a seven-step regex processing pipeline.
My vote would be for the Lucene libraries too. They would provide a middle ground between performance and feature-completeness, and if the integration was kept up to date with the latest version, any new languages would be 'free' feature upgrades for BaseX. The language coverage seems to be quite extensive already.
alex
On 2012-06-23, Christian Grün christian.gruen@gmail.com wrote:
Thanks for your feedback. In a nutshell: yes, it's quite a challenge to satisfy the wild range of scenarios BaseX is used for, which is why we need to set priorities, and can't do justice to all users.
I agree. I believe, however, that everybody would benefit from good Unicode support :-)
As a matter of fact, the project is Open Source, and contributions are welcome and needed (as long as features are not financially sponsored).
I'm an open-source author myself, so I know what you mean. Unfortunately, I currently don't have any capacities left for contributing code to BaseX.
Btw, what's your opinion on the Lucene tokenizers and stemmers? As you may know, they also focus on performance. They also bypass Java's Unicode normalization algorithms and do everything by themselves, which is why they may be more relevant to us than the standard Java libraries.
My understanding is that this is mostly for historical reasons, not (primarily) for performance reasons; and some of them are probably hacks and workarounds. Lucene's support for Unicode has a number of problems, and I think they're now moving towards the use of ICU [1], see, e.g.,
http://2010.lucene-eurocon.org/sessions-track1-day1.html#2
Some of this is already available in contrib:
https://issues.apache.org/jira/browse/LUCENE-1488 http://lucene.apache.org/core/3_6_0/api/contrib-icu/index.html
I think it would be a good idea for BaseX to have a look at ICU.
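For what it's worth, here is a small sketch of what accent folding and accent-insensitive comparison could look like with ICU4J on the classpath (the com.ibm.icu dependency is an assumption for this example, not something BaseX ships):

  import com.ibm.icu.text.Collator;
  import com.ibm.icu.text.Transliterator;
  import com.ibm.icu.util.ULocale;

  public final class IcuSketch {
    public static void main(final String[] args) {
      // Accent folding via ICU's standard transform chain.
      final Transliterator fold =
          Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove; NFC");
      System.out.println(fold.transliterate("άλφα"));   // αλφα

      // Accent-insensitive, locale-aware comparison for Greek:
      // primary strength ignores diacritics (and case).
      final Collator coll = Collator.getInstance(new ULocale("el"));
      coll.setStrength(Collator.PRIMARY);
      System.out.println(coll.compare("ά", "α") == 0);  // true
    }
  }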
Best regards
Footnotes: [1] http://site.icu-project.org/
Hi Michael, thanks for your links, I'll keep them in mind. Christian
PS to everyone: if you believe that better Unicode support is a major concern for you, feel free to raise your hands.
Hi Alex,
If I understood correctly, I used the Unicode code points from http://unicode.org/charts/PDF/U0370.pdf to produce the following mapping: [...]
thanks; I have added your mappings and uploaded a new stable snapshot [1,2]. The following query should now return true:
"ά" contains text "α"
Next, I've added the Greek stemmer to our internal implementations. It can be invoked by setting "stemming" and "language"; e.g.:
"..." contains text "..." using stemming using language "el"
Due to my non-existent Greek language skills, I'm sorry I had no chance to perform any tests; your feedback is welcome!
I am concerned, though, because this is not always the desired behavior. Sometimes (i.e. in an academic context) I could see the need for accent-sensitive searches.
In this particular case, you can switch off the removal of diacritics via:
"ά" contains text "α" using diacritics sensitive
OTOH this would be reinventing the collation wheel (an oversimplified version of it)
That's true. I'll write some more on that as a reply to Michael’s mail. Christian
[1] http://docs.basex.org/wiki/Releases [2] http://files.basex.org/releases/latest/
Thanks a lot for swiftly attending to this request. I will try it first thing in the morning and let you know the results.
Hi Christian,
I tried the newest version and it works fine for accents. The stemmer, OTOH, does not seem to be working. I think it needs to be integrated in the same way that the other Lucene stemmers are integrated, using the whole lucene-analyzers-3.6.0.jar instead of the lucene-stemmers-3.4.0.jar. This is assuming that all class interfaces are kept unchanged between the 3.4.0 and 3.6.0 Lucene versions. Then again, I'm no good at Java; I'm just talking from a generic OO programming point of view.
Thanks Alex
Hi Αλέξανδρος,
The stemmer, OTOH, does not seem to be working. I think it needs to be integrated in the same way that the other Lucene stemmers are integrated, using the whole lucene-analyzers-3.6.0.jar instead of the lucene-stemmers-3.4.0.jar.
Thanks for your feedback; I already guessed that this might take a little bit more time. Could you provide us with some simple example queries and their expected results? Similar to:
"ά" contains text "α" → true "..." contains text "..." using stemming using language "el" → ...
Thanks in advance, Christian
A little update: I noticed that lucene-analyzers-3.6.0 (in particular the Greek stemmer) is not self-contained, as it has many dependencies on other Lucene packages. As we want to avoid including dependencies on the complete Lucene distribution, I'll try once more to extract the relevant classes and embed them in our core. Your example queries are welcome!
Thanks for your feedback; I already guessed that this might take a little bit more time. Could you provide us with some simple example queries and their expected results? Similar to:
"ά" contains text "α" → true "..." contains text "..." using stemming using language "el" → ...
Thanks in advance, Christian
I noticed a minor bug in my Greek stemmer implementation. After removing two characters in the code, queries such as the following one..
"ΧΑΡΑΚΤΗΡΕΣ" contains text "χαρακτηρ" using stemming using language 'el'
..should now return the same results as the Lucene stemmer. Just try the latest snapshot. Christian
PS: I noticed that Lucene also avoids Java's Unicode normalization and has its own custom character mappings – most probably to improve performance. The following class is triggered by the Greek stemmer implementation:
http://www.docjar.com/html/api/org/apache/lucene/analysis/el/GreekLowerCaseF...