Dear Christian,
I'd be happy to chime in on the quality of BaseX's Chinese-language full-text capabilities. Chinese sources are my primary research area. What exactly do you have in mind?
Greetings, Duncan
Besides, I maintain that exist-db.org must be fixed.
Message: 1
Date: Wed, 14 Oct 2020 12:30:59 +0200
From: Philippe Pons <philippe.pons@college-de-france.fr>
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] stemming chinese texts
Hi Christian,
I suppose some of my colleagues would be able to judge the quality of your full-text search results.
On the other hand, at the code level, I'm not sure I know how to implement an additional class that extends the abstract Tokenizer class.
Thank you for your help,
Philippe
On 14/10/2020 at 11:00, Christian Grün wrote:
Hi Philippe,
Thanks for your private mail, in which I already gave you a brief assessment of what might be necessary to include the CJK tokenizers in BaseX:
The existing Apache code can be adapted and embedded into the BaseX tokenizer infrastructure. At the code level, an additional class needs to be implemented that extends the abstract Tokenizer class [1].
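For illustration only, such a class could wrap one of the Lucene analyzers along the lines of the sketch below. The tokens() method is a placeholder, not the real BaseX API; the authoritative contract is the abstract Tokenizer class in [1], and the sketch assumes Lucene's lucene-analyzers-smartcn module on the classpath:

    // Hypothetical sketch: a Chinese tokenizer wrapping Lucene's
    // SmartChineseAnalyzer. The tokens() method stands in for whatever
    // iteration methods the abstract Tokenizer class [1] prescribes.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class ChineseTokenizer {
      /** Segments Chinese text into word tokens via SmartChineseAnalyzer. */
      public Iterator<String> tokens(final String text) throws IOException {
        final List<String> tokens = new ArrayList<>();
        try(SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
            TokenStream ts = analyzer.tokenStream("", text)) {
          final CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while(ts.incrementToken()) tokens.add(term.toString());
          ts.end();
        }
        return tokens.iterator();
      }
    }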
As far as I can judge, the 3 Lucene CJK analyzers could all be applied to traditional and simplified Chinese. If we found someone who could rate the linguistic quality of our full-text search results, that'd surely be helpful.
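If the three analyzers in question are StandardAnalyzer, CJKAnalyzer and SmartChineseAnalyzer, they can also be tried outside BaseX with a few lines of standalone code: StandardAnalyzer splits CJK text into single characters, CJKAnalyzer into overlapping character bigrams, and SmartChineseAnalyzer into dictionary-segmented words. A minimal sketch, assuming the lucene-analyzers-common and lucene-analyzers-smartcn jars are on the classpath:

    // Prints the tokens each analyzer produces for a sample sentence:
    // unigrams (StandardAnalyzer), bigrams (CJKAnalyzer) and
    // segmented words (SmartChineseAnalyzer).
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class CompareAnalyzers {
      public static void main(final String[] args) throws Exception {
        final String text = "我爱北京天安门";
        for(final Analyzer analyzer : new Analyzer[] {
            new StandardAnalyzer(), new CJKAnalyzer(), new SmartChineseAnalyzer() }) {
          System.out.print(analyzer.getClass().getSimpleName() + ':');
          try(TokenStream ts = analyzer.tokenStream("", text)) {
            final CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while(ts.incrementToken()) System.out.print(" " + term);
            ts.end();
          }
          System.out.println();
          analyzer.close();
        }
      }
    }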
Hope this helps, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
On Tue, Oct 13, 2020 at 12:32 PM Philippe Pons philippe.pons@college-de-france.fr wrote:
Dear Christian,
Thank you very much for this quick and enlightening response.
I have indeed read about the Japanese text tokenizer, though I haven't yet had the opportunity to test it. Supporting Chinese tokenization would also be a great help.
I have never tested what Lucene offers, especially since I have to manage texts in both traditional and simplified Chinese (without being able to read either myself). I would like to test Lucene's analyzers, but I don't know how to do that in BaseX.
Best regards, Philippe Pons
On 12/10/2020 at 12:01, Christian Grün wrote:
Dear Philippe,
As the Chinese language rarely uses inflection, there is usually no need to perform stemming on texts. However, tokenization will indeed be necessary. Right now, BaseX provides no tokenizer/analyzer for Chinese texts. It should be possible to adopt code from Lucene, as we've already done for other languages (our software licenses allow that).
Have you already worked with tokenization of Chinese texts in Lucene? If so, which of the 3 available analyzers [1] has proven to yield the best results?
As you may know, one of our users, Toshio HIRAI, contributed a tokenizer for Japanese texts in the past [2]. If we decide to include support for Chinese tokenization, it might also be interesting to compare the results of the Apache tokenizer with our internal tokenizer.
Kind regards, Christian
[1] https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/anal... [2] https://docs.basex.org/wiki/Full-Text:_Japanese
On Mon, Oct 12, 2020 at 11:37 AM Philippe Pons philippe.pons@college-de-france.fr wrote:
Dear BaseX Team,
I'm currently working on Chinese texts in TEI. I would like to know whether stemming Chinese text is possible in BaseX, as we can do with other languages (like English or German). Or maybe there is a way to add this functionality with Lucene?
Best regards, Philippe Pons
--
Research engineer in charge of digital corpus editions
Centre de recherche sur les civilisations de l'Asie Orientale
CRCAO - UMR 8155 (Collège de France, EPHE, CNRS, PSL Research University, Univ Paris Diderot, Sorbonne Paris Cité)
49bis avenue de la Belle Gabrielle, 75012 Paris
https://cv.archives-ouvertes.fr/ppons