Re: [basex-talk] Full-text lemmatizing and xml:lang

3 Jul 2017


Hi Christian,
To refine the proposal: it would be great if the full-text index could 
be set up to consider xml:lang attributes in the following way:
* If STEMMING is set to true, the input to the stemmer should be 
filtered by matching xml:lang against the LANGUAGE option. Text that is 
sent to the tokenizer could be left as is, without filtering by 
LANGUAGE (see next point).
* If STEMMING is set to false, I agree with you that the general 
strategy for tokenization is okay. For correctness, however, it could 
still be extended to exclude all scripts that don't follow 
Western-centric tokenization algorithms.
* As for the DIACRITICS sensitivity option, what is provided by Unicode 
and the collation used by the query is probably good enough.
What do you think?
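
Until something like this exists in the index itself, a similar effect can be approximated at query time. The following is only a minimal sketch, not BaseX's indexing behavior: the sample data and the variable $language are hypothetical, and fn:lang is used to keep only elements whose xml:lang matches before the full-text test is applied.

```xquery
(: Hypothetical sample data; not taken from a real database. :)
let $texts :=
  <texts>
    <text xml:lang="ru">Добрый день</text>
    <text xml:lang="en">Good day</text>
  </texts>
(: Assumed to play the role of the LANGUAGE option. :)
let $language := 'ru'
(: fn:lang() filters by xml:lang first; only then is the full-text test run. :)
return $texts/text[lang($language)][. contains text 'день']
```

With this data the query should return only the Russian element. Whether a full-text index would be used for such a query on a database is not covered by this sketch.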
Best regards
Kristian K
On 02.07.2017 12:36, Christian Grün wrote:
...
Hi Kristian,
Right now, xml:lang attributes are completely ignored when indexing
full-text. It’s an interesting idea to exclude texts that are marked
with languages different to the one that is currently applied; I will
think about it.
However, I should have mentioned that the language option is mostly
irrelevant unless you use stemmers. Tokenization is pretty much the
same for Western texts, so searches like the following one…
'Добрый ДЕНЬ!' contains text 'день'
     using language 'en'
…will still give you the expected result. To some extent, this also
applies to Arabic texts:
'يوم سعيد' contains text 'يوم'
     using language 'en'
Things are definitely different if you work with Japanese or Chinese
texts. The following query yields false:
'今日は' contains text '今'
     using language 'en'
For more information on Japanese tokenization, see Toshio HIRAI’s
article in our wiki [1].
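
To see how the three sample strings above are split, the tokens can be listed with BaseX's ft:tokenize function. This is only a sketch with default options; the wrapper elements tokens and token are made up for readability, and no particular output is claimed here.

```xquery
(: The sample strings are the ones used in the queries above. :)
for $string in ('Добрый ДЕНЬ!', 'يوم سعيد', '今日は')
return <tokens input="{ $string }">{
  (: ft:tokenize returns the tokens produced by the full-text tokenizer. :)
  for $token in ft:tokenize($string)
  return <token>{ $token }</token>
}</tokens>
```

Comparing the number of token elements per string should make visible where the default tokenizer separates words and where it does not.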
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Full-Text:_Japanese
...
How does BaseX behave if the database content is in many different languages
and is correctly marked with xml:lang attributes? Does the full-text index
consider this information and apply full-text indexing only to elements with
a matching language?
As a simple illustration (untested): will the following code create a
full-text index only for the Russian text, or for both the Russian and the
English text?
db:create(
     'db-ft-ru',
     <texts>
       <text xml:lang="ru">something in Russian</text>
       <text xml:lang="en">something in English</text>
     </texts>,
     'texts.xml',
     map { 'ftindex': true(), 'language': 'ru' }
   )
If BaseX does create the full-text index for both languages (the index
entries for the English text would be useless noise), I would propose simply
filtering on xml:lang attributes according to the language given in the
options map. This should be simpler to implement than the diversification
suggested by Christian.
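
A user-side approximation of this filtering, as a sketch only: the input is reduced to the elements whose xml:lang matches before it is handed to db:create. The variable names and the path 'texts.xml' are assumptions made for illustration, and this does not change how BaseX itself builds the index.

```xquery
(: Hypothetical input, mirroring the illustration above. :)
let $texts :=
  <texts>
    <text xml:lang="ru">something in Russian</text>
    <text xml:lang="en">something in English</text>
  </texts>
let $language := 'ru'
return db:create(
  'db-ft-ru',
  (: Keep only elements whose language matches; fn:lang() reads xml:lang. :)
  <texts>{ $texts/text[lang($language)] }</texts>,
  'texts.xml',
  map { 'ftindex': true(), 'language': $language }
)
```

Note that the copied elements keep their own xml:lang attribute, but a language that was only inherited from an ancestor would be lost by this kind of copying.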
Best regards
Kristian K
