Hi Christian,
To refine the proposal: it would be great if the full-text index could be set up to consider xml:lang attributes in the following way:
* If STEMMING is set to true, the input to the stemmer should be restricted to text whose xml:lang matches the LANGUAGE option. Text that is sent to the tokenizer could be left as is, without being filtered by LANGUAGE (see next point).
* If STEMMING is set to false, I agree with you that the general tokenization strategy is okay. For correctness, though, it could still be extended to exclude all scripts that don't follow Western-centric tokenization algorithms.
* As for the DIACRITICS sensitivity option, what Unicode and the collation used by the query provide is probably good enough.
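Until something like this exists in the index itself, the first point can be approximated at query time. The following is only a sketch, not a description of how BaseX works internally: the sample document and its element names are made up for illustration, and it assumes the processor accepts 'ru' as a full-text language. It uses the standard fn:lang() function to keep only elements whose xml:lang in scope matches, and then runs the full-text test on those:

```xquery
(: Sketch only: restrict a full-text search to elements in one language.
   The sample document below is hypothetical. :)
let $texts :=
  <texts>
    <text xml:lang="ru">Добрый день</text>
    <text xml:lang="en">Good day</text>
  </texts>
(: fn:lang() tests the xml:lang value in scope for each element;
   elements without any xml:lang in scope are skipped as well :)
for $t in $texts/text[lang('ru')]
where $t contains text 'день' using language 'ru'
return $t
```

This filters the search results, not the index, so an index built over all languages would still contain the non-matching text.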
What do you think?
Best regards Kristian K
On 02.07.2017 12:36, Christian Grün wrote:
Hi Kristian,
Right now, xml:lang attributes are completely ignored when indexing full-text. It’s an interesting idea to exclude texts that are marked with languages different to the one that is currently applied; I will think about it.
However, I should have mentioned that the language option is mostly irrelevant unless you use stemmers. Tokenization is pretty much the same for Western texts, so searches like the following one…
'Добрый ДЕНЬ!' contains text 'день' using language 'en'
…will still give you the expected result. To some extent, this also applies to Arabic texts:
'يوم سعيد' contains text 'يوم' using language 'en'
Things are definitely different if you work with Japanese or Chinese texts. The following query yields false:
'今日は' contains text '今' using language 'en'
For more information on Japanese tokenization, see Toshio HIRAI’s article in our wiki [1].
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Full-Text:_Japanese
How does BaseX behave if the database content is in many different languages and is correctly marked with xml:lang attributes? Does the full-text index consider this information and apply full-text indexing only to elements with a matching language?
As a simple illustration (untested): will the following code create a full-text index only for the Russian text, or for both the Russian and the English text?
db:create(
  'db-ft-ru',
  <texts>
    <text xml:lang="ru">something in Russian</text>
    <text xml:lang="en">something in English</text>
  </texts>,
  'texts',
  map { 'ftindex': true(), 'language': 'ru' }
)
If BaseX does create the full-text index for both languages (the English part of the index would then contain useless garbage), I would propose simply filtering on xml:lang attributes according to the language given in the options map alongside ftindex. This should be simpler to implement than the diversification suggested by Christian.
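As a possible workaround in the meantime, the same filtering could be done by hand before the database is created. This is a minimal sketch under my own assumptions, not something BaseX does for you: it uses a standard XQuery Update copy/modify expression to remove every text element whose language is not 'ru', and passes only the remainder to db:create with the same options as above. The path 'texts' is just a placeholder name.

```xquery
(: Sketch only: drop elements in other languages before indexing. :)
let $texts :=
  <texts>
    <text xml:lang="ru">something in Russian</text>
    <text xml:lang="en">something in English</text>
  </texts>
let $filtered :=
  copy $c := $texts
  (: also removes text elements that have no xml:lang in scope :)
  modify delete node $c/text[not(lang('ru'))]
  return $c
return db:create(
  'db-ft-ru',
  $filtered,
  'texts',
  map { 'ftindex': true(), 'language': 'ru' }
)
```

Unlike the proposed index-level filter, this removes the English text from the database altogether, so it only helps if a separate database per language is acceptable.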
Best regards Kristian K