Re: [basex-talk] Full-text lemmatizing and xml:lang

4 Jul 2017

Thanks. I’ll keep this proposal in mind, and think about further
implications. If we decided one day to make the full-text index
updatable (which would be a nice feature, but a lot of work), we would
probably need to reindex sub-trees with modified language attributes.
On Tue, Jul 4, 2017 at 8:32 AM, Kristian Kankainen <kristian@keeleleek.ee> wrote:
...
Yes, you are correct.
During index building, only <div xml:lang='de'>Häuser</div> is lemmatized, thus
//div[text() contains text { "houses","Häuser" }
  using language 'de'
  using stemming
]
returns only the element with Häuser. But a query without stemming and
language:
//div[text() contains text { "houses","Häuser" }]
would return both elements.
Best regards
Kristian K
On 03.07.2017 19:50, Christian Grün wrote:
...
Just to be sure I understood you correctly:
...

If STEMMING is set to true, then the input to the stemmer should be
filtered by matching the xml:lang and the LANGUAGE option. Text that is sent
to the tokenizer could be left as is and not be filtered by matching
LANGUAGE (see next point).
So you would prefer to have all words indexed, but reduce the stemming
step to the chosen language, right?
To give an example:
<xml>
   <div xml:lang='de'>Häuser</div>
   <div xml:lang='en'>houses</div>
</xml>
If stemming is enabled, and if language is 'de', the index would
include the two terms 'Haus' (stemmed German form) and 'houses'
(original English form).
The query…
//div[text() contains text { "houses","Häuser" }
   using language 'de'
   using stemming
]
…would only return the German div element (as the German stemmer rewrites
'Häuser' to 'Haus' and 'houses' to 'hou').
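
For illustration, the proposed index-building behaviour from the example above can be sketched in Python. This is only a toy model under stated assumptions, not BaseX code: all function names are made up, the "stemmer" is a two-entry lookup table holding just the mappings mentioned above ('Häuser' to 'Haus', 'houses' to 'hou'), and tokens are lowercased for simplicity.

```python
# Toy model of the proposal: every token is indexed, but only text whose
# xml:lang matches the LANGUAGE option is passed through the stemmer.
# Hypothetical names throughout; this is not how BaseX is implemented.

# Stand-in for a German stemmer (assumption): a lookup table with the two
# mappings from the example, not a real stemming algorithm.
TOY_GERMAN_STEMS = {"häuser": "haus", "houses": "hou"}


def toy_stem_de(token):
    """Return the toy German stem, or the token itself if unknown."""
    return TOY_GERMAN_STEMS.get(token, token)


def tokenize(text):
    """Lowercase and split on whitespace (a simplification)."""
    return text.lower().split()


def build_index(divs, language):
    """Map each index term to the positions of the divs containing it.

    divs is a list of (xml_lang, text) pairs. Tokens are stemmed only
    when the div's xml:lang equals the chosen index language.
    """
    index = {}
    for pos, (lang, text) in enumerate(divs):
        for token in tokenize(text):
            term = toy_stem_de(token) if lang == language else token
            index.setdefault(term, set()).add(pos)
    return index


def query_stemmed(index, words):
    """Look up words the way a query 'using language de using stemming'
    would in this model: every query word is stemmed with the German stemmer."""
    hits = set()
    for word in words:
        for token in tokenize(word):
            hits |= index.get(toy_stem_de(token), set())
    return sorted(hits)


divs = [("de", "Häuser"), ("en", "houses")]
index = build_index(divs, language="de")

print(sorted(index))                                # ['haus', 'houses']
print(query_stemmed(index, ["houses", "Häuser"]))   # [0]
```

Position 0 is the German div: the query term 'houses' becomes 'hou', which does not match the unstemmed index entry 'houses', so only the German element is found. The sketch does not model queries without stemming, which would not be answered from a stemmed index in the same way.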
