Thanks. I’ll keep this proposal in mind, and think about further implications. If we decided one day to make the full-text index updatable (which would be a nice feature, but a lot of work), we would probably need to reindex sub-trees with modified language attributes.
On Tue, Jul 4, 2017 at 8:32 AM, Kristian Kankainen <kristian@keeleleek.ee> wrote:
Yes, you are correct.
During index building, only <div xml:lang='de'>Häuser</div> is lemmatized, thus
//div[text() contains text { "houses","Häuser" } using language 'de' using stemming ]
returns only the element with Häuser. But a query without stemming and language:
//div[text() contains text { "houses","Häuser" }]
would return both elements.
Best regards, Kristian K
On 03.07.2017 19:50, Christian Grün wrote:
To make sure I understood you correctly:
- If STEMMING is set to true, then the input to the stemmer should be filtered by matching the xml:lang and the LANGUAGE option. Text that is sent to the tokenizer could be left as is and not be filtered by matching LANGUAGE (see next point).
So you would prefer to have all words indexed, but reduce the stemming step to the chosen language, right?
To give an example:
<xml>
  <div xml:lang='de'>Häuser</div>
  <div xml:lang='en'>houses</div>
</xml>
If stemming is enabled, and if language is 'de', the index would include two terms: 'Haus' (the stemmed German form) and 'houses' (the original, unstemmed English form).
The query…
//div[text() contains text { "houses","Häuser" } using language 'de' using stemming ]
…would only return the German div element (as the German stemmer rewrites 'Häuser' to 'Haus' and 'houses' to 'hou').
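For illustration only, the behaviour discussed in this thread can be modelled with a small Python sketch. It is not BaseX code. The stemmer is a toy lookup table that contains only the two rewrites quoted above, all function names are hypothetical, and xml:lang inheritance from ancestor elements is ignored. Under those assumptions, it shows an index in which only text whose xml:lang matches the chosen language is stemmed, and how the stemmed and unstemmed queries then differ.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Toy stand-in for a German stemmer (assumption): it only knows the two
# rewrites quoted in the thread and leaves every other token unchanged.
TOY_GERMAN_STEMS = {"häuser": "haus", "houses": "hou"}


def stem_de(token):
    return TOY_GERMAN_STEMS.get(token, token)


def tokenize(text):
    # Lower-cased word tokens; a real tokenizer does considerably more.
    return [t.lower() for t in re.findall(r"\w+", text)]


def build_index(root, language, stemming=True):
    """Map each index term to the positions of the div elements containing it.

    Only text whose own xml:lang equals `language` is stemmed; all other
    text is indexed in its original (lower-cased) form.
    """
    index = {}
    for pos, div in enumerate(root.iter("div")):
        lang = div.get(XML_LANG)  # inherited xml:lang is not modelled here
        for token in tokenize(div.text or ""):
            term = stem_de(token) if stemming and lang == language else token
            index.setdefault(term, set()).add(pos)
    return index


def query(index, words, stemming):
    """Return positions of divs that match any of the query words."""
    hits = set()
    for word in words:
        for token in tokenize(word):
            term = stem_de(token) if stemming else token
            hits |= index.get(term, set())
    return sorted(hits)


DOC = "<xml><div xml:lang='de'>Häuser</div><div xml:lang='en'>houses</div></xml>"
root = ET.fromstring(DOC)

# Index built with stemming for language 'de', and one built without stemming.
stemmed_index = build_index(root, language="de")
plain_index = build_index(root, language="de", stemming=False)

# Query terms are stemmed with the German toy stemmer, as in the first query.
stemmed_hits = query(stemmed_index, ["houses", "Häuser"], stemming=True)
# No stemming on either side, as in the second query.
plain_hits = query(plain_index, ["houses", "Häuser"], stemming=False)

print(stemmed_hits, plain_hits)
```

In this model, position 0 is the German div and position 1 is the English one. The stemmed query only finds the German div, because 'houses' is rewritten to 'hou' and no such term is in the index, while the unstemmed query finds both.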