Hi Günter,
You can play around with ft:normalize to see how the stemmer works:
  for $term in ('grün', 'Grüße')
  return <norm original='{ $term }'>{
    ft:normalize($term, map { 'language': 'German', 'stemming': true() })
  }</norm>
Here is a link to our German stemmer implementation [1]. Maybe there’s some chance to extend it with a more sophisticated algorithm? Suggestions are welcome.
Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
On Sun, Jan 10, 2016 at 5:54 PM, kleist kleist@mail.dunzwolff.de wrote:
Hi all,
With the query options
  contains text "grün" all using diacritics sensitive using stemming using language 'German'
I get "grüne", "grüner", "Grüns", etc.
But I also get "Gruß", "Grüße", "grüßen", or something like "Gruner" (a proper name).
Does the stemmer have problems with German umlauts and "ß"?
With the query options
  contains text "grün" all using stemming using language 'German'
I get the same results, but fewer of them. Why is that?
The indexes of the database are the following:
  Indexes
    Up-to-date: true
    TEXTINDEX: true
    ATTRINDEX: true
    FTINDEX: true
    TEXTINCLUDE:
    ATTRINCLUDE:
    FTINCLUDE:
    LANGUAGE: German
    STEMMING: true
    CASESENS: true
    DIACRITICS: true
    STOPWORDS:
    UPDINDEX: false
    AUTOOPTIMIZE: false
    MAXCATS: 100
    MAXLEN: 96
    INDEXSPLITSIZE: 0
    FTINDEXSPLITSIZE: 0
The optimized query is the following:
  let $hits_137 := ft:mark((db:open-pre("kleist-searchindex",147544), ...)/(((descendant::tei:s union descendant::tei:l)))[descendant::text() contains text "grün" all using diacritics sensitive using stemming using language 'German'])
  return element div {
    (element p { ("grün") },
     element p { (count($hits_137)) },
     element ul { (for $hit_139 in $hits_137 return element li { ($hit_139/descendant-or-self::node()/ancestor::tei:TEI/descendant-or-self::node()/tei:classCode[position() = 1]/string(), $hit_139) }) })
  }
Best, Günter