Re: [basex-talk] Fulltext Stemming

12 Jan 2016


Hi Günter,
You can play around with ft:normalize to see how the stemmer works:
for $term in ('grün', 'Grüße')
  return <norm original='{ $term }'>{
    ft:normalize($term, map {
      'language': 'German',
      'stemming': true()
    })
  }</norm>
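A variant of the same query, as a rough sketch, can show whether diacritics handling changes the normalized form. It assumes that ft:normalize accepts a 'diacritics' option with the values 'sensitive' and 'insensitive', as documented for ft:tokenize; the extra sample terms are taken from the results reported in the original question, and the output may differ between BaseX versions:

```xquery
(: Sketch: compare stemmed forms with and without diacritics sensitivity. :)
(: Assumption: the 'diacritics' option is supported by your BaseX version. :)
for $term in ('grün', 'grüne', 'Gruß', 'Grüße', 'Gruner')
for $diacritics in ('sensitive', 'insensitive')
return <norm original='{ $term }' diacritics='{ $diacritics }'>{
  ft:normalize($term, map {
    'language': 'German',
    'stemming': true(),
    'diacritics': $diacritics
  })
}</norm>
```

If two different terms come out with the same normalized string, they would presumably be treated as matches for each other in a stemmed full-text query.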
Here is a link to our German stemmer implementation [1]. Maybe there’s
some chance to extend it with a more sophisticated algorithm?
Suggestions are welcome.
Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
On Sun, Jan 10, 2016 at 5:54 PM, kleist kleist@mail.dunzwolff.de wrote:
...
Hi all,
With the query options:
contains text "grün" all using diacritics sensitive using stemming using language 'German'
I get "grüne", "grüner", "Grüns", etc.
But I also get "Gruß", "Grüße", "grüßen", or something like "Gruner" (a proper name).
Does the stemmer have problems with German umlauts and "ß"?
With the query options:
contains text "grün" all using stemming using language 'German'
I get the same kinds of results, but fewer. Why is that?
The indexes of the database are the following:
Indexes
 Up-to-date: true
 TEXTINDEX: true
 ATTRINDEX: true
 FTINDEX: true
 TEXTINCLUDE:
 ATTRINCLUDE:
 FTINCLUDE:
 LANGUAGE: German
 STEMMING: true
 CASESENS: true
 DIACRITICS: true
 STOPWORDS:
 UPDINDEX: false
 AUTOOPTIMIZE: false
 MAXCATS: 100
 MAXLEN: 96
 INDEXSPLITSIZE: 0
 FTINDEXSPLITSIZE: 0
The optimized query is the following:
Optimized Query:
let $hits_137 := ft:mark(
  (db:open-pre("kleist-searchindex",147544), ...)
  /(((descendant::tei:s union descendant::tei:l)))
  [descendant::text() contains text "grün" all using diacritics sensitive using stemming using language 'German']
)
return element div {
  (
    element p { ("grün") },
    element p { (count($hits_137)) },
    element ul {
      (for $hit_139 in $hits_137
       return element li {
         ($hit_139/descendant-or-self::node()/ancestor::tei:TEI/descendant-or-self::node()/tei:classCode[position() = 1]/string(), $hit_139)
       })
    }
  )
}
Best, Günter
