Hi Tim,

On Wed, Apr 13, 2022 at 11:40 AM Tim Thompson <timathom@gmail.com> wrote:
I'm currently involved in a project that's using MarkLogic, and I noticed that its implementation of English-language stemming differs from that of BaseX: e.g., "mouse" and "mice" both stem to "mouse."

In BaseX, those words are stemmed separately. Is this a known limitation of the internal English syntax parser?

It's my (admittedly *very* limited) understanding that the BaseX stemmer, at least for English, is the Porter stemmer[1]. The Porter stemmer only strips suffixes by rule, so it doesn't map apophonic (vowel-changing) plurals such as "mice" back to their singular forms.
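One way to check this without building a database might be to run individual words through `ft:tokenize()` with stemming switched on. This is only a sketch: the word list is my own, and I'm assuming the `stemming` and `language` options behave here the same way they do for an index. If the stemmer is purely suffix-based, I'd expect "cats" and "cat" to end up with the same token while "mouse" and "mice" do not.

```xquery
(: Sketch: show the stemmed token(s) BaseX produces for each word. :)
(: The word list is illustrative and not taken from the original thread. :)
for $word in ('mouse', 'mice', 'cat', 'cats')
let $tokens := ft:tokenize($word, map { 'stemming': true(), 'language': 'en' })
return $word || ' -> ' || string-join($tokens, ' ')
```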

It'd be interesting to learn what stemmer(s) MarkLogic uses.

And, while I'm not that familiar with it (and it would probably take significant work to set up), the `ft:thesaurus()` function might give you similar functionality:
```
ft:thesaurus(
  <thesaurus>
    <entry>
      <term>mice</term>
      <synonym>
        <term>mouse</term>
        <relationship>NT</relationship>
      </synonym>
      <synonym>
        <term>rodent</term>
        <relationship>BTG</relationship>
      </synonym>
    </entry>
  </thesaurus>,
  'mice'
)
```
 
Example:

```
db:create("stem-test",
  <data>
    <x>mouse</x>
    <y>mice</y>
  </data>,
  "data",
  map { "ftindex": true(), "stemming": true(), "language": "en" }
),
update:output(
  ft:search("stem-test", "mice")
)
```


Thanks,
Tim



Best,
Bridger

[1]  https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa51/basex-core/src/main/java/org/basex/util/ft/EnglishStemmer.java


--
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research
Yale University Library