Tim -
On Wed, Apr 13, 2022, 4:53 PM Tim Thompson <timathom@gmail.com> wrote:
Thanks, Bridger, that's very helpful! I'm not sure exactly what MarkLogic is using, but it seems fairly sophisticated: there's even an advanced option for multiple stems per word (e.g., "further" has "far," "farther," and "further" as stems).
Indeed, MarkLogic appears to offer a number of stemmers, some of which offer lemmatization functionality. I couldn't say if adding this type of capability to BaseX is feasible.
All best, Tim
Best, Bridger
--
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research
Yale University Library
On Wed, Apr 13, 2022 at 12:13 PM Bridger Dyson-Smith <bdysonsmith@gmail.com> wrote:
Hi Tim -
On Wed, Apr 13, 2022 at 11:40 AM Tim Thompson <timathom@gmail.com> wrote:
I'm currently involved in a project that uses MarkLogic, and I noticed that its English-language stemming differs from that of BaseX: in MarkLogic, for example, "mouse" and "mice" both stem to "mouse."
In BaseX, those two words are stemmed separately. Is this a known limitation of the internal English stemmer?
It's my (admittedly, *VERY*) limited understanding that the BaseX stemmer, at least for English, is limited to the Porter Stemmer[1]. The Porter Stemmer only strips suffixes by rule; it has no dictionary of irregular forms, so it doesn't reduce plurals to singulars in the case of apophonic (vowel-change) plurals such as "mice."
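A quick way to see this behavior without creating a database is a sketch along these lines. It assumes BaseX's XQuery Full Text support (`contains text` with the `using stemming` and `using language` match options); the word list and the `$words` name are only illustrative:

```xquery
(: Check which word forms match the query term "mouse" when stemming is on.
   Runs on in-memory strings, so no database or full-text index is needed. :)
let $words := ('mouse', 'mouses', 'mice')
for $word in $words
let $match := $word contains text 'mouse' using stemming using language 'en'
return $word || ' matches "mouse": ' || $match
```

With a purely suffix-stripping stemmer, "mice" would not be expected to match, since no suffix rule turns it into "mouse."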
It'd be interesting to learn what stemmer(s) MarkLogic uses.
And, while I'm not that familiar with it (and it would probably entail significant work to implement), the `ft:thesaurus()` function provides similar functionality:
ft:thesaurus(
  <thesaurus>
    <entry>
      <term>mice</term>
      <synonym>
        <term>mouse</term>
        <relationship>NT</relationship>
      </synonym>
      <synonym>
        <term>rodent</term>
        <relationship>BTG</relationship>
      </synonym>
    </entry>
  </thesaurus>,
  'mice'
)
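If one wanted to use the thesaurus result at query time, a rough sketch might look like this. It assumes that `ft:thesaurus()` returns the related terms as strings (as the call above suggests), that `ft:search()` accepts a sequence of terms and matches any of them by default, and that the "stem-test" database from the example in this thread exists; `$thesaurus`, `$term`, and `$expanded` are just illustrative names:

```xquery
(: Expand a search term with its thesaurus entries, then search for any of them.
   Assumption: the "stem-test" database has already been created with a full-text index. :)
let $thesaurus :=
  <thesaurus>
    <entry>
      <term>mice</term>
      <synonym>
        <term>mouse</term>
        <relationship>NT</relationship>
      </synonym>
    </entry>
  </thesaurus>
let $term := 'mice'
(: The original term plus whatever related terms the thesaurus returns. :)
let $expanded := ($term, ft:thesaurus($thesaurus, $term))
return ft:search('stem-test', $expanded)
```

This keeps the stemmer untouched and moves the irregular forms into data, at the cost of maintaining the thesaurus entries by hand.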
Example:
db:create(
  "stem-test",
  <data>
    <x>mouse</x>
    <y>mice</y>
  </data>,
  "data",
  map { "ftindex": true(), "stemming": true(), "language": "en" }
),
update:output(
  ft:search("stem-test", "mice")
)
Thanks, Tim
Best, Bridger
[1] https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa...