Dear Tim,
It’s exactly as Bridger said: the Porter algorithm is one of the fastest available, but it doesn’t handle various edge cases, including some very common words such as (wo)?m(a|e)n, child(ren)? or t(oo|ee)th.
We could switch to a more advanced solution. I assume that MarkLogic’s Bitext algorithms are only commercially available, but I remember that Apache Lucene provides other English stemmers as well, which I haven’t tested yet. The vast majority of English plural forms are regular, so we could also keep the current stemmer and add a dictionary for the most frequent irregular forms.
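To illustrate the dictionary idea, here is a minimal XQuery sketch. It assumes that tokens are mapped to their singular form before the regular stemmer sees them. The names $IRREGULAR and local:normalize are hypothetical and not part of BaseX, and only a handful of word pairs are included; an actual integration would more likely live inside the stemmer itself.

```xquery
(: Hypothetical lookup table: irregular plural -> singular.
   Only a few illustrative pairs are listed here. :)
declare variable $IRREGULAR := map {
  'mice': 'mouse',
  'men': 'man',
  'women': 'woman',
  'children': 'child',
  'teeth': 'tooth',
  'geese': 'goose',
  'criteria': 'criterion'
};

(: Returns the singular of a known irregular plural. Any other token is
   returned lower-cased and otherwise unchanged, so that a regular stemmer
   (e.g. Porter) can process it afterwards. :)
declare function local:normalize($token as xs:string) as xs:string {
  let $key := lower-case($token)
  return ($IRREGULAR($key), $key)[1]
};

(: yields: mouse, woman, houses :)
for $token in ('Mice', 'women', 'houses')
return local:normalize($token)
```

The lookup runs first and falls through for unknown words, so regular plurals such as "houses" are still left to the existing stemmer.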
I have attached a list of frequent English words with irregular plural forms to this mail. I would be interested to learn if all of these are correctly stemmed by MarkLogic. Would you like to give it a try?
Thanks & cheers, Christian __________________________________
addendum addenda alumnus alumni analysis analyses antithesis antitheses apex apices appendix appendices axis axes bacillus bacilli bacterium bacteria basis bases beau beaux bureau bureaux cactus cacti cello celli château châteaux cherub cherubim child children codex codices concerto concerti corpus corpora crisis crises criterion criteria curriculum curricula datum data diagnosis diagnoses die dice dwarf dwarves ellipsis ellipses erratum errata fez fezzes focus foci foot feet fungus fungi genus genera goose geese graffito graffiti half halves hippopotamus hippopotami hoof hooves hypothesis hypotheses index indices kibbutz kibbutzim lemma lemmata libretto libretti loaf loaves locus loci louse lice man men matrix matrices medium media memorandum memoranda mouse mice nucleus nuclei oasis oases opus opera ovum ova ox oxen parenthesis parentheses phenomenon phenomena phylum phyla polyhedron polyhedra quiz quizzes radius radii referendum referenda scarf scarves schema schemata self selves stigma stigmata stimulus stimuli stratum strata syllabus syllabi symposium symposia synopsis synopses tableau tableaux thesis theses thief thieves timpano timpani tooth teeth uterus uteri vertex vertices vortex vortices wharf wharves wife wives wolf wolves woman women __________________________________
On Thu, Apr 14, 2022 at 12:14 AM Bridger Dyson-Smith bdysonsmith@gmail.com wrote:
Tim -
On Wed, Apr 13, 2022, 4:53 PM Tim Thompson timathom@gmail.com wrote:
Thanks, Bridger--that's very helpful! I'm not sure what MarkLogic is using exactly, but it seems fairly sophisticated (there's even an advanced option for multiple stemming: e.g., "further" has "far," "farther," "further" as stems).
Indeed, MarkLogic appears to offer a number of stemmers, some of which offer lemmatization functionality. I couldn't say if adding this type of capability to BaseX is feasible.
All best, Tim
Best, Bridger
-- Tim A. Thompson (he, him) Librarian for Applied Metadata Research Yale University Library
On Wed, Apr 13, 2022 at 12:13 PM Bridger Dyson-Smith bdysonsmith@gmail.com wrote:
Hi Tim -
On Wed, Apr 13, 2022 at 11:40 AM Tim Thompson timathom@gmail.com wrote:
I'm currently involved in a project that's using MarkLogic, and I noticed that its implementation of English-language stemming differs from that of BaseX: e.g., "mouse" and "mice" both stem to "mouse."
In BaseX, those words are stemmed separately. Is this a known limitation of the internal English syntax parser?
It's my (admittedly, *VERY*) limited understanding that the BaseX stemmer, at least for English, is limited to the Porter Stemmer[1]. The Porter Stemmer just stems, and doesn't handle stemming from plurals to singulars in the case of apophonic plurals.
It'd be interesting to learn what stemmer(s) MarkLogic uses.
And, while I'm not that familiar with it (and it would probably entail significant work to implement), the `ft:thesaurus()` function provides similar functionality:
ft:thesaurus(
  <thesaurus>
    <entry>
      <term>mice</term>
      <synonym>
        <term>mouse</term>
        <relationship>NT</relationship>
      </synonym>
      <synonym>
        <term>rodent</term>
        <relationship>BTG</relationship>
      </synonym>
    </entry>
  </thesaurus>,
  'mice'
)
Example:
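(: Query 1: create the database with a stemmed English full-text index :)
db:create("stem-test",
  <data>
    <x>mouse</x>
    <y>mice</y>
  </data>,
  "data",
  map { "ftindex": true(), "stemming": true(), "language": "en" }
)

(: Query 2: run separately, once the database exists :)
ft:search("stem-test", "mice")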
Thanks, Tim
Best, Bridger
[1] https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa...
-- Tim A. Thompson (he, him) Librarian for Applied Metadata Research Yale University Library