Dear Tim,
It’s exactly as Bridger said: The Porter algorithm is one of the
fastest available, but it doesn’t handle various edge cases. Some of
these involve very common words, such as (wo)?m(a|e)n, child(ren)? or
t(oo|ee)th.
We could switch to a more advanced solution. I assume that MarkLogic’s
Bitext algorithms are only commercially available, but I remember that
Apache Lucene provides other English stemmers as well, which I haven’t
tested yet. The vast majority of English plural forms are regular, so we
could also include a dictionary for the most frequent irregular forms.
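As a rough sketch of the dictionary idea (an assumption about how it
could look, not an existing BaseX feature): irregular forms are looked
up in a small map, and all other tokens are passed on unchanged to the
regular stemmer. The names $irregular and local:singular are
hypothetical, and the map holds only a few sample entries.

```
(: Hypothetical sketch: map irregular plurals to their singular form.
   $irregular and local:singular are illustrative names, not BaseX API. :)
declare variable $irregular := map {
  'mice': 'mouse',
  'men': 'man',
  'women': 'woman',
  'children': 'child',
  'teeth': 'tooth'
};

(: Returns the dictionary entry if one exists; otherwise returns the
   lower-cased token, which would then go to the regular stemmer. :)
declare function local:singular($token as xs:string) as xs:string {
  let $key := lower-case($token)
  return ($irregular($key), $key)[1]
};

(: 'houses' has no entry, so it is returned unchanged. :)
for $token in ('Mice', 'teeth', 'houses')
return local:singular($token)
```

A lookup of this kind would have to run both at indexing time and at
query time, so that "mice" in a document and "mouse" in a query end up
with the same index term.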
I have attached a list of frequent English words with irregular plural
forms to this mail. I would be interested to learn if all of these are
correctly stemmed by MarkLogic. Would you like to give it a try?
Thanks & cheers,
Christian
__________________________________
addendum addenda
alumnus alumni
analysis analyses
antithesis antitheses
apex apices
appendix appendices
axis axes
bacillus bacilli
bacterium bacteria
basis bases
beau beaux
bureau bureaux
cactus cacti
cello celli
château châteaux
cherub cherubim
child children
codex codices
concerto concerti
corpus corpora
crisis crises
criterion criteria
curriculum curricula
datum data
diagnosis diagnoses
die dice
dwarf dwarves
ellipsis ellipses
erratum errata
fez fezzes
focus foci
foot feet
fungus fungi
genus genera
goose geese
graffito graffiti
half halves
hippopotamus hippopotami
hoof hooves
hypothesis hypotheses
index indices
kibbutz kibbutzim
lemma lemmata
libretto libretti
loaf loaves
locus loci
louse lice
man men
matrix matrices
medium media
memorandum memoranda
mouse mice
nucleus nuclei
oasis oases
opus opera
ovum ova
ox oxen
parenthesis parentheses
phenomenon phenomena
phylum phyla
polyhedron polyhedra
quiz quizzes
radius radii
referendum referenda
scarf scarves
schema schemata
self selves
stigma stigmata
stimulus stimuli
stratum strata
syllabus syllabi
symposium symposia
synopsis synopses
tableau tableaux
thesis theses
thief thieves
timpano timpani
tooth teeth
uterus uteri
vertex vertices
vortex vortices
wharf wharves
wife wives
wolf wolves
woman women
__________________________________
On Thu, Apr 14, 2022 at 12:14 AM Bridger Dyson-Smith
<bdysonsmith@gmail.com> wrote:
>
> Tim -
>
> On Wed, Apr 13, 2022, 4:53 PM Tim Thompson <timathom@gmail.com> wrote:
>>
>> Thanks, Bridger--that's very helpful! I'm not sure what MarkLogic is using exactly, but it seems fairly sophisticated (there's even an advanced option for multiple stemming: e.g., "further" has "far," "farther," "further" as stems).
>
>
> Indeed, MarkLogic appears to offer a number of stemmers, some of which offer lemmatization functionality. I couldn't say if adding this type of capability to BaseX is feasible.
>>
>>
>> All best,
>> Tim
>>
>
> Best,
> Bridger
>
>>
>> --
>> Tim A. Thompson (he, him)
>> Librarian for Applied Metadata Research
>> Yale University Library
>>
>>
>>
>> On Wed, Apr 13, 2022 at 12:13 PM Bridger Dyson-Smith <bdysonsmith@gmail.com> wrote:
>>>
>>> Hi Tim -
>>>
>>> On Wed, Apr 13, 2022 at 11:40 AM Tim Thompson <timathom@gmail.com> wrote:
>>>>
>>>> I'm currently involved in a project that's using MarkLogic, and I noticed that its implementation of English-language stemming differs from that of BaseX: e.g., "mouse" and "mice" both stem to "mouse."
>>>>
>>>> In BaseX, those words are stemmed separately. Is this a known limitation of the internal English syntax parser?
>>>>
>>> It's my (admittedly, *VERY*) limited understanding that the BaseX stemmer, at least for English, is limited to the Porter Stemmer[1]. The Porter Stemmer just stems, and doesn't handle stemming from plurals to singulars in the case of apophonic plurals.
>>>
>>> It'd be interesting to learn what stemmer(s) MarkLogic uses.
>>>
>>> And, while I'm not that familiar with it (and it would probably entail significant work to implement), the `ft:thesaurus()` function provides similar functionality:
>>> ```
>>> ft:thesaurus(
>>>   <thesaurus>
>>>     <entry>
>>>       <term>mice</term>
>>>       <synonym>
>>>         <term>mouse</term>
>>>         <relationship>NT</relationship>
>>>       </synonym>
>>>       <synonym>
>>>         <term>rodent</term>
>>>         <relationship>BTG</relationship>
>>>       </synonym>
>>>     </entry>
>>>   </thesaurus>,
>>>   'mice'
>>> )
>>> ```
>>>
>>>>
>>>> Example:
>>>>
>>>> db:create("stem-test",
>>>>   <data>
>>>>     <x>mouse</x>
>>>>     <y>mice</y>
>>>>   </data>,
>>>>   "data",
>>>>   map {"ftindex": true(), "stemming": true(), "language": "en"}
>>>> ),
>>>> update:output(
>>>>   ft:search("stem-test", "mice")
>>>> )
>>>>
>>>>
>>>> Thanks,
>>>> Tim
>>>>
>>>>
>>>
>>> Best,
>>> Bridger
>>>
>>> [1] https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa51/basex-core/src/main/java/org/basex/util/ft/EnglishStemmer.java
>>>
>>>