Sorry for being slow to follow up on this. I just ran your list through the
stemming function in MarkLogic--results are attached. The original singular and
plural forms are recorded in @a and @b, whereas the corresponding stemmed
forms are recorded in <a> and <b>.
The results look pretty good to me, with some odd exceptions: e.g., "celli"
is stemmed to "celly," "timpani" is stemmed to "timpani tympanum," and
"men" is stemmed to both "man" and "cocksman"--shaking my head on that one!
--
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research
Yale University Library
On Thu, Apr 14, 2022 at 5:06 AM Christian Grün <christian.gruen@gmail.com> wrote:
> Dear Tim,
>
> It’s exactly as Bridger said: The Porter algorithm is one of the
> fastest available, but it doesn’t handle various edge cases, including
> some very common words such as (wo)?m(a|e)n, child(ren)? or
> t(oo|ee)th.
>
> We could switch to a more advanced solution. I assume that MarkLogic’s
> Bitext algorithms are only commercially available, but I remember that
> Apache Lucene provides other English stemmers as well, which I haven’t
> tested yet. The vast majority of English plural forms are regular, so we
> could also include a dictionary for the most frequent irregular forms.
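>
> A minimal sketch of such a dictionary in XQuery, assuming a small,
> hand-maintained map of irregular plurals. The variable $irregular and
> the function local:singular are hypothetical names for illustration;
> they are not part of BaseX:
>
> ```xquery
> (: Hypothetical lookup table: irregular plural -> singular form. :)
> declare variable $irregular as map(xs:string, xs:string) := map {
>   'mice': 'mouse',
>   'men': 'man',
>   'women': 'woman',
>   'children': 'child',
>   'teeth': 'tooth'
> };
>
> (: Returns the dictionary form if the lower-cased token is listed;
>    otherwise returns the lower-cased token unchanged. :)
> declare function local:singular($token as xs:string) as xs:string {
>   let $key := lower-case($token)
>   return ($irregular($key), $key)[1]
> };
>
> for $token in ('mice', 'Men', 'houses')
> return local:singular($token)
> ```
>
> Listed plurals are mapped to their singular form; all other tokens
> pass through unchanged and would still be left to the Porter stemmer.
> How such a lookup would be wired into the full-text index is left open
> here.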
>
> I have attached a list of frequent English words with irregular plural
> forms to this mail. I would be interested to learn if all of these are
> correctly stemmed by MarkLogic. Would you like to give it a try?
>
> Thanks & cheers,
> Christian
> __________________________________
>
> addendum addenda
> alumnus alumni
> analysis analyses
> antithesis antitheses
> apex apices
> appendix appendices
> axis axes
> bacillus bacilli
> bacterium bacteria
> basis bases
> beau beaux
> bureau bureaux
> cactus cacti
> cello celli
> château châteaux
> cherub cherubim
> child children
> codex codices
> concerto concerti
> corpus corpora
> crisis crises
> criterion criteria
> curriculum curricula
> datum data
> diagnosis diagnoses
> die dice
> dwarf dwarves
> ellipsis ellipses
> erratum errata
> fez fezzes
> focus foci
> foot feet
> fungus fungi
> genus genera
> goose geese
> graffito graffiti
> half halves
> hippopotamus hippopotami
> hoof hooves
> hypothesis hypotheses
> index indices
> kibbutz kibbutzim
> lemma lemmata
> libretto libretti
> loaf loaves
> locus loci
> louse lice
> man men
> matrix matrices
> medium media
> memorandum memoranda
> mouse mice
> nucleus nuclei
> oasis oases
> opus opera
> ovum ova
> ox oxen
> parenthesis parentheses
> phenomenon phenomena
> phylum phyla
> polyhedron polyhedra
> quiz quizzes
> radius radii
> referendum referenda
> scarf scarves
> schema schemata
> self selves
> stigma stigmata
> stimulus stimuli
> stratum strata
> syllabus syllabi
> symposium symposia
> synopsis synopses
> tableau tableaux
> thesis theses
> thief thieves
> timpano timpani
> tooth teeth
> uterus uteri
> vertex vertices
> vortex vortices
> wharf wharves
> wife wives
> wolf wolves
> woman women
> __________________________________
>
>
> On Thu, Apr 14, 2022 at 12:14 AM Bridger Dyson-Smith <bdysonsmith@gmail.com> wrote:
> >
> > Tim -
> >
> > On Wed, Apr 13, 2022, 4:53 PM Tim Thompson <timathom@gmail.com> wrote:
> >>
> >> Thanks, Bridger--that's very helpful! I'm not sure what MarkLogic is
> >> using exactly, but it seems fairly sophisticated (there's even an advanced
> >> option for multiple stemming: e.g., "further" has "far," "farther,"
> >> "further" as stems).
> >
> >
> > Indeed, MarkLogic appears to offer a number of stemmers, some of which
> > offer lemmatization functionality. I couldn't say if adding this type of
> > capability to BaseX is feasible.
> >>
> >>
> >> All best,
> >> Tim
> >>
> >
> > Best,
> > Bridger
> >
> >>
> >> --
> >> Tim A. Thompson (he, him)
> >> Librarian for Applied Metadata Research
> >> Yale University Library
> >>
> >>
> >>
> >> On Wed, Apr 13, 2022 at 12:13 PM Bridger Dyson-Smith <bdysonsmith@gmail.com> wrote:
> >>>
> >>> Hi Tim -
> >>>
> >>> On Wed, Apr 13, 2022 at 11:40 AM Tim Thompson <timathom@gmail.com> wrote:
> >>>>
> >>>> I'm currently involved in a project that's using MarkLogic, and I
> >>>> noticed that its implementation of English-language stemming differs from
> >>>> that of BaseX: e.g., "mouse" and "mice" both stem to "mouse."
> >>>>
> >>>> In BaseX, those words are stemmed separately. Is this a known
> >>>> limitation of the internal English syntax parser?
> >>>>
> >>> It's my (admittedly, *VERY*) limited understanding that the BaseX
> >>> stemmer, at least for English, is limited to the Porter Stemmer[1]. The
> >>> Porter Stemmer just stems, and doesn't handle stemming from plurals to
> >>> singulars in the case of apophonic plurals.
> >>>
> >>> It'd be interesting to learn what stemmer(s) MarkLogic uses.
> >>>
> >>> And, while I'm not that familiar with it (and it would probably entail
> >>> significant work to implement), the `ft:thesaurus()` function provides
> >>> similar functionality:
> >>> ```
> >>> ft:thesaurus(
> >>>   <thesaurus>
> >>>     <entry>
> >>>       <term>mice</term>
> >>>       <synonym>
> >>>         <term>mouse</term>
> >>>         <relationship>NT</relationship>
> >>>       </synonym>
> >>>       <synonym>
> >>>         <term>rodent</term>
> >>>         <relationship>BTG</relationship>
> >>>       </synonym>
> >>>     </entry>
> >>>   </thesaurus>,
> >>>   'mice'
> >>> )
> >>> ```
> >>>
> >>>>
> >>>> Example:
> >>>>
> >>>> db:create("stem-test",
> >>>>   <data>
> >>>>     <x>mouse</x>
> >>>>     <y>mice</y>
> >>>>   </data>,
> >>>>   "data",
> >>>>   map {"ftindex": true(), "stemming": true(), "language": "en"}
> >>>> ),
> >>>> update:output(
> >>>>   ft:search("stem-test", "mice")
> >>>> )
> >>>>
> >>>>
> >>>> Thanks,
> >>>> Tim
> >>>>
> >>>>
> >>>
> >>> Best,
> >>> Bridger
> >>>
> >>> [1] https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa...
> >>>
> >>>
> >>>> --
> >>>> Tim A. Thompson (he, him)
> >>>> Librarian for Applied Metadata Research
> >>>> Yale University Library
> >>>>
>