Hi Graydon,
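Using the window clause is a good idea, as the number of mark elements
that need to be joined is not known in advance. You can use ft:tokenize
to cover other delimiters: a text node that yields no tokens (a hyphen,
a space, other punctuation) is kept inside the current window: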
for $term in ('Diverse and various', 'words… some', 'glossary-terms')
for $ft in ft:mark(db:open('DB')//*[text() contains text { $term }])
return element { name($ft) } {
  $ft/@*,
  for tumbling window $w in $ft/node()
    start when true()
    end $e next $enext when (
      $enext[not(self::mark)] and $enext[exists(ft:tokenize(.))] or
      $enext[self::mark] and $e[exists(ft:tokenize(.))]
    )
  return if ($w[self::mark]) then <mark>{ string-join($w) }</mark> else $w
}
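
To try the window logic without a database, here is a minimal sketch
that runs it on an in-memory fragment. The <p> element is a made-up
stand-in for what ft:mark might return for a term like 'A-List'; the
window clause is the same as in the query above. It needs BaseX, since
ft:tokenize is a BaseX function:

```xquery
(: Hypothetical input: a hand-written stand-in for ft:mark output :)
let $ft := <p>An <mark>A</mark>-<mark>List</mark> of terms</p>
return element { name($ft) } {
  $ft/@*,
  for tumbling window $w in $ft/node()
    start when true()
    (: close the window before a non-mark node that contains tokens,
       or before a mark that follows a node containing tokens :)
    end $e next $enext when (
      $enext[not(self::mark)] and $enext[exists(ft:tokenize(.))] or
      $enext[self::mark] and $e[exists(ft:tokenize(.))]
    )
  (: windows that contain a mark are collapsed into a single mark :)
  return if ($w[self::mark]) then <mark>{ string-join($w) }</mark> else $w
}
```

The hyphen yields no tokens, so it should stay inside the window, and
the two marks should come out joined as <mark>A-List</mark>.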
If you don’t want to rebuild your original node, you can also use the
'update' expression, which applies the changes to a copy of the node.
I have slightly rewritten the original code, but the basic idea is the
same:
for $term in ('Diverse and various', 'words… some', 'glossary-terms')
for $ft in ft:mark(db:open('DB')//*[text() contains text { $term }])
return $ft update {
  for tumbling window $w in node()
    start $s when $s/self::mark
    end $curr next $next when (
      exists(ft:tokenize($curr)) and exists($next/self::mark) or
      exists(ft:tokenize($next)) and empty($next/self::mark)
    )
  return (
    replace node head($w) with element mark { string-join($w) },
    delete nodes tail($w)
  )
}
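
The 'update' variant can be checked the same way on an in-memory
fragment. This is only a sketch: the <p> element is a made-up stand-in
for what ft:mark might return for 'nine-pence and six-pence', and the
boundary-space declaration is my addition, so that the whitespace-only
text between the mark elements of the literal is kept:

```xquery
(: keep the whitespace-only text nodes between the literal mark elements :)
declare boundary-space preserve;

(: Hypothetical input: a hand-written stand-in for ft:mark output :)
let $ft := <p>Paid <mark>nine</mark>-<mark>pence</mark> <mark>and</mark> <mark>six</mark>-<mark>pence</mark> today</p>
return $ft update {
  for tumbling window $w in node()
    (: a window only starts at a mark, so leading text is left alone :)
    start $s when $s/self::mark
    end $curr next $next when (
      exists(ft:tokenize($curr)) and exists($next/self::mark) or
      exists(ft:tokenize($next)) and empty($next/self::mark)
    )
  return (
    (: the first node of the window becomes the joined mark ... :)
    replace node head($w) with element mark { string-join($w) },
    (: ... and the remaining nodes of the window are removed :)
    delete nodes tail($w)
  )
}
```

Neither the hyphens nor the single spaces yield tokens, so I would
expect one mark element containing 'nine-pence and six-pence'.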
Hope this helps,
Christian
On Sun, Apr 26, 2020 at 6:04 AM Graydon <graydonish@gmail.com> wrote:
>
> On Sat, Apr 25, 2020 at 06:02:14PM -0400, Liam R. E. Quin scripsit:
> > On Sat, 2020-04-25 at 13:46 -0400, Graydon Saunders wrote:
> > > I think I have figured out a way to connect the adjacent marked
> > > words in the phrasal term into a single mark element. I cannot
> > > convince myself that this is the right way; is there a better
> > > approach than tumbling windows?
> >
> > I just search for the multi-word phrase and surround that. Enclosed is
> > a sample from a prototype for a keyword in context search index for
> > fromoldbooks.org (not yet live). Looking now I see it's not very neat
> > but maybe it'll give some ideas.
>
> It does, but alas I can't use string-join. Some of the terms have
> hyphens, so I'm getting <mark>A</mark>-<mark>List</mark> coming out of
> the full text search, which must become <mark>A-List</mark>. Plus some
> of the terms have the form "nine-pence and six-pence", so any solution
> has to be general for interstitial text nodes.
>
> (I can't rule out any punctuation. I know there are hyphens, but don't
> know there are ONLY hyphens.)
>
> Thanks!
>
> Graydon