Hi Graydon,
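It’s a good idea to use the window clause, as the number of mark elements that need to be joined is not known in advance. To cover delimiters other than hyphens, you can use ft:tokenize: text between two marks that yields no tokens (hyphens, whitespace, other punctuation) is treated as part of the same term: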
for $term in ('Diverse and various', 'words… some', 'glossary-terms')
for $ft in ft:mark(db:open('DB')//*[text() contains text { $term }])
return element { name($ft) } {
  $ft/@*,
  for tumbling window $w in $ft/node()
    start when true()
    (: close the window before plain text that contains tokens, or before
       a mark that follows content containing tokens :)
    end $e next $enext when (
      $enext[not(self::mark)] and $enext[exists(ft:tokenize(.))] or
      $enext[self::mark] and $e[exists(ft:tokenize(.))]
    )
  return
    (: a window that contains marks collapses into a single mark :)
    if ($w[self::mark]) then <mark>{ string-join($w) }</mark> else $w
}
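If you want to try the window logic without a database, something like the following sketch might help. It wraps the same window clause in a function and applies it to a hand-written fragment. The function name local:join-marks and the sample sentence are made up for illustration, and the fragment only imitates what ft:mark would return; boundary-space is set to preserve so that the single spaces between adjacent mark elements are not stripped from the literal.

```xquery
declare boundary-space preserve;

(: Hypothetical helper, not part of BaseX: rebuilds $ft and merges runs of
   mark elements that are separated only by text without tokens. :)
declare function local:join-marks($ft as element()) as element() {
  element { name($ft) } {
    $ft/@*,
    for tumbling window $w in $ft/node()
      start when true()
      end $e next $enext when (
        $enext[not(self::mark)] and $enext[exists(ft:tokenize(.))] or
        $enext[self::mark] and $e[exists(ft:tokenize(.))]
      )
    return
      if ($w[self::mark]) then <mark>{ string-join($w) }</mark> else $w
  }
};

(: Hand-written stand-in for the output of ft:mark (an assumption) :)
let $marked := <p>An <mark>A</mark>-<mark>List</mark> guest paid <mark>nine</mark>-<mark>pence</mark> <mark>and</mark> <mark>six</mark>-<mark>pence</mark> today.</p>
return local:join-marks($marked)
```

The intent is that "A-List" and "nine-pence and six-pence" each end up in a single mark element, while the surrounding text is passed through unchanged. Note that trailing text without tokens (for example a full stop directly after the last mark) never closes a window, so it would be pulled into the merged mark.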
If you don’t want to rebuild your original node, you can also use the 'update' expression and modify your existing document. I have slightly rewritten the original code, but the basic idea is the same:
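for $term in ('Diverse and various', 'words… some', 'glossary-terms')
for $ft in ft:mark(db:open('DB')//*[text() contains text { $term }])
return $ft update {
  for tumbling window $w in node()
    (: each window starts at a mark element :)
    start $s when $s/self::mark
    (: ... and ends before the next mark that follows content containing
       tokens, or before plain text that contains tokens :)
    end $curr next $next when (
      exists(ft:tokenize($curr)) and exists($next/self::mark) or
      exists(ft:tokenize($next)) and empty($next/self::mark)
    )
  return (
    (: the first node becomes the merged mark, the rest is removed :)
    replace node head($w) with element mark { string-join($w) },
    delete nodes tail($w)
  )
}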
Hope this helps, Christian
On Sun, Apr 26, 2020 at 6:04 AM Graydon graydonish@gmail.com wrote:
On Sat, Apr 25, 2020 at 06:02:14PM -0400, Liam R. E. Quin scripsit:
On Sat, 2020-04-25 at 13:46 -0400, Graydon Saunders wrote:
I think I have figured out a way to connect the adjacent marked words in the phrasal term into a single mark element. I cannot convince myself that this is the right way; is there a better approach than tumbling windows?
I just search for the multi-word phrase and surround that. Enclosed is a sample from a prototype for a keyword-in-context search index for fromoldbooks.org (not yet live). Looking now, I see it's not very neat, but maybe it'll give some ideas.
It does, but alas I can't use string-join. Some of the terms have hyphens, so I'm getting <mark>A</mark>-<mark>List</mark> coming out of the full text search, which must become <mark>A-List</mark>. Plus some of the terms have the form "nine-pence and six-pence", so any solution has to be general for interstitial text nodes.
(I can't rule out any punctuation. I know there are hyphens, but don't know there are ONLY hyphens.)
Thanks!
Graydon