Hello --
My overall goal is to take a bunch of XML, mark all the (generally phrasal) terms of art, take that modified content and mark all the (possibly phrasal) glossary terms, then remove every glossary marker that happens to be inside a term of art, and finally remove all the term-of-art markers. (Between "found all the possible glossary terms" and "have applied the glossary terms" there is an intermediate step where the list of candidate terms gets sent off for semantic approval, so the "find a term" steps and the "change the documents in which the terms are found" steps have to be distinct.)
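To make that last clean-up step concrete, here is a minimal sketch of one way it could look. This is not my actual code: the element names toa (term of art) and gloss (glossary term) and the function local:strip are hypothetical placeholders, and the sample para is made up for illustration.

```xquery
(: Hypothetical marker names: <toa> wraps a term of art, <gloss> a glossary term. :)
declare function local:strip($n as node(), $in-toa as xs:boolean) as node()* {
  typeswitch ($n)
    (: always unwrap term-of-art markers, remembering that we are inside one :)
    case element(toa) return $n/node() ! local:strip(., true())
    (: unwrap glossary markers only when nested inside a term of art :)
    case element(gloss) return
      if ($in-toa)
      then $n/node() ! local:strip(., $in-toa)
      else element gloss { $n/@*, $n/node() ! local:strip(., $in-toa) }
    (: copy every other element, recursing into its children :)
    case element() return
      element { node-name($n) } { $n/@*, $n/node() ! local:strip(., $in-toa) }
    default return $n
};

local:strip(
  <para>A <toa>very <gloss>diverse</gloss> term</toa> and a <gloss>various</gloss> word.</para>,
  false()
)
```

On that sample, the glossary marker inside the term of art should disappear along with the term-of-art marker itself, while the free-standing glossary marker around "various" is kept.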
My initial problem was marking phrasal terms. The full-text index is very fast, and it solves the "this rapidly becomes a nightmare with regular expressions, especially regular expressions with no 'whole words only' switch" problem, but it marks every word in the phrasal term individually.
I think I have figured out a way to connect the adjacent marked words in the phrasal term into a single mark element. I cannot convince myself that this is the right way; is there a better approach than tumbling windows?
(: db:create("DB",
     <para id="GUID-12354">Diverse and various words, some of which are going to be tagged for review as glossary terms.</para>,
     'test.xml', map { 'ftindex': true() }) :)

(: example phrasal term :)
let $term as xs:string := 'Diverse and various'
for $ft in (db:open('DB')//*[text() contains text { $term } phrase using case sensitive])
return <changed>{
  let $contents as node()+ :=
    ft:mark($ft[text() contains text { $term } phrase using case sensitive], 'mark')
  return element { name($contents) } {
    $contents/@*,
    (: has to handle hyphens as well as spaces :)
    for tumbling window $w in $contents/node()
      start $s when true()
      end $e previous $eprev next $enext when
        ( $enext[not(self::mark)]
          and ($enext[normalize-space()][not(matches(., '^-$'))]) )
        or ( $enext[self::mark]
          and $e[normalize-space()][not(matches(., '^-$'))] )
    return
      if ($w[self::mark])
      then <mark>{ string-join($w, '') }</mark>
      else $w
  }
}</changed>
Thanks!
Graydon