strange behavior of ft:mark() and ft:extract() - BaseX-Talk - mailman.uni-konstanz.de

13 Dec 2012


ft:mark() and ft:extract() stop working as soon as the full-text result passes through any intermediate looping construct, at least in BaseX 7.3. For example:
...
create db ftex "<r><a>text example string</a></r>"
xquery ft:mark(db:open('ftex')/descendant::text()[. contains text 'example'])
text <mark>example</mark> string
...
xquery db:open('ftex')/descendant::text()[. contains text 'example'] ! ft:mark(.)
text example string
Notice that once the nodes pass through a loop (here the ! operator), ft:mark() no longer marks anything.
The same thing happens when looping over the results of ft:search():
...
create index fulltext
Index 'FULLTEXT' created in 7.63 ms.
...
xquery for $n in ft:search('ftex','example') return ft:mark($n)
text example string
However, this works if there is no intermediate variable binding (implicit copying), even if we do some XPath navigation:
...
xquery ft:mark(ft:search('ftex','example')/..)
<a>text <mark>example</mark> string</a>
Query executed in 0.96 ms.
...
xquery ft:mark(ft:search('ftex','example')/../..)
<r>
  <a>text <mark>example</mark> string</a>
</r>
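However (this is a separate bug), when we navigate all the way up to the document-node(), the parent/child ordering is messed up: the marked text ends up after the root element instead of inside it: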
...
xquery ft:mark(ft:search('ftex','example')/../../..)
<r>
  <a>text </a>
</r>
<mark>example</mark> string
...
xquery ft:mark(ft:search('ftex','example')/ancestor::document-node())
<r>
  <a>text </a>
</r>
<mark>example</mark> string
It seems to me that there is probably some hidden metadata attached to the text node which is not preserved by implicit copying, such as in loops. (Although I'm not sure what's going on with the document-node() example.)
One example of where this is a big headache is when we want to know which document a text match came from. Here is a natural implementation:
QUERY:
for $n in ft:search('ftex','example')
return <r doc="{document-uri($n/ancestor::document-node())}">{ft:mark($n)}</r>
RESULT: <r doc="ftex/ftex.xml">text example string</r>
However, in this case the text will not be marked.
We can't go the other way and mark first either, because the "marked" result does not have a reference to the document-node() (which is understandable, as there's probably a copy/modify/return transform under there):
QUERY:
for $n in ft:mark(ft:search('ftex','example')/..)
return <r doc="{document-uri($n/ancestor::document-node())}">{$n/(* | text())}</r>
RESULT:
<r doc="">text <mark>example</mark> string</r>
The workaround I used is this:
QUERY:
let $matches := ft:search('ftex','example')
for $n at $i in ft:mark(ft:search('ftex','example')/..)
return <r doc="{document-uri($matches[$i]/ancestor::document-node())}">{$n/(* | text())}</r>
This is extremely ugly, and shares with the last example the downside that I need to mark the *parent* node of the matched text node and then carefully unbox it in the return clause. Quite a "gotcha"!
In addition to being ugly, I'm also worried that it may be incorrect because I don't know for sure that the order of search will always be exactly the same (although it *seems* to be). Even if the order is stable, what if the fulltext index is updated by another session between the first call to ft:search() and the second?
I don't know what the right answer to all of this is, but the way things are seems very not-good. At the very least the problems with ft:mark() and ft:extract() should be documented with big red text!
Perhaps a better method is to have a function that returns a data structure containing the matched text node (as a reference, so that node references are retained) *and* the matching substrings, explicitly and separately. E.g. a ft:search-with-info('ftex','string example') could return:
(map { 'text' := text('test example string'), 'substrings' := (5,8, 14,6) }, ...)
or maybe this:
(map { 'text' := text('test example string'), 'substrings' := (<s start="5" length="8"/>, <s start="14" length="6"/> ) }, ...)
or maybe just a sequence of alternating items:
(text('test example string'), <matchinfo><s start="5" length="8"/><s start="14" length="6"/></matchinfo>, ...)
Unfortunately I don't see how we can return a simple sequence of elements since the text node result would have to be a copy.
With the results separated we could easily make a family of functions that accept the same data structure and do mark()- and extract()-like things with it. The pairs can be processed either with a tumbling window or with a for loop with a "where $i mod 2" in it.
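
To make the alternating-items idea concrete, here is a minimal sketch of a mark()-like function built on top of such a structure, using a tumbling window to pair each text node with its match info. Nothing here exists in BaseX: local:mark-text() is a hypothetical helper, the input sequence is hand-built to stand in for the output of the proposed ft:search-with-info(), and the offsets are assumed to be 1-based (as in fn:substring), so they are my own numbers rather than the ones in the examples above. It relies only on standard XQuery 3.0, which I believe BaseX 7.3 supports, but I have not run it there.

```xquery
xquery version "3.0";

(: Hypothetical helper: wraps each (start, length) range of $text in <mark>.
   Assumes 1-based offsets and non-overlapping ranges. :)
declare function local:mark-text(
  $text   as xs:string,
  $ranges as element(s)*
) as node()* {
  let $sorted := for $s in $ranges order by xs:integer($s/@start) return $s
  return (
    for $s at $i in $sorted
    let $start := xs:integer($s/@start)
    let $len   := xs:integer($s/@length)
    (: position just past the previous match, or 1 for the first match :)
    let $prev-end :=
      if ($i = 1) then 1
      else xs:integer($sorted[$i - 1]/@start) + xs:integer($sorted[$i - 1]/@length)
    return (
      text { substring($text, $prev-end, $start - $prev-end) },
      <mark>{ substring($text, $start, $len) }</mark>
    ),
    (: whatever is left after the last match :)
    let $last := $sorted[last()]
    let $tail-start :=
      if (empty($last)) then 1
      else xs:integer($last/@start) + xs:integer($last/@length)
    return text { substring($text, $tail-start) }
  )
};

(: Made-up input in the alternating text/matchinfo form proposed above. :)
let $results := (
  text { 'test example string' },
  <matchinfo><s start="6" length="7"/><s start="14" length="6"/></matchinfo>
)
(: Each window holds one text node followed by its <matchinfo>. :)
for tumbling window $pair in $results
  start at $i when $i mod 2 = 1
  end at $j when $j mod 2 = 0
let $text := $pair[1]
let $info := $pair[2]
return <r>{ local:mark-text(string($text), $info/s) }</r>
```

In the real thing $text would be the original database node rather than a constructed one, so document-uri($text/ancestor::document-node()) could be used in the same return clause. The for-loop variant would iterate with "for $t at $i in $results where $i mod 2 = 1" and fetch the info with $results[$i + 1].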
-- 
Francis Avila