ft:mark() and ft:extract() cannot be used with any intermediate looping construct, at least in BaseX 7.3. For example:
create db ftex "<r><a>text example string</a></r>"
xquery ft:mark(db:open('ftex')/descendant::text()[. contains text 'example'])
text <mark>example</mark> string
xquery db:open('ftex')/descendant::text()[. contains text 'example'] ! ft:mark(.)
text example string
Notice that the use of a loop means that ft:mark() no longer works.
This doesn't work with ft:search() either:
create index fulltext
Index 'FULLTEXT' created in 7.63 ms.
xquery for $n in ft:search('ftex','example') return ft:mark($n)
text example string
However, this works if there's no intermediate variable binding (implicit copying), even if we do some XPath navigation:
xquery ft:mark(ft:search('ftex','example')/..)
<a>text <mark>example</mark> string</a>
Query executed in 0.96 ms.
xquery ft:mark(ft:search('ftex','example')/../..)
<r> <a>text <mark>example</mark> string</a> </r>
However (this is a separate bug), the document-node() parent/child ordering is messed up:
xquery ft:mark(ft:search('ftex','example')/../../..)
<r> <a>text </a> </r> <mark>example</mark> string
xquery ft:mark(ft:search('ftex','example')/ancestor::document-node())
<r> <a>text </a> </r> <mark>example</mark> string
It seems to me that probably there is some extra hidden metadata attached to the text node which is not preserved by any implicit copying, such as by loops. (Although I'm not sure what's going on with the document-node() example.)
One example of where this is a big headache is when we want to know which document a text match came from. Here is a natural implementation:
QUERY: for $n in ft:search('ftex','example') return <r doc="{document-uri($n/ancestor::document-node())}">{ft:mark($n)}</r>
RESULT: <r doc="ftex/ftex.xml">text example string</r>
However in this case the text will not be marked.
We can't even do this because the "marked" document does not have a reference to the document-node() (which is understandable, as there's probably a copy/modify/return transform under there):
QUERY: for $n in ft:mark(ft:search('ftex','example')/..) return <r doc="{document-uri($n/ancestor::document-node())}">{$n/(* | text())}</r>
RESULT: <r doc="">text <mark>example</mark> string</r>
The workaround I used is this:
QUERY:
let $matches := ft:search('ftex','example')
for $n at $i in ft:mark(ft:search('ftex','example')/..)
return <r doc="{document-uri($matches[$i]/ancestor::document-node())}">{$n/(* | text())}</r>
This is extremely ugly, and shares with the last example the downside that I need to mark the *parent* node of the matched text node and then carefully unbox it in the return clause. Quite a "gotcha"!
In addition to being ugly, I'm also worried that it may be incorrect because I don't know for sure that the order of search will always be exactly the same (although it *seems* to be). Even if the order is stable, what if the fulltext index is updated by another session between the first call to ft:search() and the second?
I don't know what the right answer to all of this is, but the way things are seems very not-good. At the very least the problems with ft:mark() and ft:extract() should be documented with big red text!
Perhaps a better method is to have a function that returns a data structure containing the matched text node (as a reference, so that node identity is retained) *and* the matching substrings, explicitly and separately. E.g. a ft:search-with-info('ftex','string example') could return:
(map { 'text' := text('text example string'), 'substrings' := (6,7, 14,6) }, ...)
or maybe this:
(map { 'text' := text('text example string'), 'substrings' := (<s start="6" length="7"/>, <s start="14" length="6"/> ) }, ...)
or maybe just a sequence of alternating items:
(text('text example string'), <matchinfo><s start="6" length="7"/><s start="14" length="6"/></matchinfo>, ...)
Unfortunately I don't see how we can return a simple sequence of elements since the text node result would have to be a copy.
With the results separated we could easily make a family of functions which accepts the same data structure and does mark() and extract()-like things with it. The pairs can be processed either with a tumbling window or a for loop with a "where $i mod 2" in it.
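As a rough illustration of that last idea, here is a sketch in plain XQuery 3.0. Everything in it is an assumption: ft:search-with-info() does not exist, so the alternating sequence is built by hand, local:mark() is a hypothetical helper, and the offsets are taken to be 1-based start/length pairs.

```xquery
xquery version "3.0";

(: Hand-built stand-in for the result of the proposed (non-existent)
   ft:search-with-info(): alternating text / <matchinfo> items. :)
declare variable $results := (
  text { 'text example string' },
  <matchinfo><s start="6" length="7"/><s start="14" length="6"/></matchinfo>
);

(: Hypothetical helper: wraps each matched substring of $text in <mark>,
   assuming 1-based, non-overlapping start/length pairs. :)
declare function local:mark(
  $text as xs:string,
  $info as element(matchinfo)
) as node()* {
  let $spans := for $s in $info/s order by xs:integer($s/@start) return $s
  return (
    for $s at $i in $spans
    let $start := xs:integer($s/@start)
    (: first position after the previous match, or 1 for the first match :)
    let $prev-end :=
      if ($i = 1) then 1
      else xs:integer($spans[$i - 1]/@start) + xs:integer($spans[$i - 1]/@length)
    return (
      text { substring($text, $prev-end, $start - $prev-end) },
      <mark>{ substring($text, $start, xs:integer($s/@length)) }</mark>
    ),
    (: whatever follows the last match :)
    let $last := $spans[last()]
    let $tail-start :=
      if (empty($last)) then 1
      else xs:integer($last/@start) + xs:integer($last/@length)
    return text { substring($text, $tail-start) }
  )
};

(
  (: Variant 1: positional for loop over the odd positions. :)
  for $item at $i in $results
  where $i mod 2 = 1
  return <hit>{ local:mark(string($item), $results[$i + 1]) }</hit>,

  (: Variant 2: tumbling window that groups the items two at a time. :)
  for tumbling window $pair in $results
    start at $from when true()
    end at $to when $to - $from = 1
  return <hit>{ local:mark(string($pair[1]), $pair[2]) }</hit>
)
```

Both variants pair each text item with the match information that follows it. Since the text item would be the original node rather than a copy, something like document-uri() could still be applied to it before marking.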
Hi Francis,
> ft:mark() and ft:extract() cannot be used with any intermediate looping construct, at least in BaseX 7.3. [...]
Good point. I was surprised to see that this has not been covered yet in our documentation. I have updated the module page and hope it’s clearer now [1] (even if I stuck with black as text color ;). The reason for this behavior is that position information can easily blow up main memory, and it’s a non-trivial optimization task to find out which position information will later be required by an expression like ft:mark() or ft:extract(). However, the behavior may change in future versions of BaseX.
The usual workaround is to use more than one full-text expression:
let $term := 'welcome'
for $ft in db:open('DB')//*[text() contains text { $term }]
return element hit {
  ft:extract($ft[text() contains text { $term }])
}
I agree that this creates redundant code and is not how it should ideally be, but at least it’s usually no bottleneck regarding performance. In most of our production applications that use "contains text" or ft:search(), the overall query code is much more complex anyway (extending across several functions), such that we are hardly confronted with this restriction, which is one of the reasons why we didn’t push the optimizations any further.
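Applied to the document-URI case from the original message, the same pattern might look roughly like the sketch below. It is untested and rests on two assumptions: that repeating the full-text expression directly inside ft:mark() is enough to make the position information available, and that the 'ftex' database is set up as in the earlier examples.

```xquery
(: Loop over the original database nodes, so the document node is still
   reachable, and repeat the full-text expression inside ft:mark(). :)
for $a in db:open('ftex')//*[text() contains text 'example']
return <r doc="{ document-uri($a/ancestor::document-node()) }">{
  (: ft:mark() is handed the parent element, so unwrap its children :)
  ft:mark($a[text() contains text 'example'])/node()
}</r>
```

The document URI is read from the original node $a rather than from the marked copy, which sidesteps the empty doc="" result shown earlier and avoids depending on two ft:search() calls returning matches in the same order.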
> Perhaps a better method is to have a function with a data structure that contains the text matched text node (as a reference, so that node references are retained) *and* matching substrings explicitly and separately. [...]
True; we could think about further splitting up the process and introducing more low-level functions that directly return position information. Our original plan was to focus on the XQuery Full Text specification, but it increasingly turns out that our users switch over to our BaseX-specific functions, as they are more straightforward to use.
Thanks for your remaining suggestions; they could be a useful resource for future extensions.
Christian
> However (this is a separate bug), the document-node() parent/child ordering is messed up:
I just noticed that I unwittingly ignored your second remark. I’ll have a look at this issue soon (it could be related to issue 588 [1], which was recently brought up by Hans-Jürgen Rennau on this list [2]).
[1] https://github.com/BaseXdb/basex/issues/588
[2] https://mailman.uni-konstanz.de/pipermail/basex-talk/2012-October/004035.htm...
basex-talk@mailman.uni-konstanz.de