Hello all,
We are using BaseX 7.0.2. When using wildcards in full-text searches, we ran into some problems that seem to be related to tokenization. Our database contains these entries:
bb (aa)bb bb(cc) (aa)bb(cc)
We ran the following tests; the results are shown for each case:
1- .//value[text() contains text {'.*(bb)'} using wildcards]
returned (aa)bb and (aa)bb(cc)
2- .//value[text() contains text {'.(bb).*'} using wildcards]
returned bb(cc) and (aa)bb(cc)
3- .//value[text() contains text {'(bb)'} using wildcards]
returned (aa)bb and (aa)bb(cc) and bb(cc) and bb
so far so good, but the following case is the weird case:
4- .//value[text() contains text {'.*(bb).*'} using wildcards]
returning only (aa)bb(cc)
Can anyone explain why the behavior of the last case is different? It should be the most general case, yet it turns out to be the most exclusive one. Are we missing something, or is it a bug?
Dear Shakila,
thanks for your mail and all details.
4- .//value[text() contains text {'.*(bb).*'} using wildcards] returning only (aa)bb(cc)
This is indeed the correct answer, and it can be explained by the general process of how full-text expressions are evaluated: both the input and the query terms are fully "tokenized", i.e., split into several tokens. All non-token characters (in this case the parentheses) are interpreted as "separators", which means that your query is equivalent to
.//value[text() contains text { '.* bb .*' } using wildcards]
As a result, we have three tokens ".*", "bb" and ".*", which require at least three words in the input text to yield a result. For instance, the following two expressions return "false" and "true", respectively:
'X bb' contains text '.* bb .*' using wildcards, 'X bb X' contains text '.* bb .*' using wildcards
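The same reasoning can be applied to the four entries from the original question. The following is only a minimal sketch: the entries are passed in as plain strings rather than read from the database, and the `result` element is just a hypothetical wrapper for display. Going by the results reported above, only "(aa)bb(cc)" should yield "true", since it is the only entry that tokenizes into three words.

```xquery
(: Test each entry against the tokenized form of query 4.
   The input strings are copied from the question; the <result>
   wrapper is a made-up element used only to label the output. :)
for $s in ('bb', '(aa)bb', 'bb(cc)', '(aa)bb(cc)')
return
  <result input="{ $s }">{
    $s contains text '.* bb .*' using wildcards
  }</result>
```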
If you need to search for special characters such as parentheses, you'll probably have to resort to the XQuery functions fn:contains() or fn:matches(). Alternatively, you can first use "contains text" to speed up your query and then refine the results, as shown here:
for $v in .//value[text() contains text { '.* bb .*' } using wildcards] return $v[matches(text(), "(bb)")]
Note, however, that full-text queries that start with a wildcard will not be evaluated by the index, which means that a single fn:matches() call may well be faster.
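As a rough illustration of the fn:matches() variant, here is a sketch that runs against an inline document. The `entries` wrapper element is an assumption made for this example; only the `value` elements and their contents come from the question. Keep in mind that parentheses are grouping characters in regular expressions, so they need to be escaped with a backslash if they are to be matched literally.

```xquery
(: Hypothetical inline document standing in for the database. :)
let $doc :=
  <entries>
    <value>bb</value>
    <value>(aa)bb</value>
    <value>bb(cc)</value>
    <value>(aa)bb(cc)</value>
  </entries>
return (
  (: every value that contains "bb" anywhere in its text :)
  $doc//value[matches(text(), 'bb')],
  (: only values that contain the literal string "(aa)bb";
     the parentheses are escaped so they are not treated as a group :)
  $doc//value[matches(text(), '\(aa\)bb')]
)
```

The first expression should select all four entries, the second one only those that contain "(aa)bb".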
Hope this helps, Christian
basex-talk@mailman.uni-konstanz.de