Hi, I have another question regarding fulltext search performance with wildcards. Our fulltext index is rather big with 1.3 GB and about 500.000 entries. When I use the following search where there are no matches: *ft:search('db', 'CT020/.*/P', map{'wildcards':true()})* it takes about 10 seconds to get the empty result. The query requires a big amount of free memory. If I use this query here (that also has no matches): *ft:search('db', 'CT.*/20/P', map{'wildcards':true()})* it only takes about some milliseconds. Any ideas why the placement of the wildcard has such big impact on the performance? Best Regards Johannes
Hi Johannes,
It was a courageous decision by the writers of the XQFT spec to include support for wildcards. The resulting semantics turned out to be very sophisticated and sometimes counter-intuitive (with all the gory details being left to the implementors).
Wildcards search will be applied to the tokenized stream of input strings and not to the full string, as one might expect. To start with, see the next three queries:
'a/b' contains text 'a/b' 'a/b' contains text 'a b' 'a/b' contains text 'a.*b'
They all yield true because both the input and query strings are tokenized to 'a' and 'b'. If wildcard search is enabled, …
'a/b' contains text 'a.*b' using wildcards
…you will get false, as none of the input tokens 'a' and 'b' matches the single wildcard token 'a.*b'. Accidentally, the next query is successful, …
'a/b' contains text 'a/.*b' using wildcards
…as '/' is treated as non-letter character both in the input and query string, and is simply ignored. The resulting query tokens will be 'a' and '.*b' instead. 'a' matches 'a' and '.*b' matches 'b', so the input token is accepted.
If the full-text index is used, it is expensive to process wildcards that start with '.*', as all tokens will be checked and only those that end with 'b' will be accepted. A suffix tree index could be used to speed up operations on the end of strings, but that’s currently out of scope (and it wouldn’t give you the expected results anyway).
If you want to search for strings that start with a specific substring, it’s better to put the full-text index aside and go for substring search instead. If a full database scan is too slow, you could have a look at the index:texts function [1]. The results of this function, or parts of it, can be used as input for subsequent string lookups.
Cheers, Christian
[1] https://docs.basex.org/wiki/Index_Module#index:texts
On Wed, Nov 23, 2022 at 11:48 AM Johannes Bauer johannes.bauer@tanner.de wrote:
Hi, I have another question regarding fulltext search performance with wildcards. Our fulltext index is rather big with 1.3 GB and about 500.000 entries. When I use the following search where there are no matches: *ft:search('db', 'CT020/.*/P', map{'wildcards':true()})* it takes about 10 seconds to get the empty result. The query requires a big amount of free memory. If I use this query here (that also has no matches): *ft:search('db', 'CT.*/20/P', map{'wildcards':true()})* it only takes about some milliseconds. Any ideas why the placement of the wildcard has such big impact on the performance? Best Regards Johannes
basex-talk@mailman.uni-konstanz.de