Re: [basex-talk] fulltext queries that span elements / reducing result sets to most specific matches

10 May 2010


Dear Gerrit,
thanks for your accurate observations.
...
text()[. contains text 'CO']
takes approx. 5 seconds and returns approx. 900 results.
A query for
 *[. contains text 'CO']
takes 27 seconds (and returns 1600 results).
Yes, we are aware of this phenomenon, and, as far as I can judge, this
query will always be expensive, both for BaseX and for other
implementations: if you use the asterisk-dot combination, all subnodes
of every element in the document will be completely materialized and
then checked for the full-text terms. This is most expensive for the
root node (the complete document is atomized for this one).
A simple optimization is to include a [text()] predicate to only test
elements that have at least one text node as child. This might
represent a simple solution for your second line of thought (although
it's somewhat simpler than your proposed solution):
//*[text()][. contains text 'CO']
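
To illustrate the effect of the [text()] filter, here is a minimal
sketch on a small inline document. The element names and the sample
text are hypothetical and not taken from your data; the boundary-space
declaration is only there to make sure the whitespace between the
constructed elements is dropped:

```xquery
declare boundary-space strip;

(: Hypothetical sample data, for illustration only. :)
let $doc :=
  <article>
    <section>
      <para>Exposure to <chem>CO</chem> is dangerous.</para>
      <para>No match in this paragraph.</para>
    </section>
  </article>
return (
  (: every element whose complete string value contains the token :)
  <all>{
    for $e in $doc/descendant-or-self::*[. contains text 'CO']
    return name($e)
  }</all>,
  (: only elements that have at least one text node as child :)
  <filtered>{
    for $e in $doc/descendant-or-self::*[text()][. contains text 'CO']
    return name($e)
  }</filtered>
)
```

The first expression should list article, section, para and chem,
whereas the filtered variant should only keep para and chem, as
article and section have no text node children in this sample. Note
that para is still returned, because its string value includes the
text of chem.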
To get interactive query times, which is probably what you need for
larger XML instances, you will need to formulate queries that benefit
from the full-text index:
//*[text() contains text 'CO']
Indeed, as you observed, these types of queries do not span element
boundaries. We have developed some first ideas on how your original
query…
//*[. contains text 'CO']
…could be generalized and optimized for index access, but I guess that
a clean implementation would take quite some time and more
consideration.
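
The difference between the two variants regarding element boundaries
can be sketched with a phrase that is split by inline markup. Again,
the sample document and the search phrase are made up for illustration
and do not stem from your data:

```xquery
declare boundary-space strip;

(: Hypothetical sample: the phrase is interrupted by a <b> element. :)
let $doc := <doc><p>carbon <b>monoxide</b> detector</p></doc>
return (
  (: tests the string value of each element, which spans child elements :)
  count($doc//*[. contains text 'carbon monoxide']),
  (: tests each text node child on its own, so the phrase is never complete :)
  count($doc//*[text() contains text 'carbon monoxide'])
)
```

The first count should be 1 (the p element), while the second should
be 0, as none of the single text nodes ("carbon ", "monoxide",
" detector") contains the complete phrase.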
Feel free to ask for more,
Christian
___________________________
Christian Gruen
Universitaet Konstanz
Department of Computer & Information Science
D-78457 Konstanz, Germany
Tel: +49 (0)7531/88-4449, Fax: +49 (0)7531/88-3577
http://www.inf.uni-konstanz.de/~gruen
