Dear Gerrit,
thanks for your accurate observations.
text()[. contains text 'CO'] takes approx. 5 seconds and returns approx. 900 results. A query for *[. contains text 'CO'] takes 27 seconds (and returns 1600 results).
Yes, we are aware of this phenomenon, and, as far as I can judge, this query will always be expensive, both for BaseX and for other implementations: if you use the asterisk-dot combination, the string value of every element, including all of its descendants, is completely materialized and then checked for the full-text terms. This is most expensive for the root node, for which the complete document is atomized.
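To make the cost difference concrete, here is a rough Python imitation of the two queries. It is only a sketch of the semantics, not BaseX code: the sample document, the helper names and the whitespace tokenizer are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the original data.
doc = ET.fromstring(
    "<root><a>CO emissions </a><b><c>no match </c><d>CO again</d></b></root>"
)

def string_value(elem):
    # Concatenate all descendant text, similar to atomizing "." in XPath.
    return "".join(elem.itertext())

def has_token(text, token):
    # Crude whitespace tokenization; a real full-text tokenizer does more.
    return token in text.split()

# Analogue of //text()[. contains text 'CO']: each text node is tested once.
text_hits = [t for t in doc.itertext() if has_token(t, "CO")]

# Analogue of //*[. contains text 'CO']: each element's string value is tested,
# so every ancestor of a matching text node is returned as well.
elem_hits = [e.tag for e in doc.iter() if has_token(string_value(e), "CO")]

# Characters touched: once per text node vs. once per enclosing element.
chars_text = sum(len(t) for t in doc.itertext())
chars_elem = sum(len(string_value(e)) for e in doc.iter())

print(len(text_hits), elem_hits, chars_text, chars_elem)
```

The element variant returns more hits because ancestors match too, and it reads the same characters once per nesting level, which is why the root is the most expensive node.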
A simple optimization is to add a [text()] predicate, so that only elements with at least one text node child are tested. This might be an answer to your second line of thought (although it is somewhat simpler than the solution you proposed):
//*[text()][. contains text 'CO']
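The effect of the extra predicate can be imitated in Python as follows. Again, this is a sketch under assumptions (hypothetical sample document and helper names, whitespace tokenization) and not how BaseX evaluates the query internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the original data.
doc = ET.fromstring(
    "<root><a>CO emissions </a><b><c>no match </c><d>CO again</d></b></root>"
)

def has_text_child(elem):
    # An element's own text nodes are elem.text plus the tails of its children.
    return bool(elem.text) or any(child.tail for child in elem)

def string_value(elem):
    # Concatenate all descendant text, similar to atomizing "." in XPath.
    return "".join(elem.itertext())

# Analogue of the [text()] predicate: drop elements without a text node child.
candidates = [e for e in doc.iter() if has_text_child(e)]

# Only the remaining candidates are atomized and checked for the term.
hits = [e.tag for e in candidates if "CO" in string_value(e).split()]

print([e.tag for e in candidates], hits)
```

In this sample, the root and the b element are skipped, and those are exactly the elements with the largest string values.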
To get interactive query times, which is probably what you need for larger XML instances, you will need to formulate queries that benefit from the full-text index, such as:
//*[text() contains text 'CO']
Indeed, as you observed, these types of queries do not span element boundaries. We have developed some first ideas on how your original query…
//*[. contains text 'CO']
…could be generalized and optimized for index access, but I guess that a clean implementation would take quite some time and more consideration.
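The element boundary issue can be illustrated with a small Python imitation that compares a per-text-node test with a string-value test. The sample document, the phrase and the helper names are assumptions chosen for illustration, and the whitespace tokenizer is much simpler than a real full-text tokenizer.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the phrase is split across an inline element.
doc = ET.fromstring("<root><p>rising <b>CO</b> emissions</p></root>")
phrase = "CO emissions"

def contains_phrase(text, phrase):
    # True if the phrase tokens occur consecutively in the text tokens.
    words, target = text.split(), phrase.split()
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )

def own_text_nodes(elem):
    # Text node children only: elem.text plus the tails of the child elements.
    return [t for t in [elem.text] + [c.tail for c in elem] if t]

# Analogue of //*[text() contains text '...']: each text node is tested alone.
per_node_hits = [
    e.tag for e in doc.iter()
    if any(contains_phrase(t, phrase) for t in own_text_nodes(e))
]

# Analogue of //*[. contains text '...']: the whole string value is tested.
string_value_hits = [
    e.tag for e in doc.iter()
    if contains_phrase("".join(e.itertext()), phrase)
]

print(per_node_hits, string_value_hits)
```

The per-node variant finds nothing here, because no single text node holds both words, while the string-value variant matches p and root.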
Feel free to ask for more,
Christian
___________________________
Christian Gruen
Universitaet Konstanz
Department of Computer & Information Science
D-78457 Konstanz, Germany
Tel: +49 (0)7531/88-4449, Fax: +49 (0)7531/88-3577
http://www.inf.uni-konstanz.de/~gruen