Dear Team,
I found that fulltext queries that do not retrieve text nodes are significantly slower than querying text nodes as in text()[. contains text 'sometext']
Example: There's some sample data that may contain both CO<sub>2</sub> and CO2. I want a query that returns both kinds of results.
With my sample data of approx. 2700 documents, 70 MB, 2.6M nodes, height 18, a query for text()[. contains text 'CO'] takes approx. 5 seconds and returns approx. 900 results. A query for *[. contains text 'CO'] takes 27 seconds (and returns 1600 results).
Now *[. contains text 'CO2'] also takes 27 seconds and returns 1300 results. Since the subscript is properly marked up in the sample documents, text()[. contains text 'CO2'] returns 0 results.
Can you think of a way that will speed up fulltext queries to element content, i.e., the *[. contains text '...'] scenario?
--------
A related addition (thereby violating the "one topic, one mailing list posting" dogma slightly):
Given a document <doc><sect><p>CO<sub>2</sub></p></sect></doc>, the query *[. contains text 'CO2'] will contain 3 results: doc, sect, and p.
But I'm only interested in the "shortest possible" or "most specific" result, i.e., a result that does not contain any other element satisfying the same query. In above example, the shortest possible result will be p.
This may be achieved by the following query:
let $prelim := //*[. contains text 'CO2' ] for $d in $prelim[every $c in * satisfies empty($c intersect $prelim)] return <result>{ $d }</result>
The predicate [every $c in * satisfies empty($c intersect $prelim)] doesn't slow down the query significantly, at least with BaseX operating on my sample data. Thus it's important to optimize the initial //*[. contains text 'CO2' ] query.
Would it help if BaseX knew that we only need the innermost element satisfying the FT query?
So if there were, for example, an option //*[. contains text 'CO2' using option basex:element-scope "most-specific"], could BaseX use this in order to make use of its indexes and accelerate the query?
I'm not sure, because if some text occurs within a certain element way down the tree, a distinct occurence of same text may be found higher up in the tree, immediately below an ancestor element: <a>text <b>foo <c>text</c> bar</b></a> So it's important to look at all the elements in the tree, with no speed gain if you look at the bottom-most elements first.
So I think it boils down to the issue of accelerating fulltext queries that span element boundaries.
------
You could argue that if I knew in advance which elements to search and use specific elements instead of *, optimization will be easier (so do the qizx people, if I remember correctly).
But there are different document types in the DB (other types may be added at any time without prior notice), so it's hard to specify in advance which elements should be queried/returned.
And even if you focused your query on, e.g., para elements in DocBook documents, there are situations like <para><orderedlist><listitem><para>CO<subscript>2</subscript></para></listitem></orderedlist></para> where you cannot restrict a query to paras because you will get the occuring text twice, in the absence of additional filtering. Or even worse with CALS tables in DocBook: an entry may or may not contain a para. Searching for paras that contain some text will return (1) the para that contains the whole table plus (2) a para within this table ONLY IF the author chose to mark up the inner para (which he doesn't have to).
Filtering the results or using the fictitious "most-specific" scope option, the result will be narrowed down to either the entry or, if present, the para within the entry. You don't need to filter *[self::para[not(descendant::tgroup or descendant::orderedlist or ...)] or self::entry or self::simpara or ...][. contains text 'CO2'] wihch might ultimately work for DocBook but not for the general case (in our environment, there will be DocBook, XHTML, dozens of customer-specific schema variants and an unpredictable number of 3rd-party schemas).
------
This message has become quite lengthy. I hope you don't mind.
Gerrit
Dear Gerrit,
thanks for your accurate observations.
text()[. contains text 'CO'] takes approx. 5 seconds and returns approx. 900 results. A query for *[. contains text 'CO'] takes 27 seconds (and returns 1600 results).
Yes, we are aware of this phenomena – and, as far as I can judge, this query will always be expensive, both for BaseX and other implementations: If you use the asterisk-dot combination, all subnodes of all document elements will be completely materialized and then checked for the full text terms. This will be most expensive for the root node (…the complete document is atomized for this one).
A simple optimization is to include a [text()] predicate to only test elements that have at least one text node as child. This might represent a simple solution for your second line of thought (although it's somewhat simpler than your proposed solution):
//*[text()][. contains text 'CO']
To get interactive query times – which is probably what you need for larger XML instances – you will need to formulate queries that benefit from the full text index.
//*[text() contains text 'CO']
Indeed, as you observed, these types of queries do not span element boundaries. We have developed some first ideas on how your original query…
//*[. contains text 'CO']
…could be generalized and optimized for index access, but I guess that a clean implementation would take quite some time and more consideration.
Feel free to ask for more, Christian ___________________________
Christian Gruen Universitaet Konstanz Department of Computer & Information Science D-78457 Konstanz, Germany Tel: +49 (0)7531/88-4449, Fax: +49 (0)7531/88-3577 http://www.inf.uni-konstanz.de/~gruen
basex-talk@mailman.uni-konstanz.de