Dear Team,
I found that fulltext queries that do not retrieve text nodes are significantly slower than queries on text nodes, as in text()[. contains text 'sometext'].
Example: There's some sample data that may contain both CO<sub>2</sub> and CO2. I want a query that returns both kinds of results.
With my sample data of approx. 2700 documents, 70 MB, 2.6M nodes, tree height 18, a query for text()[. contains text 'CO'] takes approx. 5 seconds and returns approx. 900 results. A query for *[. contains text 'CO'] takes 27 seconds (and returns 1600 results).
Now *[. contains text 'CO2'] also takes 27 seconds and returns 1300 results. Since the subscript is properly marked up in the sample documents, text()[. contains text 'CO2'] returns 0 results.
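To make this concrete, here are the two query shapes side by side, run against a made-up fragment (the element names are just for illustration):

(: sample data, the same token once plain and once marked up:
   <p>Emissions of CO2 rose.</p>
   <p>Emissions of CO<sub>2</sub> rose.</p> :)

(: fast, but misses the marked-up variant: :)
//text()[. contains text 'CO2'],

(: slow, but matches both variants, apparently because the
   element's whole string value is tokenized: :)
//*[. contains text 'CO2']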
Can you think of a way to speed up fulltext queries on element content, i.e., the *[. contains text '...'] scenario?
--------
A related addition (thereby violating the "one topic, one mailing list posting" dogma slightly):
Given a document <doc><sect><p>CO<sub>2</sub></p></sect></doc>, the query *[. contains text 'CO2'] will return 3 results: doc, sect, and p.
But I'm only interested in the "shortest possible" or "most specific" results, i.e., results that do not contain any other element satisfying the same query. In the above example, the shortest possible result is p.
This may be achieved by the following query:
let $prelim := //*[. contains text 'CO2']
for $d in $prelim[every $c in * satisfies empty($c intersect $prelim)]
return <result>{ $d }</result>
The predicate [every $c in * satisfies empty($c intersect $prelim)] doesn't slow down the query significantly, at least with BaseX operating on my sample data. Thus it's important to optimize the initial //*[. contains text 'CO2'] query.
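Incidentally, assuming that every ancestor of a match is itself a match (which holds here, since an ancestor's string value contains all of its descendants' tokens), the quantifier can be replaced by a set operation; a minimal sketch:

let $prelim := //*[. contains text 'CO2']
return $prelim except $prelim/ancestor::*
(: keeps exactly the innermost matches :)

Whether this form is any easier for the optimizer, I can't say.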
Would it help if BaseX knew that we only need the innermost element satisfying the FT query?
So if there were, for example, an option //*[. contains text 'CO2' using option basex:element-scope "most-specific"], could BaseX use it to exploit its indexes and accelerate the query?
I'm not sure, because if some text occurs within a certain element far down the tree, a distinct occurrence of the same text may be found higher up in the tree, immediately below an ancestor element: <a>text <b>foo <c>text</c> bar</b></a>. So it's important to look at all the elements in the tree, and there is no speed gain in looking at the bottom-most elements first.
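As a quick sketch of what the child-based filter from above yields on this document (searching for the token 'text'):

let $doc := document { <a>text <b>foo <c>text</c> bar</b></a> }
let $prelim := $doc//*[. contains text 'text'] (: a, b, and c all match :)
return $prelim[every $c in * satisfies empty($c intersect $prelim)]
(: returns only c, although a has a distinct occurrence of its own :)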
So I think it boils down to the issue of accelerating fulltext queries that span element boundaries.
------
You could argue that if I knew in advance which elements to search and used specific element names instead of *, optimization would be easier (the qizx people argue along these lines, if I remember correctly).
But there are different document types in the DB (other types may be added at any time without prior notice), so it's hard to specify in advance which elements should be queried/returned.
And even if you focused your query on, e.g., para elements in DocBook documents, there are situations like <para><orderedlist><listitem><para>CO<subscript>2</subscript></para></listitem></orderedlist></para> where you cannot restrict the query to paras: in the absence of additional filtering, you will get the occurring text twice (see the sketch below). It gets even worse with CALS tables in DocBook: an entry may or may not contain a para. Searching for paras that contain some text will return (1) the para that contains the whole table, plus (2) a para within this table ONLY IF the author chose to mark up the inner para (which he doesn't have to).
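A minimal sketch of that double hit, assuming the nested fragment above is the whole document:

(: both the outer and the inner para contain the token 'CO2',
   so the same occurrence is reported twice :)
//para[. contains text 'CO2']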
With result filtering or the fictitious "most-specific" scope option, the result is narrowed down to either the entry or, if present, the para within the entry. You don't need to write filters like *[self::para[not(descendant::tgroup or descendant::orderedlist or ...)] or self::entry or self::simpara or ...][. contains text 'CO2'], which might ultimately work for DocBook but not for the general case (in our environment, there will be DocBook, XHTML, dozens of customer-specific schema variants and an unpredictable number of 3rd-party schemas).
------
This message has become quite lengthy. I hope you don't mind.
Gerrit