New subject: fulltext queries that span elements / reducing result sets to most specific matches

10 May 2010

      Dear Team,
I found that fulltext queries that do not retrieve text nodes are 
significantly slower than querying text nodes as in
   text()[. contains text 'sometext']
Example: There's some sample data that may contain both CO<sub>2</sub> 
and CO2. I want a query that returns both kinds of results.
With my sample data of approx. 2700 documents, 70 MB, 2.6M nodes, height 
18, a query for
   text()[. contains text 'CO']
takes approx. 5 seconds and returns approx. 900 results.
A query for
   *[. contains text 'CO']
takes 27 seconds (and returns 1600 results).
Now
   *[. contains text 'CO2']
also takes 27 seconds and returns 1300 results.
Since the subscript is properly marked up in the sample documents,
   text()[. contains text 'CO2']
returns 0 results.
Can you think of a way that will speed up fulltext queries to element 
content, i.e., the *[. contains text '...'] scenario?
--------
A related addition (thereby violating the "one topic, one mailing list 
posting" dogma slightly):
Given a document <doc><sect><p>CO<sub>2</sub></p></sect></doc>, the 
query *[. contains text 'CO2'] will contain 3 results: doc, sect, and p.
But I'm only interested in the "shortest possible" or "most specific" 
result, i.e., a result that does not contain any other element 
satisfying the same query. In above example, the shortest possible 
result will be p.
This may be achieved by the following query:
let $prelim := //*[. contains text 'CO2' ]
for $d in $prelim[every $c in * satisfies empty($c intersect $prelim)]
return <result>{ $d }</result>
The predicate [every $c in * satisfies empty($c intersect $prelim)] 
doesn't slow down the query significantly, at least with BaseX operating 
on my sample data. Thus it's important to optimize the initial //*[. 
contains text 'CO2' ] query.
Would it help if BaseX knew that we only need the innermost element 
satisfying the FT query?
So if there were, for example, an option
//*[. contains text 'CO2' using option basex:element-scope "most-specific"],
could BaseX use this in order to make use of its indexes and accelerate 
the query?
I'm not sure, because if some text occurs within a certain element way 
down the tree, a distinct occurence of same text may be found higher up 
in the tree, immediately below an ancestor element:
<a>text <b>foo <c>text</c> bar</b></a>
So it's important to look at all the elements in the tree, with no speed 
gain if you look at the bottom-most elements first.
So I think it boils down to the issue of accelerating fulltext queries 
that span element boundaries.
------
You could argue that if I knew in advance which elements to search and 
use specific elements instead of *, optimization will be easier (so do 
the qizx people, if I remember correctly).
But there are different document types in the DB (other types may be 
added at any time without prior notice), so it's hard to specify in 
advance which elements should be queried/returned.
And even if you focused your query on, e.g., para elements in DocBook 
documents, there are situations like
<para><orderedlist><listitem><para>CO<subscript>2</subscript></para></listitem></orderedlist></para>
where you cannot restrict a query to paras because you will get the 
occuring text twice, in the absence of additional filtering.
Or even worse with CALS tables in DocBook: an entry may or may not 
contain a para. Searching for paras that contain some text will return 
(1) the para that contains the whole table plus (2) a para within this 
table ONLY IF the author chose to mark up the inner para (which he 
doesn't have to).
Filtering the results or using the fictitious "most-specific" scope 
option, the result will be narrowed down to either the entry or, if 
present, the para within the entry. You don't need to filter
*[self::para[not(descendant::tgroup or descendant::orderedlist or ...)] 
or self::entry or self::simpara or ...][. contains text 'CO2']
wihch might ultimately work for DocBook but not for the general case (in 
our environment, there will be DocBook, XHTML, dozens of 
customer-specific schema variants and an unpredictable number of 
3rd-party schemas).
------
This message has become quite lengthy. I hope you don't mind.
Gerrit
-- 
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@le-tex.de, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler