Initial observation on BaseX 6.0 release - BaseX-Talk - mailman.uni-konstanz.de

10 Jan 2010


I've been looking forward to the new release because I really want to 
use an XQuery Full Text system as the search engine for a reasonably 
large (200,000 printed-page equivalent) XML-structured, text-oriented 
database. However, my initial test, on a large fraction of the total 
database (certainly over 120,000 printed-page equivalent), yielded 
disappointing results.
The XML schema is roughly based on DocBook, so it is not unusual. 
Most of the actual text is within PARA elements.
My test query was:
//para[. ftcontains "sumter"]
On a dual quad-core Xeon system the query took about 28 seconds to run. 
This is completely unusable in my context.
I am pretty sure I know the reason for the poor performance. The 
derived query plan resolves to:
<QueryPlan>
   <IterPath>
     <Root/>
     <IterStep axis="descendant" test="*:para">
       <FTContains>
         <Context/>
         <FTWords>sumter</FTWords>
       </FTContains>
     </IterStep>
   </IterPath>
</QueryPlan>
Now, there are *millions* of PARA elements in the database - but not 
so many (hundreds) of references to the word "sumter".  And this is 
by no means an uncommon XML structure for a text-oriented system.
The problem is obvious. There is an enormous number of PARA elements 
and very few actual text hits, yet the IterStep evidently visits every 
single PARA element. The query optimizer should consult the full-text 
index first and use its results as an initial filter. An IterStep over 
only the PARAs in that filtered set would then touch hundreds of nodes 
instead of millions.
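To make the difference between the two strategies concrete, here is a minimal Python sketch. It is a toy model, not BaseX code: the sample paragraphs, the tokenizer, and the function names (scan_all, build_index, lookup) are all made up for illustration, and I am only assuming that a full-text index behaves roughly like a term-to-node lookup table.

```python
from collections import defaultdict

# Hypothetical sample data: (para_id, text) pairs standing in for PARA elements.
paras = [
    (1, "Fort Sumter was fired upon in April 1861."),
    (2, "The garrison surrendered the next day."),
    (3, "News of Sumter reached Washington quickly."),
]

def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    return [w.strip(".,;:!?\"'()").lower() for w in text.split()]

def scan_all(paras, term):
    """Scan strategy: test every paragraph, as the reported plan appears to do."""
    return [pid for pid, text in paras if term in tokenize(text)]

def build_index(paras):
    """Build a toy inverted index once: term -> set of paragraph ids."""
    index = defaultdict(set)
    for pid, text in paras:
        for tok in tokenize(text):
            if tok:
                index[tok].add(pid)
    return index

def lookup(index, term):
    """Index-first strategy: go straight to the paragraphs containing the term."""
    return sorted(index.get(term, ()))

index = build_index(paras)
hits_scan = scan_all(paras, "sumter")
hits_index = lookup(index, "sumter")
```

Both strategies return the same paragraphs, but the scan does work proportional to the total number of paragraphs on every query, while the lookup does work proportional to the number of hits once the index exists.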
Is there a way to get that kind of (properly) optimized query from 
the system, or is this hopelessly built into the current architecture?
I do hope that I get some kind of response this time, because at 
least two previous queries on earlier releases of BaseX went unanswered.