I've been looking forward to the new release because I really want to use an XQuery Full Text system as the search engine for a reasonably large XML-structured, text-oriented database (equivalent to about 200,000 printed pages). However, my initial test, on a large fraction of the total database (certainly over 120,000 printed pages' worth), yields disappointing results.
The XML schema is roughly based on Docbook, so it is not unusual. Most of the actual text is within PARA elements.
My test query was:
//para[. ftcontains "sumter"]
On a dual quad-core Xeon system the query took about 28 seconds to run. This is completely unusable in my context.
I am pretty sure I know the reason for the poor performance. The derived query plan resolves to:
<QueryPlan>
  <IterPath>
    <Root/>
    <IterStep axis="descendant" test="*:para">
      <FTContains>
        <Context/>
        <FTWords>sumter</FTWords>
      </FTContains>
    </IterStep>
  </IterPath>
</QueryPlan>
Now, there are *millions* of PARA elements in the database - but not so many (hundreds) of references to the word "sumter". And this is by no means an uncommon XML structure for a text-oriented system.
The problem is obvious. There is an enormous number of PARA elements and very few actual text hits. The IterStep is evidently visiting every single PARA element. The query optimizer should consult the full-text index first and use those results as an initial filter. The IterStep over PARAs from that filtered set would then be highly productive.
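The two evaluation orders described above can be illustrated with a minimal Python sketch. It is not BaseX code and makes no claim about how BaseX is implemented; the sample paragraphs and the helper names (tokenize, scan_all, build_index, index_first) are hypothetical and exist only for this illustration. The scan-first plan tests every paragraph, while the index-first plan looks the term up in an inverted index and touches only the matching paragraphs.

```python
from collections import defaultdict

# Hypothetical toy data: paragraph id -> text (not taken from the real database).
paras = {
    1: "Fort Sumter was fired upon in April 1861",
    2: "The harbor was quiet that morning",
    3: "Sumter surrendered the next day",
}


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def scan_all(paras, term):
    """Scan-first plan: visit every paragraph and test its text."""
    return sorted(pid for pid, text in paras.items() if term in tokenize(text))


def build_index(paras):
    """Build an inverted index: token -> set of paragraph ids containing it."""
    index = defaultdict(set)
    for pid, text in paras.items():
        for tok in tokenize(text):
            index[tok].add(pid)
    return index


def index_first(index, term):
    """Index-first plan: one lookup, then only the matching paragraphs."""
    return sorted(index.get(term, set()))


index = build_index(paras)
print(scan_all(paras, "sumter"))
print(index_first(index, "sumter"))
```

Both functions return the same paragraph ids, but scan_all does work proportional to the total number of paragraphs on every query, whereas index_first does work proportional to the number of hits once the index has been built.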
Is there a way to get that kind of (properly) optimized query from the system, or is this hopelessly built into the current architecture?
I do hope that I get some kind of response this time, because at least two previous queries on earlier releases of BaseX went unanswered.
Dear Phil,
thanks for your e-mail; I hope you will receive this answer as, unfortunately, our last responses seem to have got lost.
Your query plan clearly indicates that the full-text index is not applied at all. First of all, I recommend rewriting your query as one of the following two variants:
[1] //para[text() ftcontains "sumter"]
[2] //para[.//text() ftcontains "sumter"]
If this does not help, don't hesitate to give us some more feedback.
All the best,
Christian
___________________________
Christian Gruen
Universitaet Konstanz
Department of Computer & Information Science
D-78457 Konstanz, Germany
Tel: +49 (0)7531/88-4449, Fax: +49 (0)7531/88-3577
http://www.inf.uni-konstanz.de/~gruen
On Sun, Jan 10, 2010 at 11:28 PM, Phil newintellectual@gmail.com wrote: