Sandra,
I'm glad to tell you that we have put some additional work into our query optimizer. Your query, which was using variables as index terms, should now be recognized by the compiler and evaluated by the index. You are welcome to check out the latest sources from our repository (note that the current code is still at a beta stage, so any feedback is more than welcome).
Hope this helps, Christian ___________________________
Christian Gruen Universitaet Konstanz Department of Computer & Information Science D-78457 Konstanz, Germany Tel: +49 (0)7531/88-4449, Fax: +49 (0)7531/88-3577 http://www.inf.uni-konstanz.de/~gruen
On Sun, Mar 28, 2010 at 5:30 AM, Christian Grün christian.gruen@gmail.com wrote:
Sandra,
yes, it feels necessary to put some additional work into the optimizer to support queries like the one you detailed. In a nutshell, we'll think about a generic way to rewrite the index methods to also support arguments other than strings and atomic values (...such as variables, or item sequences). While some optimizations look quite obvious on paper, it has turned out in the past that they mean a lot of work in the final compilation steps, as XQuery is much more flexible than e.g. SQL, or strictly typed languages. Still… be sure your concerns are not in vain.
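As a rough illustration of what "arguments other than strings" can look like, the join from this thread can also be written as one general comparison against a sequence of keys. The sketch below is plain XQuery and runs on inline sample data rather than the real database; the <db> wrapper and the <person> elements are invented for the example, and the <rs> text is abbreviated. It only checks that both forms select the same <person> elements on this data, and makes no claim about which BaseX versions evaluate either predicate through the attribute index.

```xquery
(: Inline sample data standing in for the database. The <db> wrapper and the
   <person> elements are assumptions made for this example only. :)
let $db :=
  <db>
    <personGrp type="match:policeNum+ship" size="3">
      <persName>
        <rs corresp="c23a2866">Corper, Jno</rs>
        <rs corresp="dlm18192024">Corper, John</rs>
        <rs corresp="c31a31061000">Corper, Jno</rs>
      </persName>
    </personGrp>
    <person key="c23a2866"/>
    <person key="dlm18192024"/>
    <person key="c31a31061000"/>
    <person key="unrelated"/>
  </db>

(: Join form: the predicate argument is a variable bound once per iteration. :)
let $joined := (
  for $rs in $db//rs[@corresp = "c31a31061000"]/../..//rs
  let $keyval := data($rs/@corresp)
  return $db//person[@key = $keyval]
)

(: Sequence form: the predicate argument is the whole sequence of keys. :)
let $keys := data($db//rs[@corresp = "c31a31061000"]/../..//rs/@corresp)
let $matched := $db//person[@key = $keys]

(: Compare the key values selected by the two forms. :)
return deep-equal(data($joined/@key), data($matched/@key))
```

Note that the sequence form returns its matches in document order and without duplicates, whereas the join form follows the order of the <rs> elements, so the two are not interchangeable in general.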
Feedback is always welcome, Christian
On Sat, Mar 27, 2010 at 11:43 PM, Sandra Maria Silcot ssilcot@unimelb.edu.au wrote:
Christian,
Your suggestion to add "let $person := //person" does speed things up considerably, down from a minute to about 10 seconds. I understand the reasons for your optimization logic. For example, when I tried to halve the no. of //person elements scanned, limiting it to males:
let $person := //person[@sex="M"]
the overhead of that check slightly increased the time needed!
However, is a large xml database and wanting to "join" documents using what are in effect unique keys such a "special case"? I am wondering if a different decision could be applied by the optimiser, fairly simply, based on db size (say the target indexed attribute or element occurs > 10,000 times)?
I would also suggest that a useful (necessary?) optimisation enhancement is that indexes should always be used when xml:id / xml:idref attributes are involved, because these must always represent unique identifiers for well formed xml. I modified my query slightly to try and target elements on their xml:id ...
let $person := //sources/*
for $rs in //rs[@corresp="om22451"]/../..//rs
let $keyval := data($rs/@corresp)
return $person[@xml:id=$keyval]
And got the same time, about 10 seconds. FYI, here is the query info result:
Result:
let $person := root()/descendant::*:sources/*
for $rs in IndexAccess(ATV,"om22451")/self::*:corresp/parent::*:rs/../../descendant::*:rs
let $keyval := data($rs/@*:corresp)
return $person[@xml:id = $keyval]
I concede that in 4-6 years, Moore's law will get this query down to 2-3 seconds, but in that time the database may well have grown similarly! Btw, I am running BaseX6.jar on a 1.8GHz Core2Duo with 2.5GB RAM, assigning the JVM 1024M: just an average kind of machine, but not too far short of our server's CPU speed.
Thanks for your reply. I have one other semi-related question regards how to address the separate documents in the db, but I'll post separately on that.
Many thanks again.
Sandra.
Sandra,
thanks for your comprehensive analysis. It's true, the BaseX query compiler optimises only static equality comparisons. If a dynamic variable is embedded in a predicate, we would have to decide at runtime whether to apply the index or not. The main reason why we don't use runtime optimisations here is that there are many cases in which sequential execution turns out to be faster (e.g. if the path to a predicate is cheap), and it's difficult to decide which variant will yield faster results. In your special case, however, it would seem quite obvious that the index would be preferable.
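To make the trade-off concrete: a sequential evaluation re-scans all <person> elements for every key, while an index answers each key with a direct lookup. The sketch below imitates the second strategy by hand with an XQuery 3.1 map, so it needs a processor supporting XQuery 3.1, which postdates this thread. It is not how BaseX implements its attribute index; the <db> wrapper, the <person> elements and the variable names are invented for illustration, and the <rs> text is abbreviated.

```xquery
xquery version "3.1";
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Inline sample data standing in for the database (assumption). :)
let $db :=
  <db>
    <personGrp type="match:policeNum+ship" size="3">
      <persName>
        <rs corresp="c23a2866">Corper, Jno</rs>
        <rs corresp="dlm18192024">Corper, John</rs>
        <rs corresp="c31a31061000">Corper, Jno</rs>
      </persName>
    </personGrp>
    <person key="c23a2866"/>
    <person key="dlm18192024"/>
    <person key="c31a31061000"/>
    <person key="unrelated"/>
  </db>

(: Hand-built lookup table from @key value to <person> element(s).
   All <person> elements are scanned once, here, instead of once per key. :)
let $index := map:merge(
  for $p in $db//person
  return map:entry(string($p/@key), $p),
  map { "duplicates": "combine" }
)

(: Each iteration is now a map lookup rather than a scan over all persons. :)
for $rs in $db//rs[@corresp = "c31a31061000"]/../..//rs
return map:get($index, string($rs/@corresp))
```

Building the map costs one full pass over the <person> elements, which is why such a strategy only pays off when many keys are looked up or the scanned set is large; for a cheap path and few keys, the plain sequential predicate may well be faster.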
Apart from that, you may try to speed up your given query by putting the //person into a variable:
let $person := //person
for $rs in //rs[@corresp="c31a31061000"]/../..//rs
let $keyval := data($rs/@corresp)
return $person[@key=$keyval]
By the way, the use of the eval() method is an interesting (implementation specific) option which didn't come to my mind before.
Feel free to ask for more, Christian
On Sat, Mar 27, 2010 at 2:28 AM, Sandra Maria Silcot ssilcot@unimelb.edu.au wrote:
Hi all,
First, thanks to the developers for a great piece of software. I am having difficulty getting an XQuery on a large database to run using indexed attributes when a "join" idiom is used. I have a large BaseX database with multiple documents. One of those contains <rs> elements as shown below, where the @corresp attribute contains values which are identical to the @key attribute on <person> elements, which live in multiple separate files:
<personGrp type="match:policeNum+ship" size="3"><persName>
<rs corresp="c23a2866">Corper, Jno (Pn:1000C)...</rs>
<rs corresp="dlm18192024">Corper, John (Pn:1000C)...</rs>
<rs corresp="c31a31061000" >Corper, Jno (Pn:1000CC)...</rs>
</persName></personGrp>
I know indexes have been built, as XPaths like these are lightning quick:
//person[@key="c23a2866"] or //rs[@corresp="dlm18192024"]
But when I do this, it is glacial (nearly 1 minute):
EG(A)
for $rs in //rs[@corresp="c31a31061000"]/../..//rs
let $keyval := data($rs/@corresp)
return //person[@key=$keyval]
I am using BaseX6.jar on XP. When I look at the query plan, the ONLY time the attribute index is used is on the //rs[@corresp="c31a31061000"] part.
I can get it to run fast and return the 3 matched <person> elements via the attribute index by using basex:eval, like this:
EG(B)
for $rs in data(//rs[@corresp="c31a31061000"]/../..//rs/@corresp)
let $s := concat("//person[@key='",$rs,"']")
return basex:eval($s)
So rather than execute this query as a "join" -- a manner which seems widespread in the XQuery world -- I must build and execute each //person[@key='string'] request "manually" to get BaseX to use its attribute index. Whilst this works, it seems a rather strange idiom to have to employ, and locks my queries into BaseX.
Is the behaviour of EG(A) by design, or is it a bug that the query optimizer is failing to recognise it can use the attribute index on the //person[@key=$keyval] part?
Any guidance much appreciated. Best wishes to all, Sandra.