Adding a text() step to the predicate does dramatically reduce the execution time.
This query:
declare namespace marc="http://www.loc.gov/MARC21/slim";
for $m in collection( "latin_hold_20150730" )/marc:collection/marc:record, $r in collection( "latin_hold_20150730" )/root/row[BIB_ID/text() = $m/marc:controlfield[@tag = '001']] group by $key := $r/ITEM_ID
return <test n="{ $key }">{ $m/marc:datafield[@tag = '245']/marc:subfield[@code = 'a']/text() }</test>
Executes in 257.13 ms. But it shouldn't be necessary to explicitly specify the text() node, right (or would even be bad practice to do so[1])?
Thanks, Tim
[1] http://stackoverflow.com/questions/12970162/xquery-join-query#comment-175921...
-- Tim A. Thompson Metadata Librarian (Spanish/Portuguese Specialty) Princeton University Library
On Tue, Aug 4, 2015 at 8:25 AM, Tim Thompson timathom@gmail.com wrote:
I was mistaken; I hadn't enabled indexes on this database. However, after creating attribute and text indexes, the query actually seems to take longer to execute (483622.95 ms on last run), although the query plan itself doesn't seem to have changed:
Compiling:
- pre-evaluating collection("latin_hold_20150730")
- pre-evaluating collection("latin_hold_20150730")
Query: declare namespace marc="http://www.loc.gov/MARC21/slim"; for $m in collection( "latin_hold_20150730" )/marc:collection/marc:record, $r in collection( "latin_hold_20150730" )/root/row[BIB_ID = $m/marc:controlfield[@tag = '001']] group by $key := $r/ITEM_ID return <test n="{ $key }">{ $m/marc:datafield[@tag = '245']/marc:subfield[@code = 'a']/string() }</test> Optimized Query: for $m_0 in (db:open-pre("latin_hold_20150730",0), ...)/marc:collection/marc:record for $r_1 in (db:open-pre("latin_hold_20150730",0), ...)/root/row[(BIB_ID = $m_0/marc:controlfield[(@tag = "001")])] let (: post-group :) $m_4 := $m_0 group by $key_2 := $r_1/ITEM_ID return element test { (attribute n { ($key_2) }, $m_4/marc:datafield[(@tag = "245")]/marc:subfield[(@code = "a")]/string()) } Result:
- Hit(s): 7587 Items
- Updated: 0 Items
- Printed: 505 KB
- Read Locking: local [latin_hold_20150730]
- Write Locking: none
Timing:
- Parsing: 1.04 ms
- Compiling: 0.86 ms
- Evaluating: 483604.85 ms
- Printing: 16.2 ms
- Total Time: 483622.95 ms
Query plan:
<QueryPlan compiled="true"> <GFLWOR> <For> <Var name="$m" id="0"/> <IterPath> <DBNodeSeq size="2"> <DBNode name="latin_hold_20150730" pre="0"/> <DBNode name="latin_hold_20150730" pre="365692"/> </DBNodeSeq> <IterStep axis="child" test="marc:collection"/> <IterStep axis="child" test="marc:record"/> </IterPath> </For> <For> <Var name="$r" id="1"/> <IterPath> <DBNodeSeq size="2"> <DBNode name="latin_hold_20150730" pre="0"/> <DBNode name="latin_hold_20150730" pre="365692"/> </DBNodeSeq> <IterStep axis="child" test="root"/> <IterStep axis="child" test="row"> <CmpG op="="> <CachedPath> <IterStep axis="child" test="BIB_ID"/> </CachedPath> <IterPath> <VarRef> <Var name="$m" id="0"/> </VarRef> <IterStep axis="child" test="marc:controlfield"> <CmpG op="="> <CachedPath> <IterStep axis="attribute" test="tag"/> </CachedPath> <Str value="001" type="xs:string"/> </CmpG> </IterStep> </IterPath> </CmpG> </IterStep> </IterPath> </For> <GroupBy> <Spec> <Var name="$key" id="2"/> <IterPath> <VarRef> <Var name="$r" id="1"/> </VarRef> <IterStep axis="child" test="ITEM_ID"/> </IterPath> </Spec> </GroupBy> <CElem> <QNm value="test" type="xs:QName"/> <CAttr> <QNm value="n" type="xs:QName"/> <VarRef> <Var name="$key" id="2"/> </VarRef> </CAttr> <MixedPath> <VarRef> <Var name="$m" id="4"/> </VarRef> <IterStep axis="child" test="marc:datafield"> <CmpG op="="> <CachedPath> <IterStep axis="attribute" test="tag"/> </CachedPath> <Str value="245" type="xs:string"/> </CmpG> </IterStep> <IterStep axis="child" test="marc:subfield"> <CmpG op="="> <CachedPath> <IterStep axis="attribute" test="code"/> </CachedPath> <Str value="a" type="xs:string"/> </CmpG> </IterStep> <FnString name="string([item])"/> </MixedPath> </CElem> </GFLWOR> </QueryPlan>
-- Tim A. Thompson Metadata Librarian (Spanish/Portuguese Specialty) Princeton University Library
On Tue, Aug 4, 2015 at 3:23 AM, Christian Grün christian.gruen@gmail.com wrote:
Dear Tim,
The query plan indicates that no index is applied. Your query may be evaluated faster when rewriting "BIB_ID" to "BIB_ID/text()". I will see if this can automatically be done by the query compiler.
Best, Christian
On Tue, Aug 4, 2015 at 4:42 AM, Tim Thompson timathom@gmail.com wrote:
Hello,
I have a database containing two resources/documents: they both
represent
the same set of library catalog records (7728 "records" in each), but
they
each contain different data that I want to join.
The first resource looks like this:
<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim"> <marc:record> <marc:leader>01225cam a2200373Mi 4500</marc:leader> <marc:controlfield tag="001">5323084</marc:controlfield> <marc:datafield ind1="1" ind2="4" tag="245"> <marc:subfield code="a">Els teleclubs a les illes
Balears
:</marc:subfield> </marc:datafield> </marc:record> marc:record marc:leader01225cam a2200373Mi 4500</marc:leader> <marc:controlfield tag="001">5323084</marc:controlfield> <marc:datafield ind1="1" ind2="4" tag="245"> <marc:subfield code="a">Els teleclubs a les illes
Balears
:</marc:subfield> </marc:datafield> </marc:record> marc:record marc:leader00818cam a2200241Mi 4500</marc:leader> <marc:controlfield tag="001">6310976</marc:controlfield> <marc:datafield ind1="0" ind2="0" tag="245"> <marc:subfield code="a">Diccionari manual de sinònims i antònims de la llengua catalana /</marc:subfield> </marc:datafield> </marc:record> </marc:collection>
The second one looks like this:
<root> <row> <LANGUAGE>cat</LANGUAGE> <ITEM_ID>5912416</ITEM_ID> <BIB_ID>5323084</BIB_ID> <VENDOR_CODE>MXBKSMX</VENDOR_CODE> </row> <row> <LANGUAGE>cat</LANGUAGE> <ITEM_ID>5912416</ITEM_ID> <BIB_ID>5323084</BIB_ID> <VENDOR_CODE>PUVILL</VENDOR_CODE> </row> <row> <LANGUAGE>cat</LANGUAGE> <ITEM_ID>5935043</ITEM_ID> <BIB_ID>6310976</BIB_ID> <VENDOR_CODE>PUVILL</VENDOR_CODE> </row> </root>
I have a simple query that joins the two using the value of the marc:controlfield[@tag = '001'] from resource 1 and the BIB_ID from
resource
The query:
declare namespace marc="http://www.loc.gov/MARC21/slim";
for $m in collection( "latin_hold_20150730" )/marc:collection/marc:record, $r in collection( "latin_hold_20150730" )/root/row[BIB_ID = $m/marc:controlfield[@tag = '001']] group by $key := $r/ITEM_ID
return <test n="{ $key }">{ $m/marc:datafield[@tag = '245']/marc:subfield[@code = 'a']/string() }</test>
When I run this using Saxon (substituting fn:doc for fn:collection), it
only
takes a second to execute. In BaseX 8.2.3 (using the GUI), however, it
takes
a very long time to execute (around 6 minutes!). The BaseX database has
both
attribute and text indexes enabled.
Any idea what is causing it to take so long?
Here is the BaseX Query Info:
Total Time: 383756.3 ms
Compiling:
- pre-evaluating collection("latin_hold_20150730")
- pre-evaluating collection("latin_hold_20150730")
Query: declare namespace marc="http://www.loc.gov/MARC21/slim"; for $m in collection( "latin_hold_20150730" )/marc:collection/marc:record, $r in collection( "latin_hold_20150730" )/root/row[BIB_ID = $m/marc:controlfield[@tag = '001']] group by $key := $r/ITEM_ID return
<test
n="{ $key }">{ $m/marc:datafield[@tag = '245']/marc:subfield[@code = 'a']/string() }</test> Optimized Query: for $m_0 in (db:open-pre("latin_hold_20150730",0), ...)/marc:collection/marc:record for $r_1 in (db:open-pre("latin_hold_20150730",0), ...)/root/row[(BIB_ID = $m_0/marc:controlfield[(@tag = "001")])] let (: post-group :) $m_4 :=
$m_0
group by $key_2 := $r_1/ITEM_ID return element test { (attribute n { ($key_2) }, $m_4/marc:datafield[(@tag = "245")]/marc:subfield[(@code = "a")]/string()) } Result:
- Hit(s): 7587 Items
- Updated: 0 Items
- Printed: 505 KB
- Read Locking: local [latin_hold_20150730]
- Write Locking: none
Timing:
- Parsing: 0.43 ms
- Compiling: 1.96 ms
- Evaluating: 383737.5 ms
- Printing: 16.41 ms
- Total Time: 383756.3 ms
Query plan:
<QueryPlan compiled="true"> <GFLWOR> <For> <Var name="$m" id="0"/> <IterPath> <DBNodeSeq size="2"> <DBNode name="latin_hold_20150730" pre="0"/> <DBNode name="latin_hold_20150730" pre="365692"/> </DBNodeSeq> <IterStep axis="child" test="marc:collection"/> <IterStep axis="child" test="marc:record"/> </IterPath> </For> <For> <Var name="$r" id="1"/> <IterPath> <DBNodeSeq size="2"> <DBNode name="latin_hold_20150730" pre="0"/> <DBNode name="latin_hold_20150730" pre="365692"/> </DBNodeSeq> <IterStep axis="child" test="root"/> <IterStep axis="child" test="row"> <CmpG op="="> <CachedPath> <IterStep axis="child" test="BIB_ID"/> </CachedPath> <IterPath> <VarRef> <Var name="$m" id="0"/> </VarRef> <IterStep axis="child" test="marc:controlfield"> <CmpG op="="> <CachedPath> <IterStep axis="attribute" test="tag"/> </CachedPath> <Str value="001" type="xs:string"/> </CmpG> </IterStep> </IterPath> </CmpG> </IterStep> </IterPath> </For> <GroupBy> <Spec> <Var name="$key" id="2"/> <IterPath> <VarRef> <Var name="$r" id="1"/> </VarRef> <IterStep axis="child" test="ITEM_ID"/> </IterPath> </Spec> </GroupBy> <CElem> <QNm value="test" type="xs:QName"/> <CAttr> <QNm value="n" type="xs:QName"/> <VarRef> <Var name="$key" id="2"/> </VarRef> </CAttr> <MixedPath> <VarRef> <Var name="$m" id="4"/> </VarRef> <IterStep axis="child" test="marc:datafield"> <CmpG op="="> <CachedPath> <IterStep axis="attribute" test="tag"/> </CachedPath> <Str value="245" type="xs:string"/> </CmpG> </IterStep> <IterStep axis="child" test="marc:subfield"> <CmpG op="="> <CachedPath> <IterStep axis="attribute" test="code"/> </CachedPath> <Str value="a" type="xs:string"/> </CmpG> </IterStep> <FnString name="string([item])"/> </MixedPath> </CElem> </GFLWOR> </QueryPlan>
Thanks in advance!
Tim
-- Tim A. Thompson Metadata Librarian (Spanish/Portuguese Specialty) Princeton University Library