Hi Christian,
I see, that makes sense. Thank you for your explanation, I think I understand better now how it works "in the background" and why it was slow before. The map is a great solution for my original query which is super fast now! Thank you for your help,
Julia
On Tue, 2019-10-29 at 03:56 +0100, Christian Grün wrote:
Hi Julia,
I’m glad to hear the enforce bug fix is working as expected.
I read through your use case, and I believe your observation with regards to the differences of 'abc' and 'def' was a good hint:
The db:attribute function yields all attribute nodes of a database that match the specified string in a first step, and reduces them to the attributes with the corresponding name in a second step. In the following (worst) case example…
<doc id='1'> <ref idref='1'/> .... 998 more times <ref idref='1'/> </doc>
…the db:attribute('db', '1', 'id') function will receive 1000 attribute nodes from the index (whis very fast). In the subsequent name tests, the targeted attributes will be requested from the database, and only one attribute will be accepted and returned as result.
In your 'abc' document, most resource ids seem to be identical, so my guess is that your index lookup leads to a large number of hits that will be checked with each db:attribute call.
Maybe your query can be sped up with a map:
let $resources := map:merge( for $r in db:open($db_name)/rdf:RDF/edm:WebResource return map:entry($r/@rdf:about, $r) ) for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation for $urn in $agg/edm:object/@rdf:resource return $resources($urn)
One more solution could be to restrict the attribute index to specific attributes (e.g., rdf:about) via the ATTRINCLUDE option.
Cheers, Christian
PS: If my explanations were too cryptical, feel free to ask for enlightenment.
On Fri, Oct 25, 2019 at 4:24 PM Beck, Julia < J.Beck@ub.uni-frankfurt.de> wrote:
Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the query is working as expected. Today, I come back to you with another challenge in performance which again seems to have something to do with indexing(?). So here's the situation: I have two databases "abc" and "def". "abc" contains 1 xml doc with about 150.000 nodes and "def" contains 1 xml doc with about 400.000 nodes. Both are similarly strutured and have their up-to-date text and attr indexes. The xml docs look both (simplified) like the following:
rdf:RDF <ore:Aggregation rdf:about="123"> <edm:object rdf:resource="urn1"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn1"> <...> </edm:WebResource> <ore:Aggregation rdf:about="124"> <edm:object rdf:resource="urn2"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn2"> <...> </edm:WebResource> <ore:Aggregation rdf:about="125"> <edm:object rdf:resource="urn2"/> <edm:object rdf:resource="urn3"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn3"> <...> </edm:WebResource> <...> </rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down my original query to the following purpose to keep it simple: for each aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation return for $urn in $agg/edm:object/@rdf:resource return (# db:enforceindex #) {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the query info tells me that the attribute index for $urn is applied in both cases (this is also the case if I leave out the pragma). However, oddly enough, for the "larger" database "def" with a larger attribute index it takes roughly 1 second while the "smaller" database "abc" with a smaller attribute index takes 20 seconds. This is not very long but the original query is more complicated and I have bigger databases with the same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is that "abc" contains only 3 web resources that all 5.000 aggregations refer to. While in "def" each aggregation refers to a particular web resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that there is a "maximum number of distinct values to store per name". Is there a difference in performance because of that? Maybe I do not get the index structures but it feels strange that it takes longer to find the correct attribute in a range of 3 different values than in a range of 7.000. Maybe there is also another problem in my query, databases or my reasoning that I do not see? Either way, I need help in understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html