Re: [basex-talk] Understanding the path index

29 Oct 2019


      Hi Christian,
I see, that makes sense. Thank you for your explanation, I think I
understand better now how it works "in the background" and why it was
slow before. The map is a great solution for my original query which is
super fast now!
Thank you for your help,
Julia
On Tue, 2019-10-29 at 03:56 +0100, Christian Grün wrote:
...
Hi Julia,
I’m glad to hear the enforce bug fix is working as expected.
I read through your use case, and I believe your observation with
regards to the differences of 'abc' and 'def' was a good hint:
The db:attribute function yields all attribute nodes of a database
that match the specified string in a first step, and reduces them to
the attributes with the corresponding name in a second step. In the
following (worst) case example…
  <doc id='1'>
    <ref idref='1'/>
    .... 998 more times <ref idref='1'/>
  </doc>
…the db:attribute('db', '1', 'id') function will receive 1000
attribute nodes from the index (whis very fast). In the subsequent
name tests, the targeted attributes will be requested from the
database, and only one attribute will be accepted and returned as
result.
In your 'abc' document, most resource ids seem to be identical, so my
guess is that your index lookup leads to a large number of hits that
will be checked with each db:attribute call.
Maybe your query can be sped up with a map:
let $resources := map:merge(
    for $r in db:open($db_name)/rdf:RDF/edm:WebResource
    return map:entry($r/@rdf:about, $r)
  )
  for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation
  for $urn in $agg/edm:object/@rdf:resource
  return $resources($urn)
One more solution could be to restrict the attribute index to
specific
attributes (e.g., rdf:about) via the ATTRINCLUDE option.
Cheers,
Christian
PS: If my explanations were too cryptical, feel free to ask for
enlightenment.
On Fri, Oct 25, 2019 at 4:24 PM Beck, Julia <
J.Beck@ub.uni-frankfurt.de> wrote:
...
Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4
the
query is working as expected.
Today, I come back to you with another challenge in performance
which
again seems to have something to do with indexing(?). So here's the
situation:
I have two databases "abc" and "def". "abc" contains 1 xml doc with
about 150.000 nodes and "def" contains 1 xml doc with about 400.000
nodes. Both are similarly strutured and have their up-to-date text
and
attr indexes. The xml docs look both (simplified) like the
following:
rdf:RDF
  <ore:Aggregation rdf:about="123">
     <edm:object rdf:resource="urn1"/>
     <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn1">
     <...>
  </edm:WebResource>
  <ore:Aggregation rdf:about="124">
     <edm:object rdf:resource="urn2"/>
     <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn2">
     <...>
  </edm:WebResource>
  <ore:Aggregation rdf:about="125">
     <edm:object rdf:resource="urn2"/>
     <edm:object rdf:resource="urn3"/>
     <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn3">
     <...>
  </edm:WebResource>
  <...>
</rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled
down
my original query to the following purpose to keep it simple: for
each
aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation
   return for $urn in $agg/edm:object/@rdf:resource
     return (# db:enforceindex #)
   {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the
query
info tells me that the attribute index for $urn is applied in both
cases (this is also the case if I leave out the pragma). However,
oddly
enough, for the "larger" database "def" with a larger attribute
index
it takes roughly 1 second while the "smaller" database "abc" with a
smaller attribute index takes 20 seconds. This is not very long but
the
original query is more complicated and I have bigger databases with
the
same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def"
is
that "abc" contains only 3 web resources that all 5.000
aggregations
refer to. While in "def" each aggregation refers to a particular
web
resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that
there is a "maximum number of distinct values to store per name".
Is
there a difference in performance because of that? Maybe I do not
get
the index structures but it feels strange that it takes longer to
find
the correct attribute in a range of 3 different values than in a
range
of 7.000. Maybe there is also another problem in my query,
databases or
my reasoning that I do not see? Either way, I need help in
understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need
anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: [basex-talk] Understanding the path index