Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the query is working as expected. Today, I come back to you with another challenge in performance which again seems to have something to do with indexing(?). So here's the situation: I have two databases "abc" and "def". "abc" contains 1 xml doc with about 150.000 nodes and "def" contains 1 xml doc with about 400.000 nodes. Both are similarly strutured and have their up-to-date text and attr indexes. The xml docs look both (simplified) like the following:
rdf:RDF <ore:Aggregation rdf:about="123"> <edm:object rdf:resource="urn1"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn1"> <...> </edm:WebResource> <ore:Aggregation rdf:about="124"> <edm:object rdf:resource="urn2"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn2"> <...> </edm:WebResource> <ore:Aggregation rdf:about="125"> <edm:object rdf:resource="urn2"/> <edm:object rdf:resource="urn3"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn3"> <...> </edm:WebResource> <...> </rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down my original query to the following purpose to keep it simple: for each aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation return for $urn in $agg/edm:object/@rdf:resource return (# db:enforceindex #) {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the query info tells me that the attribute index for $urn is applied in both cases (this is also the case if I leave out the pragma). However, oddly enough, for the "larger" database "def" with a larger attribute index it takes roughly 1 second while the "smaller" database "abc" with a smaller attribute index takes 20 seconds. This is not very long but the original query is more complicated and I have bigger databases with the same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is that "abc" contains only 3 web resources that all 5.000 aggregations refer to. While in "def" each aggregation refers to a particular web resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that there is a "maximum number of distinct values to store per name". Is there a difference in performance because of that? Maybe I do not get the index structures but it feels strange that it takes longer to find the correct attribute in a range of 3 different values than in a range of 7.000. Maybe there is also another problem in my query, databases or my reasoning that I do not see? Either way, I need help in understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html
Hi Julia -
Preface: let me be clear when I say that I've wondered about some of this myself, so I don't think I have an answer for you. That being said, I wonder if this is a grouping/data modeling problem: i.e. you have 5,000 aggregations refer to 1 of 3 web resources vs ~7,000 aggregations each with their own distinct web resource.
If you created 3 databases for "abc" ( hm... "a", "b", and "c"? ☺), one for each web resource (i.e. where there would be only aggregations-to-the-specific-web-resource), would that help with query times at all? It might necessitate a bit of pre-processing in your creation step though.
In any event, I hope those random thoughts are helpful in some way. Best, Bridger
On Fri, Oct 25, 2019 at 10:24 AM Beck, Julia J.Beck@ub.uni-frankfurt.de wrote:
Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the query is working as expected. Today, I come back to you with another challenge in performance which again seems to have something to do with indexing(?). So here's the situation: I have two databases "abc" and "def". "abc" contains 1 xml doc with about 150.000 nodes and "def" contains 1 xml doc with about 400.000 nodes. Both are similarly strutured and have their up-to-date text and attr indexes. The xml docs look both (simplified) like the following:
rdf:RDF <ore:Aggregation rdf:about="123"> <edm:object rdf:resource="urn1"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn1"> <...> </edm:WebResource> <ore:Aggregation rdf:about="124"> <edm:object rdf:resource="urn2"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn2"> <...> </edm:WebResource> <ore:Aggregation rdf:about="125"> <edm:object rdf:resource="urn2"/> <edm:object rdf:resource="urn3"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn3"> <...> </edm:WebResource> <...> </rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down my original query to the following purpose to keep it simple: for each aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation return for $urn in $agg/edm:object/@rdf:resource return (# db:enforceindex #) {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the query info tells me that the attribute index for $urn is applied in both cases (this is also the case if I leave out the pragma). However, oddly enough, for the "larger" database "def" with a larger attribute index it takes roughly 1 second while the "smaller" database "abc" with a smaller attribute index takes 20 seconds. This is not very long but the original query is more complicated and I have bigger databases with the same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is that "abc" contains only 3 web resources that all 5.000 aggregations refer to. While in "def" each aggregation refers to a particular web resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that there is a "maximum number of distinct values to store per name". Is there a difference in performance because of that? Maybe I do not get the index structures but it feels strange that it takes longer to find the correct attribute in a range of 3 different values than in a range of 7.000. Maybe there is also another problem in my query, databases or my reasoning that I do not see? Either way, I need help in understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html
Julia, all,
apologies - I hit 'send' a bit too quickly. I wanted to ask how you were ingesting RDF, what the size of your input looked like, and how you were dealing with namespaces. I have some RDF/MADS data from the Library of Congress and I've never been able to ingest it into BaseX, even stripping namespaces, etc, things seem to run out of memory or stall.
Would you be willing to share some details about your data?
Thanks very much. Best, Bridger
On Fri, Oct 25, 2019 at 3:41 PM Bridger Dyson-Smith bdysonsmith@gmail.com wrote:
Hi Julia -
Preface: let me be clear when I say that I've wondered about some of this myself, so I don't think I have an answer for you. That being said, I wonder if this is a grouping/data modeling problem: i.e. you have 5,000 aggregations refer to 1 of 3 web resources vs ~7,000 aggregations each with their own distinct web resource.
If you created 3 databases for "abc" ( hm... "a", "b", and "c"? ☺), one for each web resource (i.e. where there would be only aggregations-to-the-specific-web-resource), would that help with query times at all? It might necessitate a bit of pre-processing in your creation step though.
In any event, I hope those random thoughts are helpful in some way. Best, Bridger
On Fri, Oct 25, 2019 at 10:24 AM Beck, Julia J.Beck@ub.uni-frankfurt.de wrote:
Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the query is working as expected. Today, I come back to you with another challenge in performance which again seems to have something to do with indexing(?). So here's the situation: I have two databases "abc" and "def". "abc" contains 1 xml doc with about 150.000 nodes and "def" contains 1 xml doc with about 400.000 nodes. Both are similarly strutured and have their up-to-date text and attr indexes. The xml docs look both (simplified) like the following:
rdf:RDF <ore:Aggregation rdf:about="123"> <edm:object rdf:resource="urn1"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn1"> <...> </edm:WebResource> <ore:Aggregation rdf:about="124"> <edm:object rdf:resource="urn2"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn2"> <...> </edm:WebResource> <ore:Aggregation rdf:about="125"> <edm:object rdf:resource="urn2"/> <edm:object rdf:resource="urn3"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn3"> <...> </edm:WebResource> <...> </rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down my original query to the following purpose to keep it simple: for each aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation return for $urn in $agg/edm:object/@rdf:resource return (# db:enforceindex #) {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the query info tells me that the attribute index for $urn is applied in both cases (this is also the case if I leave out the pragma). However, oddly enough, for the "larger" database "def" with a larger attribute index it takes roughly 1 second while the "smaller" database "abc" with a smaller attribute index takes 20 seconds. This is not very long but the original query is more complicated and I have bigger databases with the same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is that "abc" contains only 3 web resources that all 5.000 aggregations refer to. While in "def" each aggregation refers to a particular web resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that there is a "maximum number of distinct values to store per name". Is there a difference in performance because of that? Maybe I do not get the index structures but it feels strange that it takes longer to find the correct attribute in a range of 3 different values than in a range of 7.000. Maybe there is also another problem in my query, databases or my reasoning that I do not see? Either way, I need help in understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html
Hi Bridger,
thanks for your thoughts! It might be a good idea to change the modeling there. I already have extra databases for other entities (agents, places, events, ...), so it might be a good idea to separate the web resources from the rest, too. But I have to test it. I have not done it yet, because I tried to follow the EDM structure where the web resource is one of the core classes in the model, so I tried to keep it together with the aggregation and the acutal object. Thanks for making me rethink this. Though I am still curious about why this performance difference occurs.
For your other questions: We are implementing a search portal with metadata from the performing arts domain and for that we are aggregating metadata from currently about 20 different data providers (museums, libraries, archives). The metadata comes to us in different ways of delivery/harvesting and has very different formats and standards (none of them is RDF from the beginning). That being said, I use BaseX to do manipulation like transforming and enrichment on the metadata. Outside of BaseX, I am mostly using Python and I ingest the data with the Python client to BaseX. Like you, I had out of memory issuses in the beginning. But I started to batch all the records to reasonable sized packages of about 1000 records (combine files with only one record per file to bigger files; split files with e.g. 100000 records to smaller files). With these batches I have no problem to get the data into BaseX and work with it there and outside of BaseX. Even if the input is JSON. I also think it is important to use the ADDCACHE option when ingesting larger files. I have one database for each data provider (actually my original question refers to two of those databases [they are not called abc and def ;-)], though I left out the object part as it is not relevant here). And several databases for the different entities as mentioned above. So far, I did not have bigger problems with the namespaces, maybe because I am not doing anything specific with them and mostly keep them they way they are.
Best wishes, Julia
On Fri, 2019-10-25 at 15:44 -0400, Bridger Dyson-Smith wrote:
Julia, all,
apologies - I hit 'send' a bit too quickly. I wanted to ask how you were ingesting RDF, what the size of your input looked like, and how you were dealing with namespaces. I have some RDF/MADS data from the Library of Congress and I've never been able to ingest it into BaseX, even stripping namespaces, etc, things seem to run out of memory or stall.
Would you be willing to share some details about your data?
Thanks very much. Best, Bridger
On Fri, Oct 25, 2019 at 3:41 PM Bridger Dyson-Smith < bdysonsmith@gmail.com> wrote:
Hi Julia -
Preface: let me be clear when I say that I've wondered about some of this myself, so I don't think I have an answer for you. That being said, I wonder if this is a grouping/data modeling problem: i.e. you have 5,000 aggregations refer to 1 of 3 web resources vs ~7,000 aggregations each with their own distinct web resource.
If you created 3 databases for "abc" ( hm... "a", "b", and "c"? ☺), one for each web resource (i.e. where there would be only aggregations-to-the-specific-web-resource), would that help with query times at all? It might necessitate a bit of pre-processing in your creation step though.
In any event, I hope those random thoughts are helpful in some way. Best, Bridger
On Fri, Oct 25, 2019 at 10:24 AM Beck, Julia < J.Beck@ub.uni-frankfurt.de> wrote:
Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the query is working as expected. Today, I come back to you with another challenge in performance which again seems to have something to do with indexing(?). So here's the situation: I have two databases "abc" and "def". "abc" contains 1 xml doc with about 150.000 nodes and "def" contains 1 xml doc with about 400.000 nodes. Both are similarly strutured and have their up-to-date text and attr indexes. The xml docs look both (simplified) like the following:
rdf:RDF <ore:Aggregation rdf:about="123"> <edm:object rdf:resource="urn1"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn1"> <...> </edm:WebResource> <ore:Aggregation rdf:about="124"> <edm:object rdf:resource="urn2"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn2"> <...> </edm:WebResource> <ore:Aggregation rdf:about="125"> <edm:object rdf:resource="urn2"/> <edm:object rdf:resource="urn3"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn3"> <...> </edm:WebResource> <...> </rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down my original query to the following purpose to keep it simple: for each aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation return for $urn in $agg/edm:object/@rdf:resource return (# db:enforceindex #) {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the query info tells me that the attribute index for $urn is applied in both cases (this is also the case if I leave out the pragma). However, oddly enough, for the "larger" database "def" with a larger attribute index it takes roughly 1 second while the "smaller" database "abc" with a smaller attribute index takes 20 seconds. This is not very long but the original query is more complicated and I have bigger databases with the same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is that "abc" contains only 3 web resources that all 5.000 aggregations refer to. While in "def" each aggregation refers to a particular web resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that there is a "maximum number of distinct values to store per name". Is there a difference in performance because of that? Maybe I do not get the index structures but it feels strange that it takes longer to find the correct attribute in a range of 3 different values than in a range of 7.000. Maybe there is also another problem in my query, databases or my reasoning that I do not see? Either way, I need help in understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html
Julia - Wow, thank you so much for your detailed response -- your project sounds really neat!
I'm sorry to say that I forgot to speak to your question:
On Mon, Oct 28, 2019 at 5:51 AM Beck, Julia J.Beck@ub.uni-frankfurt.de wrote:
Hi Bridger,
thanks for your thoughts! It might be a good idea to change the modeling there. I already have extra databases for other entities (agents, places, events, ...), so it might be a good idea to separate the web resources from the rest, too. But I have to test it. I have not done it yet, because I tried to follow the EDM structure where the web resource is one of the core classes in the model, so I tried to keep it together with the aggregation and the acutal object. Thanks for making me rethink this.
Though I am still curious about why this performance difference occurs.
I think (but am not sure) that the performance hit is due to BaseX having
to examine all 5000 webResources sequentially to compare the webResource[@rdf:about] to the $urn value (note: I'm probably saying something egregiously incorrect in that sentence: hopefully Christian, or another knowledgeable user, will be along to help us understand things a bit better).
Does it help at all to wrap the entire query in the pragma?
For your other questions: We are implementing a search portal with metadata from the performing arts domain and for that we are aggregating metadata from currently about 20 different data providers (museums, libraries, archives). The metadata comes to us in different ways of delivery/harvesting and has very different formats and standards (none of them is RDF from the beginning). That being said, I use BaseX to do manipulation like transforming and enrichment on the metadata. Outside of BaseX, I am mostly using Python and I ingest the data with the Python client to BaseX. Like you, I had out of memory issuses in the beginning. But I started to batch all the records to reasonable sized packages of about 1000 records (combine files with only one record per file to bigger files; split files with e.g. 100000 records to smaller files). With these batches I have no problem to get the data into BaseX and work with it there and outside of BaseX. Even if the input is JSON. I also think it is important to use the ADDCACHE option when ingesting larger files. I have one database for each data provider (actually my original question refers to two of those databases [they are not called abc and def ;-)], though I left out the object part as it is not relevant here). And several databases for the different entities as mentioned above. So far, I did not have bigger problems with the namespaces, maybe because I am not doing anything specific with them and mostly keep them they way they are.
Best wishes, Julia
Kind regards, Bridger
On Fri, 2019-10-25 at 15:44 -0400, Bridger Dyson-Smith wrote:
Julia, all,
apologies - I hit 'send' a bit too quickly. I wanted to ask how you were ingesting RDF, what the size of your input looked like, and how you were dealing with namespaces. I have some RDF/MADS data from the Library of Congress and I've never been able to ingest it into BaseX, even stripping namespaces, etc, things seem to run out of memory or stall.
Would you be willing to share some details about your data?
Thanks very much. Best, Bridger
On Fri, Oct 25, 2019 at 3:41 PM Bridger Dyson-Smith < bdysonsmith@gmail.com> wrote:
Hi Julia -
Preface: let me be clear when I say that I've wondered about some of this myself, so I don't think I have an answer for you. That being said, I wonder if this is a grouping/data modeling problem: i.e. you have 5,000 aggregations refer to 1 of 3 web resources vs ~7,000 aggregations each with their own distinct web resource.
If you created 3 databases for "abc" ( hm... "a", "b", and "c"? ☺), one for each web resource (i.e. where there would be only aggregations-to-the-specific-web-resource), would that help with query times at all? It might necessitate a bit of pre-processing in your creation step though.
In any event, I hope those random thoughts are helpful in some way. Best, Bridger
On Fri, Oct 25, 2019 at 10:24 AM Beck, Julia < J.Beck@ub.uni-frankfurt.de> wrote:
Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the query is working as expected. Today, I come back to you with another challenge in performance which again seems to have something to do with indexing(?). So here's the situation: I have two databases "abc" and "def". "abc" contains 1 xml doc with about 150.000 nodes and "def" contains 1 xml doc with about 400.000 nodes. Both are similarly strutured and have their up-to-date text and attr indexes. The xml docs look both (simplified) like the following:
rdf:RDF <ore:Aggregation rdf:about="123"> <edm:object rdf:resource="urn1"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn1"> <...> </edm:WebResource> <ore:Aggregation rdf:about="124"> <edm:object rdf:resource="urn2"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn2"> <...> </edm:WebResource> <ore:Aggregation rdf:about="125"> <edm:object rdf:resource="urn2"/> <edm:object rdf:resource="urn3"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn3"> <...> </edm:WebResource> <...> </rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down my original query to the following purpose to keep it simple: for each aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation return for $urn in $agg/edm:object/@rdf:resource return (# db:enforceindex #) {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the query info tells me that the attribute index for $urn is applied in both cases (this is also the case if I leave out the pragma). However, oddly enough, for the "larger" database "def" with a larger attribute index it takes roughly 1 second while the "smaller" database "abc" with a smaller attribute index takes 20 seconds. This is not very long but the original query is more complicated and I have bigger databases with the same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is that "abc" contains only 3 web resources that all 5.000 aggregations refer to. While in "def" each aggregation refers to a particular web resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that there is a "maximum number of distinct values to store per name". Is there a difference in performance because of that? Maybe I do not get the index structures but it feels strange that it takes longer to find the correct attribute in a range of 3 different values than in a range of 7.000. Maybe there is also another problem in my query, databases or my reasoning that I do not see? Either way, I need help in understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html
Hi Julia,
I’m glad to hear the enforce bug fix is working as expected.
I read through your use case, and I believe your observation with regards to the differences of 'abc' and 'def' was a good hint:
The db:attribute function yields all attribute nodes of a database that match the specified string in a first step, and reduces them to the attributes with the corresponding name in a second step. In the following (worst) case example…
<doc id='1'> <ref idref='1'/> .... 998 more times <ref idref='1'/> </doc>
…the db:attribute('db', '1', 'id') function will receive 1000 attribute nodes from the index (whis very fast). In the subsequent name tests, the targeted attributes will be requested from the database, and only one attribute will be accepted and returned as result.
In your 'abc' document, most resource ids seem to be identical, so my guess is that your index lookup leads to a large number of hits that will be checked with each db:attribute call.
Maybe your query can be sped up with a map:
let $resources := map:merge( for $r in db:open($db_name)/rdf:RDF/edm:WebResource return map:entry($r/@rdf:about, $r) ) for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation for $urn in $agg/edm:object/@rdf:resource return $resources($urn)
One more solution could be to restrict the attribute index to specific attributes (e.g., rdf:about) via the ATTRINCLUDE option.
Cheers, Christian
PS: If my explanations were too cryptical, feel free to ask for enlightenment.
On Fri, Oct 25, 2019 at 4:24 PM Beck, Julia J.Beck@ub.uni-frankfurt.de wrote:
Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the query is working as expected. Today, I come back to you with another challenge in performance which again seems to have something to do with indexing(?). So here's the situation: I have two databases "abc" and "def". "abc" contains 1 xml doc with about 150.000 nodes and "def" contains 1 xml doc with about 400.000 nodes. Both are similarly strutured and have their up-to-date text and attr indexes. The xml docs look both (simplified) like the following:
rdf:RDF <ore:Aggregation rdf:about="123"> <edm:object rdf:resource="urn1"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn1"> <...> </edm:WebResource> <ore:Aggregation rdf:about="124"> <edm:object rdf:resource="urn2"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn2"> <...> </edm:WebResource> <ore:Aggregation rdf:about="125"> <edm:object rdf:resource="urn2"/> <edm:object rdf:resource="urn3"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn3"> <...> </edm:WebResource> <...> </rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down my original query to the following purpose to keep it simple: for each aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation return for $urn in $agg/edm:object/@rdf:resource return (# db:enforceindex #) {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the query info tells me that the attribute index for $urn is applied in both cases (this is also the case if I leave out the pragma). However, oddly enough, for the "larger" database "def" with a larger attribute index it takes roughly 1 second while the "smaller" database "abc" with a smaller attribute index takes 20 seconds. This is not very long but the original query is more complicated and I have bigger databases with the same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is that "abc" contains only 3 web resources that all 5.000 aggregations refer to. While in "def" each aggregation refers to a particular web resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that there is a "maximum number of distinct values to store per name". Is there a difference in performance because of that? Maybe I do not get the index structures but it feels strange that it takes longer to find the correct attribute in a range of 3 different values than in a range of 7.000. Maybe there is also another problem in my query, databases or my reasoning that I do not see? Either way, I need help in understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html
Hi Christian,
I see, that makes sense. Thank you for your explanation, I think I understand better now how it works "in the background" and why it was slow before. The map is a great solution for my original query which is super fast now! Thank you for your help,
Julia
On Tue, 2019-10-29 at 03:56 +0100, Christian Grün wrote:
Hi Julia,
I’m glad to hear the enforce bug fix is working as expected.
I read through your use case, and I believe your observation with regards to the differences of 'abc' and 'def' was a good hint:
The db:attribute function yields all attribute nodes of a database that match the specified string in a first step, and reduces them to the attributes with the corresponding name in a second step. In the following (worst) case example…
<doc id='1'> <ref idref='1'/> .... 998 more times <ref idref='1'/> </doc>
…the db:attribute('db', '1', 'id') function will receive 1000 attribute nodes from the index (whis very fast). In the subsequent name tests, the targeted attributes will be requested from the database, and only one attribute will be accepted and returned as result.
In your 'abc' document, most resource ids seem to be identical, so my guess is that your index lookup leads to a large number of hits that will be checked with each db:attribute call.
Maybe your query can be sped up with a map:
let $resources := map:merge( for $r in db:open($db_name)/rdf:RDF/edm:WebResource return map:entry($r/@rdf:about, $r) ) for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation for $urn in $agg/edm:object/@rdf:resource return $resources($urn)
One more solution could be to restrict the attribute index to specific attributes (e.g., rdf:about) via the ATTRINCLUDE option.
Cheers, Christian
PS: If my explanations were too cryptical, feel free to ask for enlightenment.
On Fri, Oct 25, 2019 at 4:24 PM Beck, Julia < J.Beck@ub.uni-frankfurt.de> wrote:
Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the query is working as expected. Today, I come back to you with another challenge in performance which again seems to have something to do with indexing(?). So here's the situation: I have two databases "abc" and "def". "abc" contains 1 xml doc with about 150.000 nodes and "def" contains 1 xml doc with about 400.000 nodes. Both are similarly strutured and have their up-to-date text and attr indexes. The xml docs look both (simplified) like the following:
rdf:RDF <ore:Aggregation rdf:about="123"> <edm:object rdf:resource="urn1"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn1"> <...> </edm:WebResource> <ore:Aggregation rdf:about="124"> <edm:object rdf:resource="urn2"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn2"> <...> </edm:WebResource> <ore:Aggregation rdf:about="125"> <edm:object rdf:resource="urn2"/> <edm:object rdf:resource="urn3"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn3"> <...> </edm:WebResource> <...> </rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down my original query to the following purpose to keep it simple: for each aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation return for $urn in $agg/edm:object/@rdf:resource return (# db:enforceindex #) {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the query info tells me that the attribute index for $urn is applied in both cases (this is also the case if I leave out the pragma). However, oddly enough, for the "larger" database "def" with a larger attribute index it takes roughly 1 second while the "smaller" database "abc" with a smaller attribute index takes 20 seconds. This is not very long but the original query is more complicated and I have bigger databases with the same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is that "abc" contains only 3 web resources that all 5.000 aggregations refer to. While in "def" each aggregation refers to a particular web resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that there is a "maximum number of distinct values to store per name". Is there a difference in performance because of that? Maybe I do not get the index structures but it feels strange that it takes longer to find the correct attribute in a range of 3 different values than in a range of 7.000. Maybe there is also another problem in my query, databases or my reasoning that I do not see? Either way, I need help in understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html
basex-talk@mailman.uni-konstanz.de