Julia -
Wow, thank you so much for your detailed response -- your project sounds really neat!
I'm sorry to say that I forgot to speak to your question:
Hi Bridger,
thanks for your thoughts! It might be a good idea to change the
modeling there. I already have separate databases for other entities
(agents, places, events, ...), so separating the web resources from the
rest could make sense, too. But I have to test it. I have not done it
yet because I tried to follow the EDM structure, where the web resource
is one of the core classes in the model, so I kept it together with the
aggregation and the actual object. Thanks for making me rethink this.
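If I do separate them, I imagine the lookup would become a join across
two databases. A rough, untested sketch of what I have in mind is below
("webresources" is just a made-up database name, and the namespace
declarations are the usual RDF/ORE/EDM URIs, so adjust them if yours
differ):

declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace ore = "http://www.openarchives.org/ore/terms/";
declare namespace edm = "http://www.europeana.eu/schemas/edm/";
declare variable $db_name external;

(: aggregations stay in the provider database; the web resources would
   live in their own, hypothetical database called "webresources" :)
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation
for $urn in $agg/edm:object/@rdf:resource
return db:open('webresources')/rdf:RDF/edm:WebResource[@rdf:about = $urn]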
Though I am still curious about why this performance difference occurs.
I think (but am not sure) that the performance hit is due to BaseX having to examine all 5000 webResources sequentially to compare each webResource[@rdf:about] to the $urn value (note: I'm probably saying something egregiously incorrect in that sentence; hopefully Christian, or another knowledgeable user, will be along to help us understand things a bit better).
Does it help at all to wrap the entire query in the pragma?
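Something along these lines is what I mean; this is only a sketch and I
have not tried it against your data (the namespace declarations are the
standard RDF/ORE/EDM URIs, so adjust them if yours differ):

declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace ore = "http://www.openarchives.org/ore/terms/";
declare namespace edm = "http://www.europeana.eu/schemas/edm/";
declare variable $db_name external;

(: same lookup as in your query, but the pragma now spans the whole
   FLWOR instead of only the inner db:open(...) lookup :)
(# db:enforceindex #) {
  for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation
  for $urn in $agg/edm:object/@rdf:resource
  return db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about = $urn]
}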
For your other questions: we are implementing a search portal with
metadata from the performing arts domain, and for that we are
aggregating metadata from currently about 20 different data providers
(museums, libraries, archives). The metadata reaches us through
different delivery/harvesting channels and comes in very different
formats and standards (none of it is RDF to begin with). I use BaseX
for manipulations such as transformation and enrichment of the
metadata.
Outside of BaseX, I mostly use Python, and I ingest the data into BaseX
with the Python client. Like you, I had out-of-memory issues in the
beginning. But I started to batch all the records into reasonably sized
packages of about 1000 records each (combining files that hold only one
record apiece into bigger files; splitting files with e.g. 100000
records into smaller ones). With these batches I have no problem
getting the data into BaseX and working with it there and outside of
BaseX, even if the input is JSON. I also think it is important to use
the ADDCACHE option when ingesting larger files.
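In practice I do the adding from the Python client, but the equivalent
step inside BaseX would look roughly like the snippet below. Treat it
as a sketch: I am not entirely sure whether ADDCACHE can be passed
directly as a db:add option like this or has to be set globally, and
the batch file name is made up.

(: hypothetical ingest of one pre-built batch of about 1000 records;
   $batch is the path to e.g. batch-0001.xml on disk :)
declare variable $db_name external;
declare variable $batch external;

(: assumption: "addcache" is accepted as an option of db:add :)
db:add($db_name, $batch, 'batch-0001.xml', map { 'addcache': true() })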
I have one database for each data provider (my original question
actually refers to two of those databases [they are not called abc and
def ;-)], though I left out the object part as it is not relevant
here), plus several databases for the different entities mentioned
above. So far I have not had any major problems with the namespaces,
maybe because I am not doing anything specific with them and mostly
keep them the way they are.
Best wishes,
Julia
Kind regards,
Bridger
On Fri, 2019-10-25 at 15:44 -0400, Bridger Dyson-Smith wrote:
> Julia, all,
>
> apologies - I hit 'send' a bit too quickly. I wanted to ask how you
> were ingesting RDF, what the size of your input looked like, and how
> you were dealing with namespaces. I have some RDF/MADS data from the
> Library of Congress and I've never been able to ingest it into BaseX;
> even after stripping namespaces, etc., things seem to run out of
> memory or stall.
>
> Would you be willing to share some details about your data?
>
> Thanks very much.
> Best,
> Bridger
>
> On Fri, Oct 25, 2019 at 3:41 PM Bridger Dyson-Smith
> <bdysonsmith@gmail.com> wrote:
>
> > Hi Julia -
> >
> > Preface: let me be clear when I say that I've wondered about some of
> > this myself, so I don't think I have an answer for you. That being
> > said, I wonder if this is a grouping/data modeling problem: i.e. you
> > have 5,000 aggregations referring to 1 of 3 web resources vs. ~7,000
> > aggregations each with their own distinct web resource.
> >
> > If you created 3 databases for "abc" (hm... "a", "b", and "c"? ☺),
> > one for each web resource (i.e. where there would be only
> > aggregations-to-the-specific-web-resource), would that help with
> > query times at all? It might necessitate a bit of pre-processing in
> > your creation step though.
> >
> > In any event, I hope those random thoughts are helpful in some way.
> > Best,
> > Bridger
> >
> >
> > On Fri, Oct 25, 2019 at 10:24 AM Beck, Julia
> > <J.Beck@ub.uni-frankfurt.de> wrote:
> >
> > > Hi,
> > >
> > > first of all: thank you, the fix for [1] did the trick and in
> > > 9.2.4 the query is working as expected.
> > > Today I come back to you with another performance challenge, which
> > > again seems to have something to do with indexing(?). So here's
> > > the situation:
> > > I have two databases, "abc" and "def". "abc" contains 1 XML doc
> > > with about 150.000 nodes and "def" contains 1 XML doc with about
> > > 400.000 nodes. Both are similarly structured and have up-to-date
> > > text and attribute indexes. Both XML docs look (simplified) like
> > > the following:
> > >
> > > <rdf:RDF>
> > >   <ore:Aggregation rdf:about="123">
> > >     <edm:object rdf:resource="urn1"/>
> > >     <...>
> > >   </ore:Aggregation>
> > >   <edm:WebResource rdf:about="urn1">
> > >     <...>
> > >   </edm:WebResource>
> > >   <ore:Aggregation rdf:about="124">
> > >     <edm:object rdf:resource="urn2"/>
> > >     <...>
> > >   </ore:Aggregation>
> > >   <edm:WebResource rdf:about="urn2">
> > >     <...>
> > >   </edm:WebResource>
> > >   <ore:Aggregation rdf:about="125">
> > >     <edm:object rdf:resource="urn2"/>
> > >     <edm:object rdf:resource="urn3"/>
> > >     <...>
> > >   </ore:Aggregation>
> > >   <edm:WebResource rdf:about="urn3">
> > >     <...>
> > >   </edm:WebResource>
> > >   <...>
> > > </rdf:RDF>
> > >
> > > So one aggregation refers to one (or more) web resources. To keep
> > > it simple, I boiled my original query down to the following
> > > purpose: for each aggregation, give me the corresponding web
> > > resource.
> > >
> > > for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation
> > > return for $urn in $agg/edm:object/@rdf:resource
> > > return (# db:enforceindex #)
> > > {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
> > >
> > > For both databases the query gives me the required result, and the
> > > query info tells me that the attribute index for $urn is applied
> > > in both cases (this is also the case if I leave out the pragma).
> > > However, oddly enough, for the "larger" database "def" with a
> > > larger attribute index it takes roughly 1 second, while the
> > > "smaller" database "abc" with a smaller attribute index takes 20
> > > seconds. This is not very long, but the original query is more
> > > complicated and I have bigger databases with the same structure
> > > where it starts to matter.
> > >
> > > The only (and I think important) difference between "abc" and
> > > "def" is that "abc" contains only 3 web resources, which all 5.000
> > > aggregations refer to, while in "def" each aggregation refers to
> > > its own particular web resource (== 7.000 aggregations and 7.000
> > > web resources).
> > >
> > > With index:facets I had a look at the facet values and learned
> > > that there is a "maximum number of distinct values to store per
> > > name". Is there a difference in performance because of that? Maybe
> > > I do not understand the index structures, but it feels strange
> > > that it takes longer to find the correct attribute among 3
> > > distinct values than among 7.000. Maybe there is also another
> > > problem in my query, my databases, or my reasoning that I do not
> > > see? Either way, I need help understanding this phenomenon :-)
> > >
> > > I hope you could follow; please don't hesitate to ask if you need
> > > anything to reproduce this situation (I am using BaseX 9.2.4).
> > >
> > > Julia
> > >
> > > [1]
> > > https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html
> > >