Hi all,
I have two databases in BaseX, source_db and target_db, and would like to merge them by matching on the id attribute of each element and upserting: a `replace` if the element already exists in target_db, or an `insert` if it does not. source_db has about 100,000 elements, and target_db has about 1,000,000 elements. The databases look like this:
<!-- source_db contents -->
<root>
  <element id="1"/>
  <element id="2"/>
</root>

<!-- target_db contents -->
<root>
  <element id="1"/>
</root>
And my query to merge the two looks like this:
for $e in db:open("source_db")/root/element
return (
  if (exists(db:open("target_db")/root/element[@id = data($e/@id)]))
  then replace node db:open("target_db")/root/element[@id = data($e/@id)] with $e
  else insert node $e into db:open("target_db")/root
)
When running the query, however, I keep getting memory errors. Using a POST request to BaseX's REST interface I get "Out of Main Memory", and using the BaseX Java client (https://github.com/BaseXdb/basex/blob/master/basex-examples/src/main/java/or...) I get "java.io.IOException: GC overhead limit exceeded".
Ideally I would like to just process one element from source_db at a time to avoid memory issues, but it seems like my query isn't doing this. I've tried using the `db:copynode false` pragma but it did not make a difference.
Is there any way to accomplish this? Thanks in advance!
Matt
Hi Matt,
If there are only a few database entries that differ, you could compare the database entries before replacing them:
for $se in db:open("source_db")/root/element
let $te := db:open("target_db")/root/element[@id = $se/@id]
return
  if (empty($te)) then (
    insert node $se into db:open("target_db")/root
  ) else if (not(deep-equal($se, $te))) then (
    replace node $te with $se
  ) else ()
If that doesn’t help, and if increasing memory is not an option, you could call your script multiple times and pass the first entry and the number of entries as external variables:
declare variable $FIRST external;
declare variable $COUNT external;
let $range := $FIRST to $FIRST + $COUNT - 1
for $se in db:open("source_db")/root/element[position() = $range]
let $te := db:open("target_db")/root/element[@id = $se/@id]
return
  if (empty($te)) then (
    insert node $se into db:open("target_db")/root
  ) else if (not(deep-equal($se, $te))) then (
    replace node $te with $se
  ) else ()
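The script can then be invoked repeatedly with a sliding window, for example with the command-line client (the file name merge.xq and the batch size are just placeholders):

basex -bFIRST=1 -bCOUNT=10000 merge.xq
basex -bFIRST=10001 -bCOUNT=10000 merge.xq
basex -bFIRST=20001 -bCOUNT=10000 merge.xq

(If the range arithmetic complains about the bound values, declaring the two variables as xs:integer should take care of it.)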
The “pending update list” is the reason why update operations are not performed immediately [1]. This concept ensures that single update operations cannot cause conflicts that would require an eventual rollback. The obvious drawback is that it leads to higher memory consumption.
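For instance, in a toy query like the following, all three inserts are first collected on the pending update list and only applied to the database after the query has finished:

for $i in 1 to 3
return insert node <entry nr="{ $i }"/> into db:open("target_db")/root

With 100,000 source elements, the full set of pending replace/insert operations has to be kept in memory before anything is written.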
Hope this helps, Christian
[1] https://docs.basex.org/wiki/XQuery_Update#Pending_Update_List
Thank you Christian and Graydon! I've got a solution working by calling the script multiple times to do only a subset of the updates each time. On a different (though related) note, do I understand correctly that I should run a "db:optimize($db)" after each time I perform a significant amount of updates?
Matt
On Mon, Jul 26, 2021 at 1:32 PM Graydon graydonish@gmail.com wrote:
On Mon, Jul 26, 2021 at 01:08:19PM -0400, Matthew Dziuban scripsit:
Is there any way to accomplish this? Thanks in advance!
It might be easier to generate the merged content, create a new db from that, and then replace the old target DB wholesale.
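A rough, untested sketch of that approach (the database name target_db_merged is just a placeholder, and it assumes the merged document fits in memory):

let $source := db:open("source_db")/root
let $target := db:open("target_db")/root
let $merged :=
  <root>{
    (: all source elements win; keep only the target elements
       whose ids do not occur in the source :)
    $source/element,
    $target/element[not(@id = $source/element/@id)]
  }</root>
return db:create("target_db_merged", $merged, "target.xml")

Afterwards the new database could be swapped in for the old one, e.g. by dropping target_db and renaming the new database with db:alter.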
-- Graydon
It’s definitely advisable if you perform queries that take advantage of the BaseX index structures – which is already the case for root/element[@id = $se/@id]. If the UPDINDEX option is enabled, the index structures will always be kept up to date (but even then, a database optimization may be advisable to minimize the index structures).
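For example, after each batch of updates:

db:optimize("target_db")

or, to rebuild all index structures from scratch:

db:optimize("target_db", true())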
Thanks again, Christian. Regardless of whether I have the UPDINDEX and AUTOOPTIMIZE options enabled, I'm seeing that my first set of updates runs pretty quickly (in about 90 seconds) but any subsequent set of updates hangs indefinitely -- I let it run for over 2 hours and it never completed. Do you have any idea what could be going on?
On Tue, Jul 27, 2021 at 04:43:16PM -0400, Matthew Dziuban scripsit:
Thanks again, Christian. Regardless of whether I have the UPDINDEX and AUTOOPTIMIZE options enabled, I'm seeing that my first set of updates runs pretty quickly (in about 90 seconds) but any subsequent set of updates hangs indefinitely -- I let it run for over 2 hours and it never completed. Do you have any idea what could be going on?
Christian may well be able to tell you specifically, but in general, it sounds like some resource is thrashing.
I think, starting from:
for $e in db:open("source_db")/root/element
return (
  if (exists(db:open("target_db")/root/element[@id = data($e/@id)]))
  then replace node db:open("target_db")/root/element[@id = data($e/@id)] with $e
  else insert node $e into db:open("target_db")/root
)
I would write it as:
(: we want to open dbs once and only once because we're superstitious about that :)
let $sourceDB as document-node() := db:open('source_db')
let $targetDB as document-node() := db:open('target_db')

(: map id values to elements :)
let $sourceElemMap as map(xs:string, element(element)) := map:merge(
  for $each in $sourceDB/root/element
  let $key as xs:string := $each/@id/string()
  return map:entry($key, $each)
)

(: do it again; if the real data is this similar, this is a good candidate for a function :)
let $targetElemMap as map(xs:string, element(element)) := map:merge(
  for $each in $targetDB/root/element
  let $key as xs:string := $each/@id/string()
  return map:entry($key, $each)
)

(: get the ids as sequences of string values :)
let $sourceIds as xs:string+ := map:keys($sourceElemMap)
let $targetIds as xs:string+ := map:keys($targetElemMap)

(: do the update :)
for $new in $sourceIds
return
  if ($new = $targetIds)
  then replace node $targetElemMap($new) with $sourceElemMap($new)
  else insert node $sourceElemMap($new) into $targetDB/root
I think this does what you intend, and it might be more comprehensible to the optimizer. (I have this knack for confusing the optimizer, and it's made me cautious.)
It's not as elegant in expression and certainly not as compact. But it does somewhat separate the what-to-write and where-to-write steps, and gets the logic away from the path operator (which, by the spec, is obliged to do a lot; if it's not needed, I find it can help to avoid it).
With the following query, you can check if UPDINDEX is enabled in your database:
db:info('your-db')//updindex
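If it turns out to be disabled, one way to switch it on is via the options argument of db:optimize (a sketch; the exact set of accepted options is listed in the documentation):

db:optimize('your-db', true(), map { 'updindex': true() })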
But it’s a fine alternative to use maps, as proposed by Graydon.