Hi all,
I'm working on a query that reads from two databases -- OriginalDB and UpdateDB. The structure of each database is the same:
<data>
  <item id="1">...</item>
  <item id="2">...</item>
  ...
</data>
OriginalDB has 2,305,570 items and UpdateDB has 307,019.
The query gets all distinct item ids from the two databases, then gets the item with a matching id (if there is one) from each database, and returns the id and both nodes in a JSON object. Here's the query:
let $originalDB := db:get('OriginalDB')/data/item
let $updateDB := db:get('UpdateDB')/data/item
for $id in distinct-values(($originalDB/@id, $updateDB/@id))
let $original := $originalDB[@id = $id]
let $update := $updateDB[@id = $id]
return json:serialize(
  map { 'id': xs:long($id), 'original': $original, 'update': $update },
  map { 'indent': 'no' }
)
In its current state this query is very slow -- it's returned only 35 JSON objects in ~30 minutes. How can I go about optimizing it to best take advantage of each database's indices?
Thanks in advance, Matt
Hi Matt,
Since I started using BaseX, I've had great experiences with map performance. Would this version work better, where you first build a map of all the items in each database and then look up the corresponding keys?
let $originalDB := db:get('OriginalDB')/data/item
let $updateDB := db:get('UpdateDB')/data/item
let $originalDB.map := map:merge(
  for $entry in $originalDB
  return map:entry($entry/@id, $entry)
)
let $updateDB.map := map:merge(
  for $entry in $updateDB
  return map:entry($entry/@id, $entry)
)
for $id in ($originalDB.map => map:keys(), $updateDB.map => map:keys()) => distinct-values()
return json:serialize(
  map { 'id': xs:long($id), 'original': $originalDB.map($id), 'update': $updateDB.map($id) },
  map { 'indent': 'no' }
)
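As a toy illustration of the pattern (with made-up in-memory elements, not your actual data): the map turns each id lookup into a cheap key access instead of a scan over all items.

let $items := (<item id="1">a</item>, <item id="2">b</item>, <item id="3">c</item>)
let $lookup := map:merge(
  (: the @id value is atomized and becomes the map key :)
  for $item in $items
  return map:entry($item/@id, $item)
)
return $lookup('2')  (: returns <item id="2">b</item> :)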
Best,
Ron
Hi Matt,
Ron has already given you a perfect answer on how to utilize maps. Another, slightly slower solution is to address the databases twice and let the compiler do its job:
let $originalDB := db:get('OriginalDB')/data/item
let $updateDB := db:get('UpdateDB')/data/item
for $id in distinct-values(($originalDB/@id, $updateDB/@id))
let $original := db:get('OriginalDB')/data/item[@id = $id]
let $update := db:get('UpdateDB')/data/item[@id = $id]
return json:serialize(
  map { 'id': xs:long($id), 'original': $original, 'update': $update }
)
This may seem counterintuitive at first glance, but the expression "db:get('OriginalDB')/data/item[@id = $id]" is something that the optimizer will rewrite for index access. It is equivalent to:
let $original := db:attribute("OriginalDB", $id)/self::attribute(id)/parent::item
let $update := db:attribute("UpdateDB", $id)/self::attribute(id)/parent::item
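As a quick standalone check of the index form (the value '42' here is just an example id, not one of your real ones):

(: should return the same <item/> as db:get('OriginalDB')/data/item[@id = '42'] :)
db:attribute('OriginalDB', '42')/self::attribute(id)/parent::item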
Hope this helps, Christian
Thank you Ron and Christian for the suggestions! I’ll give both a try. Will the map solution cause BaseX to hold every item from both databases in memory?
Best, Matt
The map only holds references to the database nodes. To learn more about the memory consumption of particular operations, you can attach a profiling function call to map:merge and check the output in the Info View panel of the GUI (or see what the command line returns):
let $originalDB.map := map:merge(
  for $entry in $originalDB
  return map:entry($entry/@id, $entry)
) => prof:memory()
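For instance, embedded in a runnable sketch of Ron's version (only the first map is profiled here, and the return is just a placeholder to force evaluation):

let $originalDB := db:get('OriginalDB')/data/item
let $originalDB.map := map:merge(
  for $entry in $originalDB
  return map:entry($entry/@id, $entry)
) => prof:memory()
return map:size($originalDB.map)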
Good to know, thanks again!