Hi all,
I'm working on a query that reads from two databases -- OriginalDB and UpdateDB. The structure of each database is the same:
<data>
  <item id="1">...</item>
  <item id="2">...</item>
  ...
</data>
OriginalDB has 2,305,570 items and UpdateDB has 307,019.
The query gets all distinct item ids from the two databases, then gets the item with a matching id (if there is one) from each database, and returns the id and both nodes in a JSON object. Here's the query:
let $originalDB := db:get('OriginalDB')/data/item
let $updateDB := db:get('UpdateDB')/data/item
for $id in distinct-values(($originalDB/@id, $updateDB/@id))
let $original := $originalDB[@id = $id]
let $update := $updateDB[@id = $id]
return json:serialize(
  map { 'id': xs:long($id), 'original': $original, 'update': $update },
  map { 'indent': 'no' }
)
In its current state this query is very slow -- it's returned only 35 JSON objects in ~30 minutes. How can I go about optimizing it to best take advantage of each database's indices?
Thanks in advance, Matt
Hi Matt,
Since I started using BaseX, I've had great experiences with map performance. Would this version work better, where you first build a map of all the items in each database and then look up the corresponding keys?
let $originalDB := db:get('OriginalDB')/data/item
let $updateDB := db:get('UpdateDB')/data/item
let $originalDB.map := map:merge(
  for $entry in $originalDB
  return map:entry($entry/@id, $entry)
)
let $updateDB.map := map:merge(
  for $entry in $updateDB
  return map:entry($entry/@id, $entry)
)
for $id in ($originalDB.map => map:keys(), $updateDB.map => map:keys()) => distinct-values()
return json:serialize(
  map { 'id': xs:long($id), 'original': $originalDB.map($id), 'update': $updateDB.map($id) },
  map { 'indent': 'no' }
)
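As a toy illustration of the pattern (with made-up in-memory elements, not your actual data): the map turns each id lookup into a cheap key access instead of a scan over all items.

let $items := (<item id="1">a</item>, <item id="2">b</item>, <item id="3">c</item>)
let $lookup := map:merge(
  (: the @id value is atomized and becomes the map key :)
  for $item in $items
  return map:entry($item/@id, $item)
)
return $lookup('2')  (: returns <item id="2">b</item> :)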
Best,
Ron
Hi Matt,
Ron has already given you a perfect answer on how to utilize maps. Another, slightly slower solution is to address the databases twice and let the compiler do its job:
let $originalDB := db:get('OriginalDB')/data/item
let $updateDB := db:get('UpdateDB')/data/item
for $id in distinct-values(($originalDB/@id, $updateDB/@id))
let $original := db:get('OriginalDB')/data/item[@id = $id]
let $update := db:get('UpdateDB')/data/item[@id = $id]
return json:serialize(
  map { 'id': xs:long($id), 'original': $original, 'update': $update }
)
This may seem counterintuitive at first glance, but the expression "db:get('OriginalDB')/data/item[@id = $id]" is something that the optimizer will rewrite for index access. It is equivalent to:
let $original := db:attribute("OriginalDB", $id)/self::attribute(id)/parent::item
let $update := db:attribute("UpdateDB", $id)/self::attribute(id)/parent::item
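As a quick standalone check of the index form (the value '42' here is just an example id, not one of your real ones):

(: should return the same <item/> as db:get('OriginalDB')/data/item[@id = '42'] :)
db:attribute('OriginalDB', '42')/self::attribute(id)/parent::item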
Hope this helps, Christian
Thank you Ron and Christian for the suggestions! I’ll give both a try. Will the map solution cause BaseX to hold every item from both databases in memory?
Best, Matt
The map only holds references to the database nodes. To learn more about the memory consumption of particular operations, you can attach a profiling function call to map:merge and check the output in the Info View panel of the GUI (or see what the command line returns):
let $originalDB.map := map:merge(
  for $entry in $originalDB
  return map:entry($entry/@id, $entry)
) => prof:memory()
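For instance, embedded in a runnable sketch of Ron's version (only the first map is profiled here, and the return is just a placeholder to force evaluation):

let $originalDB := db:get('OriginalDB')/data/item
let $originalDB.map := map:merge(
  for $entry in $originalDB
  return map:entry($entry/@id, $entry)
) => prof:memory()
return map:size($originalDB.map)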
Good to know, thanks again!