I'm trying to ingest a sequence of 6293 strings, each 8 characters long, and create a map in which each key is an integer from 1 to 63, and each value is a subsequence of the 6293 strings, from ($key+1) * 100) to that starting value + 99.
I wrote a function to do this recursively (so I thought), but it runs and runs (I've given it over 5 minutes), and so I can see I borked something. I wrote a function to output the starting and ending sequence indices for each execution of the function, but it (as the only change to the first function) quickly causes a stack overflow:
*Error:* Stack Overflow: Try tail recursion?
Since the recursive call was the last expression in the function, I thought I was doing tail recursion.
I'm running 11.6.0. The code with explanatory comments follows. I'd be grateful for any corrections and insights.
declare function local:write_results($elems as element()*, $count as xs:integer) { let $dir_path := '/home/cfbearden/projects/mesh_for_pure/20241220/data/' let $count_as_str := xs:string($count) let $content_out := <PubmedArticleSet>{$elems}</PubmedArticleSet> let $path := $dir_path || 'pmed2022_' || $count_as_str || '.xml' let $params := { 'method' : 'xml', 'indent' : 'yes' } return file:write($path, $elems, $params) };
declare function local:by_hundreds($pmids as xs:string*, $pmid_map as map(*)) as map(*) { let $key_count := count(map:keys($pmid_map)) return (: base case: each value except the last should have 100 items in it; the last should have <= 100; in this case we return the map goal is to build a map with 63 keys, with each value a sequence of 100 or 93 (in the last case) strings :) if ((($key_count + 1) * 100) >= count($pmids)) then $pmid_map else (: starting index for subsequence() :) let $start := 1 + ($key_count * 100) (: ending index for subsequence() :) let $end := $start + 99 (: value of the map key for this 100 items :) let $key := $key_count + 1 (: attempt to log all recurrent executions to files let $foo := local:write_results(<res><start>{$start}</start><end>{$end}</end></res>, $key_count) :) (: create a map with $key as key and the subsequence from n to n + 99 of $pmids and merge it with input accumulator map to make a new map :) let $pmid_map_new := map:merge($pmid_map, map { $key : subsequence($pmids, $start, $end) }) (: recur; the number of keys in $pmid_map_new will enable the next execution to determine which items from $pmids to use :) return local:by_hundreds($pmids, $pmid_map_new) };
(: The data format is simple: <DATA_RECORDS> <DATA_RECORD> <pub_uuid>0008980a-f1f3-40bf-8c13-cb0b79dfa81e</pub_uuid> <pmid>35732127</pmid> </DATA_RECORD> ... </DATA_RECORDS> :) let $pure_2022_db := collection('pure_2022')
(: yields a sequence of 6293 strings, all 8 characters long :) let $pmids := ( for $drec in $pure_2022_db/DATA_RECORDS/DATA_RECORD let $pmid := $drec/pmid/text() return $pmid )
let $result := local:by_hundreds($pmids, {}) return $result
All the best, Chuck Bearden
Hello Chuck,
Maybe I will have time to look at your code later, but the first thing that came to mind was to use `group by`:
map:merge( for $pmidset at $i in $pmids group by $chunk as xs:integer := ($i idiv 100) return map:entry($chunk, $pmidset) )
This should give you the map you expect.
Best regards, Nico Verwer
Hello Chuck,
The reason why you have infinite recursion is, that the signature of `map:merge` is
map:merge($maps as map(*)*, $options as map(*)) as map(*)
The first parameter is a sequence of maps, and the second parameter is a map with options. This means that in your code
map:merge($pmid_map, map { $key : subsequence($pmids, $start, $end) })
the map containing the subsequence is interpreted as a (meaningless) map with options, and you only keep the empty map that you started with. To fix this, insert parentheses like so:
map:merge( ( $pmid_map , map {$key : subsequence($pmids, $start, $end)} ) )
The $options parameter is optional, so I omitted it. However, there are more mistakes, which I will go into below. And a better approach is to use the /for - group by - return/ that I posted earlier. This is a lot faster, more concise, and more readable in my opinion. My comments below are only meant to highlight some learning points.
declare function local:by_hundreds($pmids as xs:string*, $pmid_map as map(*)) as map(*) { let $key_count := count(map:keys($pmid_map))
Better is map:size($pmid_map)
return (: base case: each value except the last should have 100 items in it; the last should have <= 100; in this case we return the map goal is to build a map with 63 keys, with each value a sequence of 100 or 93 (in the last case) strings :) if ((($key_count + 1) * 100) >= count($pmids)) then $pmid_map
This means that you will miss the last 93 strings, because /$key_count * 100/ will be less then /count($pmids)/, so there are strings left.
else (: starting index for subsequence() :) let $start := 1 + ($key_count * 100) (: ending index for subsequence() :) let $end := $start + 99 (: value of the map key for this 100 items :) let $key := $key_count + 1
If you pass a subsequence of $pmids as the first parameter, instead of passing the whole list every time, there is less computation to do, and the base case becomes easier to detect.
(: attempt to log all recurrent executions to files let $foo := local:write_results(<res><start>{$start}</start><end>{$end}</end></res>, $key_count) :) (: create a map with $key as key and the subsequence from n to n
- 99
of $pmids and merge it with input accumulator map to make a new map :) let $pmid_map_new := map:merge($pmid_map, map { $key : subsequence($pmids, $start, $end) })
Since there is only one key in this map, it is better to use /map:entry($key, subsequence($pmids, $start, $end))/.
(: recur; the number of keys in $pmid_map_new will enable the next execution to determine which items from $pmids to use :) return local:by_hundreds($pmids, $pmid_map_new) };
The recursive pattern you use can be written more efficiently as a fold-left. See the excellent XQuery book by Priscilla Walmsley for examples.
But as I said, the conventional way to do this in XQuery would be
map:merge( for $pmidset at $i in $pmids group by $chunk as xs:integer := ($i idiv 100) return map:entry($chunk, $pmidset) )
Happy XQuerying!
Best regards, Nico Verwer
Thank you Nico for taking the time to analyze and comment on my code. I will study your answer, but not this evening :-) Much appreciated!
All the best, Chuck
On Fri, Dec 20, 2024 at 1:12 PM Nico Verwer (Rakensi) nverwer@rakensi.com wrote:
Hello Chuck,
The reason why you have infinite recursion is, that the signature of `map:merge` is
map:merge($maps as map(*)*, $options as map(*)) as map(*)
The first parameter is a sequence of maps, and the second parameter is a map with options. This means that in your code
map:merge($pmid_map, map { $key : subsequence($pmids, $start, $end) })
the map containing the subsequence is interpreted as a (meaningless) map with options, and you only keep the empty map that you started with. To fix this, insert parentheses like so:
map:merge( ( $pmid_map , map {$key : subsequence($pmids, $start, $end)} ) )
The $options parameter is optional, so I omitted it. However, there are more mistakes, which I will go into below. And a better approach is to use the *for - group by - return* that I posted earlier. This is a lot faster, more concise, and more readable in my opinion. My comments below are only meant to highlight some learning points.
declare function local:by_hundreds($pmids as xs:string*, $pmid_map as map(*))
as map(*) { let $key_count := count(map:keys($pmid_map))
Better is map:size($pmid_map)
return (: base case: each value except the last should have 100 items in it; the last should have <= 100; in this case we return the map goal is to build a map with 63 keys, with each value a sequence of 100 or 93 (in the last case) strings :) if ((($key_count + 1) * 100) >= count($pmids)) then $pmid_map
This means that you will miss the last 93 strings, because *$key_count * 100* will be less then *count($pmids)*, so there are strings left.
else (: starting index for subsequence() :) let $start := 1 + ($key_count * 100) (: ending index for subsequence() :) let $end := $start + 99 (: value of the map key for this 100 items :) let $key := $key_count + 1
If you pass a subsequence of $pmids as the first parameter, instead of passing the whole list every time, there is less computation to do, and the base case becomes easier to detect.
(: attempt to log all recurrent executions to files let $foo :=
local:write_results(<res><start>{$start}</start><end>{$end}</end></res>, $key_count) :) (: create a map with $key as key and the subsequence from n to n + 99 of $pmids and merge it with input accumulator map to make a new map :) let $pmid_map_new := map:merge($pmid_map, map { $key : subsequence($pmids, $start, $end) })
Since there is only one key in this map, it is better to use *map:entry($key, subsequence($pmids, $start, $end))*.
(: recur; the number of keys in $pmid_map_new will enable the next execution to determine which items from $pmids to use :) return local:by_hundreds($pmids, $pmid_map_new)
};
The recursive pattern you use can be written more efficiently as a fold-left. See the excellent XQuery book by Priscilla Walmsley for examples.
But as I said, the conventional way to do this in XQuery would be
map:merge( for $pmidset at $i in $pmids group by $chunk as xs:integer := ($i idiv 100) return map:entry($chunk, $pmidset) )
Happy XQuerying!
Best regards, Nico Verwer
Hi Nico,
Thank you again for your detailed responses. I'm just getting back into the office. I do appreciate your taking the time to analyze my code.
Your for loop to replace my recursive apparatus taught me two things: I was previously unfamiliar with the at clause in the for loop, and I didn't know the idiv operator before. Thank you!
Regarding the signature for map:merge(): Ouch! I have used that function many times. I should have known better.
Thank you also for the advice on using the for loop instead of the recursive approach.
Regarding
return (: base case: each value except the last should have 100 items in it; the last should have <= 100; in this case we return the map goal is to build a map with 63 keys, with each value a sequence of 100 or 93 (in the last case) strings :) if ((($key_count + 1) * 100) >= count($pmids)) then $pmid_map
I thought because I was multiplying $key_count + 1, rather than $key_count, by 100, the condition would be true after the last 93 elements of the sequence. I thought there would be 63 key/value pairs after the final 93 are added, so the base case would be triggered by
(63 + 1) * 100 >= 6293
Perhaps I'm off by one. Though I've abandoned the recursive function altogether in favor of your for loop.
Again thanks, Chuck
basex-talk@mailman.uni-konstanz.de