Forwarding to the mailing list in order to share knowledge.
On Fri, Nov 12, 2021 at 1:41 PM BaseX Support support@basex.org wrote:
Hi France,
I’d need to get my hands on your code to tell you exactly where it’s best used, but I can give you some more details on the XQuery specification:
When creating new nodes in XQuery via node constructors [1], copies of all enclosed nodes will be created, and the copied nodes get new node identities. As a result, the following query yields false:
  let $a := <a/>
  let $b := <b>{ $a }</b>
  return $b/a is $a
This copying step can be very expensive and memory-consuming. If the option is disabled, child nodes will only be linked to their new parent nodes, and the query above returns true.
As the option changes the semantics of XQuery, it should preferably be applied locally via pragmas.
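For illustration, a pragma limits the changed semantics to a single expression rather than the whole query. A minimal sketch, following the pragma syntax described on the wiki page linked below:

  let $a := <a/>
  let $b := (# db:copynode false #) { <b>{ $a }</b> }
  return $b/a is $a  (: true here: $a is linked into $b instead of copied :)

Everything outside the braces keeps the standard copy semantics, so the option only affects the constructors you deliberately wrap.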
Best, Christian
PS: Mails to our mailing list are preferred; this way, other users might benefit from the replies as well.
[1] https://www.w3.org/TR/xquery-31/#id-constructors
On Fri, Nov 12, 2021 at 2:13 PM France Baril france.baril@architextus.com wrote:
Can you give me more information about how COPYNODE changes the behavior
of the XQuery, and where it is best used?
I see in the example that the pragma is on db:open. My process is:
- Read a document A from a DB called lang that has references to other
documents in the same DB lang (where lang is a 4-letter code for a locale).
- Merge all the references into document A to create an aggregate.
- Send the aggregate through multiple functions (that use
copy-modify-return), each of which resolves one type of reference (most references grab referenced content from a DB called global, but others grab it from the lang DB). These references do not grab entire documents, but smaller snippets within XML documents.
- Save the result in a DB called staging-lang (where lang is a 4-letter
code for a locale).
So should the pragma apply when reading the 1st document (1), when
reading the documents we aggregate into the 1st document (2), when grabbing the snippets (3), and/or when saving the end result in the staging DB (4)? Or maybe for all db:open() and db:attribute()/.. calls in this process?
On Fri, Nov 12, 2021 at 12:16 PM BaseX Support support@basex.org
wrote:
One more suggestion:
If node construction turns out to consume too much memory, it sometimes
helps to disable the COPYNODE option:
https://docs.basex.org/wiki/XQuery_Extensions#Database_Pragmas
France Baril france.baril@architextus.com schrieb am Fr., 12. Nov.
2021, 13:09:
Hi,
Thanks for your answer.
I tried rebuilding the document instead of using copies. With 3/4 of the reference-resolving functions implemented, I'm already at double the time I had before, so I will set that one aside as an unsuccessful alternative. If memory serves me correctly, we moved from a transform that rebuilds the document to a copy-modify-return approach over a year ago precisely to improve performance.
I will try grouping references with the same name, as in the example above, to limit the number of queries to the DB. If that still doesn't help, I will see if I can send you a good example without having to send too many of our files.
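For context, grouping here would mean one index lookup per distinct name instead of one per entry. A rough sketch of the idea, using a group by clause (the variable names mirror our resolve function; this is untested against our data):

  for $entry in $copy/descendant-or-self::*[@name-ref]
  group by $name := $entry/data(@name-ref)
  (: one index query for the whole group sharing this name :)
  let $targets := db:attribute('index-prompt-' || $lang, $name, 'name')/..
  (: inside the group, $entry is the sequence of all entries with this name :)
  for $e in $entry
  return replace node $e with <filter-group-inline>{ $targets }</filter-group-inline>

With 110 000 entries but far fewer distinct names, this should cut the number of db:attribute() calls substantially.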
We have a short-term solution where we removed some references-within-references, which substantially reduces the number of items to resolve (an 80% improvement), but it does impact the user experience, so we are still looking into code-based solutions as opposed to (or to use in conjunction with) content-based solutions.
On Fri, Nov 5, 2021 at 5:22 PM BaseX Support support@basex.org
wrote:
Hi France,
Do you have some sample data that allows us to test your code?
If documents are pretty large, it’s sometimes faster to rebuild a document with node constructors instead of performing updates on it.
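Rebuilding with node constructors usually means a recursive identity transform that reconstructs the tree while substituting the nodes you want changed. A minimal sketch (the function name and structure are illustrative, not from this thread):

  declare function local:rebuild($n as node()) as node() {
    typeswitch ($n)
      case element() return
        element { node-name($n) } {
          $n/@*,
          (: recurse into children; insert per-element substitutions here :)
          for $c in $n/node() return local:rebuild($c)
        }
      default return $n
  };

A single pass like this avoids the per-update bookkeeping of the pending update list, which is why it can win on large documents.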
Best, Christian ____________________________________
We have a query that looks like this:
declare function content-refs:resolve-prompt-refs-new(
  $node as node(),
  $lang as xs:string
) as node()* {
  let $result :=
    copy $copy := $node
    modify (
      let $entries := $copy/descendant-or-self::*[@name-ref]
        [name() = 'prompt-ref' or name() = 'gui-ctrl-ref' or
         name() = 'feature-ref' or name() = 'app-ref' (: or name() = 'screen-ref' :)]
      let $entries-hd := $copy/descendant-or-self::*[@id = 'T1700243243']
        /descendant-or-self::*[@name-ref]
        [name() = 'prompt-ref' or name() = 'gui-ctrl-ref' or
         name() = 'feature-ref' or name() = 'app-ref' (: or name() = 'screen-ref' :)]
      let $trace := trace('Prompts count: ' || count($entries))
      let $trace := trace('Prompts in Hardware diagram: ' || count($entries-hd))
      for $entry in $entries
      (: let $trace := trace('start processing entry') :)
      let $name := $entry/data(@name-ref)
      let $trace :=
        if (exists($entry/ancestor::*[@id = 'T1700243243']))
        then trace($name, ' Promptref ')
        else ()
      let $prompts-from-index :=
        db:attribute('index-prompt-' || $lang, $name, 'name')/..
        (: => prof:time('index prompt attr: ') :)
      (: let $prompts-from-index :=
           db:open('index-prompt-' || $lang)//*[@name = $name]
           => prof:time('index prompt open: ') :)
      let $prompts :=
        for $prompt in $prompts-from-index
        let $original-elem-name := $entry/self::*/name()
        let $new-elem-name :=
          switch ($original-elem-name)
            case 'prompt-ref' return $original-elem-name
            default return substring-before($original-elem-name, '-ref')
        return
          copy $prompt-renamed := $prompt
          modify (
            rename node $prompt-renamed as $new-elem-name
          )
          return $prompt-renamed
          (: => prof:time('index prompt new elem-name: ') :)
      let $new-node :=
        if (count($prompts) = 0)
        then
          <filter-group error="{ concat('No target found in for: ',
            $entry/name(), '/@name-ref=', $entry/@name-ref) }"/>
        else <filter-group-inline>{ $prompts }</filter-group-inline>
      let $trace := ('Ready to replace old entry with new-node')
      return replace node $entry with $new-node
      (: => prof:time('index prompt new node: ') :)
    )
    return $copy
    (: => prof:time('index prompt return copy: ') :)
  return $result
};
As you can see, we are using prof:time to see how quickly items are resolved. Querying the DB for each item goes fairly quickly (2 seconds). However, that last 'return $copy' line, after all the replacements are processed, takes between 11 and 25 minutes depending on the system. Memory usage is low, but CPU usage goes through the roof.
We are updating a little over 110 000 items in this operation, so it is a big operation on a file of about 89 000 indented lines. We are wondering if there is a way we could improve the performance. Before this operation occurs, we process the file multiple times to replace other items with very similar functions (copy-modify-return); they all go fairly quickly, so it does seem that the culprit is the number of items being replaced.
-- France Baril Architecte documentaire / Documentation architect france.baril@architextus.com