This is vexing - it seems as though the mechanism that provides the necessary "filtering" is the very thing that slows the execution down so much. This wouldn't have been obvious from the single example document I sent earlier, but each document stands alone: all of the searching and reference linking done for each TrackRelease in a NewReleaseMessage should only refer to other nodes in that same NewReleaseMessage.
In my query, I started out with "for $r in /ernm:NewReleaseMessage" and I used $r on the right hand side of the subsequent for statements. It seems like without that, the execution is quick, but all the results from every document are getting matched to each other. With it, the results are correct, but the execution time shoots way up. In case any of you still have any patience for this question (and thanks again for everything so far!), I've attached a small sample set of 6 documents. The desired number of results from the query is 70 (which is the number of TrackReleases from all the documents combined), and the query that I've adapted from Christian's ddex2.xq which returns the right number of results is the following:
declare namespace ernm = 'http://ddex.net/xml/ern/411'; (: declare context item := db:open('ddex'); :)
for $r in /ernm:NewReleaseMessage
for $party in $r/PartyList/Party[ PartyReference/text() = $r/ReleaseList/TrackRelease/ReleaseLabelReference ] for $track_release in $r/ReleaseList/TrackRelease[ ReleaseLabelReference/text() = $r/PartyList/Party/PartyReference ] for $sound_recording in $r/ResourceList/SoundRecording[ ResourceReference/text() = $track_release/ReleaseResourceReference ] for $release in $r/ReleaseList/Release[
ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference/text() = $track_release/ReleaseResourceReference ] return <identity> <isrc>{ $track_release/ReleaseId/ISRC/text() }</isrc> <artist>{ fn:string-join($sound_recording/DisplayArtistName, '/') }</artist> <title>{ $sound_recording/DisplayTitleText/text() }</title> <album>{ $release/DisplayTitleText/text() }</album> <icpn>{ $release/ReleaseId/ICPN/text() }</icpn> <sublabel>{ $party/PartyName/FullName/text() }</sublabel> </identity>
On Sat, Aug 22, 2020 at 7:42 AM Christian Grün christian.gruen@gmail.com wrote:
That's good to hear. My rewritten query was based on the query of your first post, and I already guessed that all the nested loops are not really wanted or required.
Looking forward to learning about your next insights, Christian
Bill Osmond bill@with.in schrieb am Sa., 22. Aug. 2020, 16:31:
Great e-mail messages to wake up to! Thank you for the further explanation Liam, and Christian the examples you provided were considerably faster:
- my fastest was 70k ms
- your ddex.xq was 35kms
- your ddex2.xq was 10kms!
There is only one issue: both ddex.xq and ddex2.xq seem to return many more results than expected (cartesian product somewhere perhaps)
When I run the queries against a smaller database - one with just 6 of the DDEX documents, my query returns 70 results which matches the number of TrackReleases, but both ddex.xq and ddex2.xq return 303,134 results. It looks like a separate "copy" of the output is being created for every Party in the PartyList, when really there should be only one (specified by the PartyReference). But this is very promising - if it takes 10 seconds to return a massively expanded version of the data, then perhaps this will get to <1000ms!
On Sat, Aug 22, 2020 at 4:07 AM Christian Grün christian.gruen@gmail.com wrote:
Hi Bill,
Feel free to run the attached queries; maybe they give you a faster result.
Your use case was interesting. It gave me some additional ideas on how to speed up queries (by reordering consecutive 'for' clauses that do not change the result).
Cheers, Christian
On Sat, Aug 22, 2020 at 6:10 AM Liam R. E. Quin liam@fromoldbooks.org wrote:
On Fri, 2020-08-21 at 17:28 -0700, Bill Osmond wrote:
I'm beginning to think that perhaps my performance hopes were a bit too inflated, given the size and complexity of our database. After a fresh optimization, and with -Xms2g -Xmx10g, the following query takes 1492ms:
[...]
First note - there are in fact no loops in your query. Although "for" is used to introduce a loop in many procedural languages, it does nto do so in XQuery (nor does for-each in XSLT).
In fact, it's closer to what SQL people know as a join.
It's making a stream of n-tuples, and then evaluating the inner expression for each tuple, so that
for $a in ( 'a', 'b', 'c') for $b in (1 to 5) return $a || '-' || $b
produces 15 lines of output, a-1, a-2, 1-3, a-4, a-6, b-1, and so on.
You can see the BaseX query plan for your query already moves your where clauses as i did by hand, because BaseX is awesome.
To make the query fast, you either need to reduce the number of tuples, and henve the number of times the expressions are evaluated, or you need to reduce the cost of creating the tuples.
Moving the where clauses was my attempt to reduce the number of tuples. Adding an index might reduce the cost of making the tuples, so i'd certainly try that.
If the input document is sorted, you might be able to construct something recursively (e.g. with fold-left) or use grouping or windowing to process $parties in groups, which may help considerably.
Without seeing the data, that's only a guess.
Liam
-- Liam Quin, https://www.delightfulcomputing.com/ Available for XML/Document/Information Architecture/XSLT/ XSL/XQuery/Web/Text Processing/A11Y training, work & consulting. Barefoot Web-slave, antique illustrations: