Re: [basex-talk] improving query performance

22 Aug 2020

      This is vexing - it seems as though the mechanism that provides the
necessary "filtering" is the very thing that slows the execution down so
much. This wouldn't have been obvious from the single example document I
sent earlier, but each document stands alone: all of the searching and
reference linking done for each TrackRelease in a NewReleaseMessage should
only refer to other nodes in that same NewReleaseMessage.
In my query, I started out with "for $r in /ernm:NewReleaseMessage" and I
used $r on the right hand side of the subsequent for statements. It seems
like without that, the execution is quick, but all the results from every
document are getting matched to each other. With it, the results are
correct, but the execution time shoots way up. In case any of you still
have any patience for this question (and thanks again for everything so
far!), I've attached a small sample set of 6 documents. The desired number
of results from the query is 70 (which is the number of TrackReleases from
all the documents combined), and the query that I've adapted from
Christian's ddex2.xq which returns the right number of results is the
following:
declare namespace ernm = 'http://ddex.net/xml/ern/411';
(: declare context item := db:open('ddex'); :)
for $r in /ernm:NewReleaseMessage
for $party in $r/PartyList/Party[
  PartyReference/text() =
  $r/ReleaseList/TrackRelease/ReleaseLabelReference
]
for $track_release in $r/ReleaseList/TrackRelease[
  ReleaseLabelReference/text() =
  $r/PartyList/Party/PartyReference
]
for $sound_recording in $r/ResourceList/SoundRecording[
  ResourceReference/text() =
  $track_release/ReleaseResourceReference
]
for $release in $r/ReleaseList/Release[
ResourceGroup/ResourceGroup/ResourceGroupContentItem/ReleaseResourceReference/text()
=
  $track_release/ReleaseResourceReference
]
return <identity>
  <isrc>{ $track_release/ReleaseId/ISRC/text() }</isrc>
  <artist>{ fn:string-join($sound_recording/DisplayArtistName, '/')
}</artist>
  <title>{ $sound_recording/DisplayTitleText/text() }</title>
  <album>{ $release/DisplayTitleText/text() }</album>
  <icpn>{ $release/ReleaseId/ICPN/text() }</icpn>
  <sublabel>{ $party/PartyName/FullName/text() }</sublabel>
</identity>
On Sat, Aug 22, 2020 at 7:42 AM Christian Grün christian.gruen@gmail.com
wrote:
...
That's good to hear. My rewritten query was based on the query of your
first post, and I already guessed that all the nested loops are not really
wanted or required.
Looking forward to learning about your next insights,
Christian
Bill Osmond bill@with.in schrieb am Sa., 22. Aug. 2020, 16:31:
...
Great e-mail messages to wake up to! Thank you for the further
explanation Liam, and Christian the examples you provided were considerably
faster:

my fastest was 70k ms
your ddex.xq was 35kms
your ddex2.xq was 10kms!

There is only one issue: both ddex.xq and ddex2.xq seem to return many
more results than expected (cartesian product somewhere perhaps)
When I run the queries against a smaller database - one with just 6 of
the DDEX documents, my query returns 70 results which matches the number of
TrackReleases, but both ddex.xq and ddex2.xq return 303,134 results. It
looks like a separate "copy" of the output is being created for every Party
in the PartyList, when really there should be only one (specified by the
PartyReference). But this is very promising - if it takes 10 seconds to
return a massively expanded version of the data, then perhaps this will get
to <1000ms!
On Sat, Aug 22, 2020 at 4:07 AM Christian Grün christian.gruen@gmail.com
wrote:
...
Hi Bill,
Feel free to run the attached queries; maybe they give you a faster
result.
Your use case was interesting. It gave me some additional ideas on how
to speed up queries (by reordering consecutive 'for' clauses that do
not change the result).
Cheers,
Christian
On Sat, Aug 22, 2020 at 6:10 AM Liam R. E. Quin liam@fromoldbooks.org
wrote:
...
On Fri, 2020-08-21 at 17:28 -0700, Bill Osmond wrote:
...
I'm beginning to think that perhaps my performance hopes were a bit
too
inflated, given the size and complexity of our database. After a
fresh
optimization, and with -Xms2g -Xmx10g, the following query takes
1492ms:
[...]
First note - there are in fact no loops in your query. Although "for"
is used to introduce a loop in many procedural languages, it does nto
do so in XQuery (nor does for-each in XSLT).
In fact, it's closer to what SQL people know as a join.
It's making a stream of n-tuples, and then evaluating the inner
expression for each tuple, so that
for $a in (  'a', 'b', 'c')
  for $b in (1 to 5)
    return $a || '-' || $b
produces 15 lines of output,
a-1, a-2, 1-3, a-4, a-6, b-1, and so on.
You can see the BaseX query plan for your query already moves your
where clauses as i did by hand, because BaseX is awesome.
To make the query fast, you either need to reduce the number of tuples,
and henve the number of times the expressions are evaluated, or you
need to reduce the cost of creating the tuples.
Moving the where clauses was my attempt to reduce the number of tuples.
Adding an index might reduce the cost of making the tuples, so i'd
certainly try that.
If the input document is sorted, you might be able to construct
something recursively (e.g. with fold-left) or use grouping or
windowing to process $parties in groups, which may help considerably.
Without seeing the data, that's only a guess.
Liam
--
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:
http://www.fromoldbooks.org
...

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: [basex-talk] improving query performance