Hello,
As a rather inexperienced XQuery and BaseX user, I'm puzzled by the timings of running the same XQuery on two very similar XML files.
For an academic paper, I'm trying to benchmark finding certain successive (windowed) <item> elements inside a <seq> element. This is the full query:
declare function local:check-sequence($items as element(item)*) as element(item)* {
  if ($items[1]/@x = $items[2]/@x and $items[2]/@x = $items[3]/@x)
  then if (sum($items/@y) < 0.5) then $items else ()
  else ()
};

for $seq in doc("data.xml")//seq
return <seq>{
  for $i in 1 to (count($seq/item) - 2)
  let $window := subsequence($seq/item, $i, 3)
  return <match>{local:check-sequence($window)}</match>
}</seq>
Here's the puzzling thing. When I run this on an XML file that has one <seq> with 10000 <item>s, i.e. <seqs><seq> containing 10000 <item> elements </seq></seqs>, I get the following reasonable timings:
- Parsing: 0.63 ms
- Compiling: 0.95 ms
- Optimizing: 30.59 ms
- Evaluating: 11.48 ms
- Printing: 0.61 ms
- Total Time: 44.27 ms
However, once I run the same query on an XML file with two (!) <seq> elements of 10000 items each, the runtime does not double but is now more than 50 times that of the first query:
- Parsing: 0.6 ms
- Compiling: 1.3 ms
- Optimizing: 48.13 ms
- Evaluating: 2843.67 ms
- Printing: 1.12 ms
- Total Time: 2894.82 ms
I don't see how I could be inducing exponential runtimes, and check-sequence in particular should run in constant time. Am I missing something fundamental here?
The result of the second version is about twice the size of the result of the first version.
Thanks in advance for any pointers.
Bernhard Liebl
On 15/07/2024 08:24, Bernhard Liebl wrote:
[…]
You would make it a lot easier to reproduce the results if you provided the sample data, perhaps in a repository.
The other question your description of the "window" raises is whether you know about the XQuery 3.1 window clause (https://www.w3.org/TR/xquery-31/#id-windows); perhaps that is a more suitable way to approach the task. Whether it solves your performance problems remains to be seen once there is sample data.
On Mon, 2024-07-15 at 08:24 +0200, Bernhard Liebl wrote:
for $seq in doc("data.xml")//seq
You could try changing this to doc("data.xml")/seqs/seq so that it doesn't have to look at every item element to see if it's a seq element. It'll also likely run faster if you load the data into the database instead of using doc(), as then there will presumably be an element index.
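For illustration, that change applied to the original query would be (an untested sketch, assuming the document root is <seqs> as in the sample):

(: same query as before, but with an explicit root step instead of the descendant search :)
for $seq in doc("data.xml")/seqs/seq
return <seq>{
  for $i in 1 to (count($seq/item) - 2)
  let $window := subsequence($seq/item, $i, 3)
  return <match>{local:check-sequence($window)}</match>
}</seq>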
I also notice the time to compile and optimize the query was longer for the second run. Maybe the computer was low on memory?
liam
Hi Bernhard,
The short answer is that BaseX exploits the fact that your input contains a single seq element and evaluates the query faster than a trivial evaluation would. With multiple sequences, this optimization does not come into play. I'll check if we can improve that; thank you [1].
To reduce the runtime complexity yourself, you can bind the items to an extra variable…
let $items := $seq/item
for $i in 1 to count($items) - 2
let $window := subsequence($items, $i, 3)
…or use the window clause [2], as suggested by Martin:
for sliding window $w in 1 to 5
  start at $s
  end at $e when $e - $s + 1 = 3
return array { $w }
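Applied to the original data, it might look roughly like this (an untested sketch; "only end" keeps just the complete three-item windows, matching the 1 to count($seq/item) - 2 loop above):

for $seq in doc("data.xml")//seq
return <seq>{
  (: slide a window of exactly three items over each sequence :)
  for sliding window $w in $seq/item
    start at $s when true()
    only end at $e when $e - $s = 2
  return <match>{local:check-sequence($w)}</match>
}</seq>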
Best, Christian
[1] https://github.com/BaseXdb/basex/issues/2316
[2] https://docs.basex.org/main/XQuery_4.0#window_clause
A new snapshot version of BaseX is available [1]:
• In a trivial implementation, the repeated evaluation of subsequence($seq/item, $i, 3) results in a repeated evaluation of $seq/item, which is expensive.
• In the current BaseX release, the result of $seq/item is already cached when it's evaluated multiple times, but caching is disabled if the context changes. This explains why multiple input sequences slow everything down.
• With the latest snapshot, the cache is updated with each context switch.
• If the context changes too often, caching is disabled (it's always a tradeoff whether to cache data or process it iteratively).
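For completeness, the manual workaround mentioned above (binding $seq/item to a variable once per sequence) would look roughly like this in the full query; again an untested sketch based on the original posting:

for $seq in doc("data.xml")//seq
return <seq>{
  (: evaluate $seq/item only once per sequence :)
  let $items := $seq/item
  for $i in 1 to count($items) - 2
  let $window := subsequence($items, $i, 3)
  return <match>{local:check-sequence($window)}</match>
}</seq>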
Hope this helps, Christian
[1] https://files.basex.org/releases/latest/
Just to report: with the latest BaseX snapshot, my timing issues have gone away, and querying multiple sequences behaves as expected.
Thank you!
Bernhard Liebl