Question was about set operations in BaseX. I had to calculate set difference on a sequence of values. XQuery has operator except, but it works only on nodes. I guess normal code for this is using distinct-values:
declare function local:difference-pred($a, $b) { distinct-values($a[not(.=$b)]) };
I used it on big (thousands of values) sets, and it was slow. Then I tried on maps using BaseX map-module :
declare function local:difference-map($a, $b) { let $m1 := map:new(for $i in $a return map:entry($i, true())) let $m2 := map:new(for $i in $b return map:entry($i, false())) let $m3 := map:new(($m1, $m2)) return for $i in map:keys($m3) return if ($m3($i)) then $i else () };
Then we found another solution, with only one map:
declare function local:difference-map-2($a, $b) { let $m2 := map:new(for $i in $b return map:entry($i, true())) return for $i in $a return if($m2($i)) then () else $i };
When trying them at same sequences:
let $a :=for $i in (1 to 100000) return if (random:double() < 0.01) then () else string($i) let $b := for $i in $a return if (random:double() < 0.45) then () else $i
return (count($a), count($b), count(prof:time( local:difference-pred($a, $b), true(), 'pred ')), count(prof:time( local:difference-map($a, $b), true(), 'map ')), count(prof:time( local:difference-map($a, $b), true(), 'map2 ')))
This gives times (on BaseX 7.6 running on OpenSUSE on VMWare virtual machine on a PC).
pred: 80468.68ms map: 261.23 ms map2: 253.77 ms
That is: map2 is 317 times faster than distinct-values version. I did not measure memory usage. Also on different size sequences, each one of the functions can be fastest!
-- Arto Viitanen Finland
Hej Arto,
just being interested: How does it compare to a FLOWR expression?
declare function local:difference-flowr($a, $b) { for $x in $a where not($x = $b) group by $x return $x[1] };
cheers Arve
Am 25.09.13 07:22 schrieb Arto Viitanen:
Question was about set operations in BaseX. I had to calculate set difference on a sequence of values. XQuery has operator except, but it works only on nodes. I guess normal code for this is using distinct-values:
declare function local:difference-pred($a, $b) { distinct-values($a[not(.=$b)]) };
I used it on big (thousands of values) sets, and it was slow. Then I tried on maps using BaseX map-module :
declare function local:difference-map($a, $b) { let $m1 := map:new(for $i in $a return map:entry($i, true())) let $m2 := map:new(for $i in $b return map:entry($i, false())) let $m3 := map:new(($m1, $m2)) return for $i in map:keys($m3) return if ($m3($i)) then $i else () };
Then we found another solution, with only one map:
declare function local:difference-map-2($a, $b) { let $m2 := map:new(for $i in $b return map:entry($i, true())) return for $i in $a return if($m2($i)) then () else $i };
When trying them at same sequences:
let $a :=for $i in (1 to 100000) return if (random:double() < 0.01) then () else string($i) let $b := for $i in $a return if (random:double() < 0.45) then () else $i
return (count($a), count($b), count(prof:time( local:difference-pred($a, $b), true(), 'pred ')), count(prof:time( local:difference-map($a, $b), true(), 'map ')), count(prof:time( local:difference-map($a, $b), true(), 'map2 ')))
This gives times (on BaseX 7.6 running on OpenSUSE on VMWare virtual machine on a PC).
pred: 80468.68ms map: 261.23 ms map2: 253.77 ms
That is: map2 is 317 times faster than distinct-values version. I did not measure memory usage. Also on different size sequences, each one of the functions can be fastest!
-- Arto Viitanen Finland _______________________________________________ BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
-----Original Message----- From: Arve Gengelbach [mailto:ag@basex.org] Sent: 25. syyskuuta 2013 9:08 To: Arto Viitanen Cc: basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Set Operator Examples
Hej Arto,
just being interested: How does it compare to a FLOWR expression?
declare function local:difference-flowr($a, $b) { for $x in $a where not($x = $b) group by $x return $x[1] };
pred 51787.06 ms map 485.16 ms map2 325.09 ms flowr 45275.97 ms
-- Arto Viitanen Finland
Hej Arto,
One can even improve your map2 by using the [Simple map operator].
declare function local:difference-map-2-excl($a, $b) { let $m2 := map:new($b ! map:entry(., true())) return $a ! (if($m2(.)) then () else .) };
- map 299.47 ms - map2 290.96 ms - map2 + ! 128.25 ms
cheers Arve
[Simple map operator] http://www.w3.org/TR/xquery-30/#id-map-operator
Am 25.09.13 10:41 schrieb Arto Viitanen:
-----Original Message----- From: Arve Gengelbach [mailto:ag@basex.org] Sent: 25. syyskuuta 2013 9:08 To: Arto Viitanen Cc: basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Set Operator Examples
Hej Arto,
just being interested: How does it compare to a FLOWR expression?
declare function local:difference-flowr($a, $b) { for $x in $a where not($x = $b) group by $x return $x[1] };
pred 51787.06 ms map 485.16 ms map2 325.09 ms flowr 45275.97 ms
-- Arto Viitanen Finland
Hi, Nice work. I wonder if the BaseX query optimizer could not spot the idioms for the set operations and rewrite the query to something similar for the cases:
distinct-values(($arg1, $arg2)) distinct-values($arg1[.=$arg2]) distinct-values($arg1[not(.=$arg2)])
Regards /Andy
On Wed, Sep 25, 2013 at 8:04 PM, Arve Gengelbach ag@basex.org wrote:
Hej Arto,
One can even improve your map2 by using the [Simple map operator].
declare function local:difference-map-2-excl($a, $b) { let $m2 := map:new($b ! map:entry(., true())) return $a ! (if($m2(.)) then () else .) };
- map 299.47 ms
- map2 290.96 ms
- map2 + ! 128.25 ms
cheers Arve
[Simple map operator] http://www.w3.org/TR/xquery-30/#id-map-operator
Am 25.09.13 10:41 schrieb Arto Viitanen:
-----Original Message----- From: Arve Gengelbach [mailto:ag@basex.org] Sent: 25. syyskuuta 2013 9:08 To: Arto Viitanen Cc: basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Set Operator Examples
Hej Arto,
just being interested: How does it compare to a FLOWR expression?
declare function local:difference-flowr($a, $b) { for $x in $a where not($x = $b) group by $x return $x[1] };
pred 51787.06 ms map 485.16 ms map2 325.09 ms flowr 45275.97 ms
-- Arto Viitanen Finland
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
I had a typo in my example. The return did not call the map2 version. With it, the results were (now on 7.7 and Windows 7)
- pred 58278.25 ms - map 272.68 ms - map2 74.27 ms - map2! 92.97 ms - flowr 43018.3 ms
Hej Arto,
One can even improve your map2 by using the [Simple map operator].
declare function local:difference-map-2-excl($a, $b) { let $m2 := map:new($b ! map:entry(., true())) return $a ! (if($m2(.)) then () else .) };
- map 299.47 ms - map2 290.96 ms - map2 + ! 128.25 ms
.. Arto
I had a typo in my example. The return did not call the map2 version. With it, the results were (now on 7.7 and Windows 7)
And I posted the wrong version. Usage of ! within map:new() is not as fast as a flowr expression (in general).
BUT neither of the “fastest two” functions yields valid results (e.g. take 2,2,3 for $a and empty sequence for $b) Arto, I hope replacing one flowr expression with the simple map operator can improve speed in your use-cases. So hopefully this is fastest for you:
declare function local:difference-maps($a, $b) { let $m1 := map:new(for $i in $a return map:entry($i, true())) let $m2 := map:new(for $i in $b return map:entry($i, false())) let $m3 := map:new(($m1, $m2)) return map:keys($m3) ! (if ($m3(.)) then . else ()) };
As the speed gain depends on the spreading of the data, rewriting set operations needs proove of efficency for any given data.
Thanks Arto for the interesting comparisons.
One more rewriting for map2 could be..
declare function local:map2($a, $b) { let $m2 := map:new($b ! map:entry(., true())) return $a[$m2(.)] };
..but my assumption is that all rewritings should yield similar performance, because the flwor expression, predicates and the map operator are internally basically evaluated the same.
@Andy:
distinct-values(($arg1, $arg2)) distinct-values($arg1[.=$arg2]) distinct-values($arg1[not(.=$arg2)])
the first query should be fast enough, but it may be beneficial to optimize the (non-)equality test in the next two queries. I’m not sure, though, if this optimization can be generalized that easily. ___________________________
2013/9/26 Arve Gengelbach ag@basex.org:
I had a typo in my example. The return did not call the map2 version. With it, the results were (now on 7.7 and Windows 7)
And I posted the wrong version. Usage of ! within map:new() is not as fast as a flowr expression (in general).
BUT neither of the “fastest two” functions yields valid results (e.g. take 2,2,3 for $a and empty sequence for $b) Arto, I hope replacing one flowr expression with the simple map operator can improve speed in your use-cases. So hopefully this is fastest for you:
declare function local:difference-maps($a, $b) { let $m1 := map:new(for $i in $a return map:entry($i, true())) let $m2 := map:new(for $i in $b return map:entry($i, false())) let $m3 := map:new(($m1, $m2)) return map:keys($m3) ! (if ($m3(.)) then . else ()) };
As the speed gain depends on the spreading of the data, rewriting set operations needs proove of efficency for any given data. _______________________________________________ BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
basex-talk@mailman.uni-konstanz.de