Hi,
I found that with the following code the trace shows that, for the two functions (FN) created, both have had "ul" inlined. I expected the first to get "ul" and the second "li", which is also what xpath-matches() receives (X).
declare function local:select($selectors as item()*) as function(node()*) as node()* {
  let $fns :=
    for $selector in $selectors
    return
      if ($selector instance of xs:string)
      then trace(local:xpath-matches(trace($selector, 'X: ')), 'FN: ')
      else $selector
  return function($nodes) {
    fold-left($fns, $nodes, function($nodes, $fn) { $fn($nodes) })
  }
};
declare function local:xpath-matches($selector as xs:string) {
  function($node as node()*) as node()* {
    xquery:eval($selector, map { '': $node })
  }
};
local:select(('ul','li'))(<ul><li>item</li></ul>)
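The behavior can be isolated without xquery:eval. Here is a minimal sketch (hypothetical, not from the original report) of the closure capture that inlining must preserve: each returned function should close over its own value of $s, not the first one.

```xquery
(: minimal sketch: with correct closure capture, the two functions
   return different values; with the reported inlining bug, both
   would report the first value :)
let $fns :=
  for $s in ('ul', 'li')
  return function() { $s }
return ($fns[1](), $fns[2]())
(: expected: ul li :)
```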
--Marc
... here are the trace messages
- X: ul
- FN: function($node_17 as node()*) as node()* { ((: node()*, true :)
  xquery:eval("ul", { "":$node_17 })) }
- X: li
- FN: function($node_17 as node()*) as node()* { ((: node()*, true :)
  xquery:eval("ul", { "":$node_17 })) }
--Marc
Hi Marc,
This is what I get with the current snapshot:
- X: ul
- FN: function($node_17 as node()*) as node()* { ((: node()*, true :)
  let $selector_18 := "ul" return xquery:eval($selector_18, { "":$node_17 })) }
- X: li
- FN: function($node_17 as node()*) as node()* { ((: node()*, true :)
  let $selector_18 := "li" return xquery:eval($selector_18, { "":$node_17 })) }
Did you use one of the more recent snapshots? Christian
On Sun, Nov 16, 2014 at 3:43 PM, Marc van Grootel marc.van.grootel@gmail.com wrote:
Hi Christian,
Duh, you're right. I didn't restart basexgui after replacing it with your latest snapshot: the command line used the newer snapshot, while basexgui was still running the old one.
--Marc
On Sun, Nov 16, 2014 at 5:05 PM, Christian Grün christian.gruen@gmail.com wrote:
Hello,
I love using BaseX and the power it offers. Currently I am able to query ~60 GB of XML files in under 2.5 minutes, and I still have a few more optimizations to try. I also see this data growing to a couple of TB shortly.
I would love to see this kind of processing become near real time (within a minute). So my question is: are there any discussions around supporting distributed processing, clusters of nodes, etc.?
- Mansi
Hi Mansi,
it's nice to hear that you have been successfully scaling your database instances so far.
I love using BaseX and the power it offers. Currently I am able to query ~60 GB of XML files in under 2.5 minutes, and I still have a few more optimizations to try. I also see this data growing to a couple of TB shortly.
I would love to see this kind of processing become near real time (within a minute). So my question is: are there any discussions around supporting distributed processing, clusters of nodes, etc.?
Yes, distributed processing is a frequently discussed topic. One of our major questions is which challenge to solve first. As you surely know, there are many different NoSQL stores out there, and all of them tackle different problems. Up to now, we have spent most of our time on replication, but this would not give you better performance.
So I would be interested to hear what kind of distribution techniques you believe would give you better performance. Do you think that a map/reduce approach would be helpful, or do you simply have lots of data that somehow needs to be sent to a client as quickly as possible? In other words, how large are your result sets? Do you really need the complete results, or would you rather draw some conclusions from the scanned data?
Back to the current technology… Maybe you could do some Java profiling (using e.g. -Xrunhprof:cpu=samples) in order to find out what the current bottleneck is.
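A hypothetical invocation, assuming a standalone basex.jar and a query file query.xq (the HPROF agent, available up to Java 8, writes its report to java.hprof.txt):

```
# sample-based CPU profiling of a single query run
java -Xrunhprof:cpu=samples,depth=16 -cp basex.jar org.basex.BaseX query.xq
```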
Best, Christian
Sorry about the delay. I was busy preparing a presentation for my company on BaseX as our analytics solution. It was very well received. All thanks to you and everyone on this user list :)
Based on my use cases, I believe (again, I am no expert in this domain) a map/reduce approach would work better. The result set returned would contain at most a couple of thousand records, with some post-processing applied, compared to the TBs of data being queried. If the querying and processing steps could use processing power from a cluster of nodes, maybe we would get a significant performance gain? What are your thoughts? What other use cases have you come across?
- Mansi
On Mon, Nov 17, 2014 at 10:50 AM, Christian Grün christian.gruen@gmail.com wrote:
Hi Mansi,
The other day, I came across this work [1] [2] by Darin McBeath that may be of interest. It uses Apache Spark [3] with Saxon. In principle, it looks like one could build something similar using the BaseX JAR in place of Saxon.
/Andy
[1] https://github.com/elsevierlabs/spark-xml-utils
[2] http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3C140793661...
[3] http://spark.apache.org/
On 20 November 2014 23:03, Mansi Sheth mansi.sheth@gmail.com wrote:
Hi Mansi,
I was busy preparing a presentation for my company on BaseX as our analytics solution. It was very well received.
Nice to hear!
[…] map/reduce […] If the querying and processing steps could use processing power from a cluster of nodes, maybe we would get a significant performance gain? What are your thoughts? What other use cases have you come across?
To answer that question, I invite you to have a look at the excellent master's thesis by Lukas Lewandowski [1].
Christian
[1] http://www.inf.uni-konstanz.de/gk/pubsys/publishedFiles/Lewandowski12.pdf
basex-talk@mailman.uni-konstanz.de