Just finished processing 310 GB of data, with a result set of 11 million records, in 44 minutes. I am currently psyched that BaseX can handle data at this scale, but I am no expert here.
What are your views on these performance statistics?
- Mansi

My assumption is that it basically boils down to a sequential scan of most of the elements in the database (so buying faster SSDs will probably be the safest way to speed up your queries). 310 GB is a lot, so 44 minutes is probably not that bad. Speaking for myself, though, I was sometimes surprised that other NoSQL systems I tried were not really faster than BaseX when the data is hierarchical and large amounts of it need to be post-processed.
However, as your queries look pretty simple, you could also have a look at e.g. MongoDB or RethinkDB (provided that the data can be converted to JSON). Those systems give you convenient Big Data features such as distribution/sharding and replication.
But I'm also interested in what others have to say about this.
Christian
On Sun, Jan 18, 2015 at 10:49 AM, Christian Grün <christian.gruen@gmail.com> wrote:
Hi Mansi,
http://localhost:8984/rest?run=get_query.xq&n=/Archives/*/descendant::c/...)
My guess is that most of the time is spent parsing all the nodes in the database. If you know more about the database structure, you could replace some of the descendant steps with explicit child steps. Apart from that, I guess I'm repeating myself, but have you tried removing the duplicates in XQuery, or doing the grouping and sorting in the language as well? It's usually advisable to do as much as possible in XQuery itself, although it might not be obvious how at first glance.
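For instance, assuming the database is named "Archives" and each c element carries an @id that identifies a record (both assumptions here, adjust to your actual structure), a query along these lines could do both at once:

  (: Sketch only: the database name "Archives" and the @id key are
     assumed; replace the child steps with your real element names. :)
  let $recs := db:open('Archives')/Archives/*/c  (: child steps instead of descendant::c :)
  for $rec in $recs
  group by $key := $rec/@id  (: one group per distinct @id: removes duplicates :)
  order by $key              (: sort inside XQuery as well :)
  return head($rec)          (: one representative record per key :)

Doing this server-side means you never ship 11 million records to the client just to deduplicate and sort them afterwards.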
Christian