Hello,
I am doing some performance analysis on the size of XML files in the database, the number of records in a result set, and how long it takes to get results.
Currently, I have 150 GB worth of XML documents imported into BaseX. It took roughly 21 minutes to return a result set of 5.3 million records.
Queries are of the following form:
http://localhost:8984/rest?run=get_query.xq&n=/Archives/*/descendant::c/..., "xyz")]/@name/data()
XQuery file:

  for $db in db:list()
  (: Assign dynamic variables to generate the query, to be used in eval :)
  let $query := "declare variable $db external; " || "db:open($db)" || $n
  return xquery:eval($query, map { 'db': $db, 'query': $n })
I have a few questions around this.
1. I have routinely been advised on this mailing list to avoid serialization from XPath and to let XQuery handle it. I tried a few things, like replacing string() with data() and adding serialization options on the REST call and in the XQuery file, but I don't see any performance gain. Is there something else I can try, or something I am doing wrong? (A sketch of the kind of change I mean follows question 2.)
2. Does anyone have any resources comparing this performance to other NoSQL databases? I am just very curious how the above numbers compare to other DBs.
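For reference, a minimal sketch of the kind of change meant in question 1; the database and element names are hypothetical:

  (: Hypothetical names throughout: emit plain text instead of XML, and
     return typed attribute values via data() rather than serializing
     whole nodes and post-processing strings. :)
  declare option output:method 'text';
  declare option output:item-separator '&#10;';

  for $c in db:open('archive-db')//c[contains(@value, 'xyz')]
  return $c/@name/data()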
- Mansi
Hi Mansi,
http://localhost:8984/rest?run=get_query.xq&n=/Archives/*/descendant::c/...)
My guess is that most time is spent parsing all the nodes in the database. If you know more about the database structure, you could replace some of the descendant steps with explicit child steps. Apart from that, I guess I'm repeating myself, but have you tried to remove duplicates in XQuery, or to do grouping and sorting in the language? Usually, it's advisable to do as much as possible in XQuery itself (although it might not be obvious how to do this at first glance).
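For instance, a minimal sketch of both ideas (the path and names are hypothetical):

  (: Explicit child steps instead of descendant::, with deduplication
     and sorting done inside XQuery itself. Hypothetical names. :)
  let $names :=
    for $c in db:open('archive-db')/Archives/*/entry/c[contains(@value, 'xyz')]
    return $c/@name/data()
  for $name in distinct-values($names)
  order by $name
  return $name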
Christian
The structure of the data is nested, so unfortunately I have to write the queries this way. Also, I am doing the performance analysis with all external factors removed, such as post-processing and network latency, just to isolate whether I can do any better. So I guess this is the best I can do... no problem at all.
I just finished processing 310 GB of data, returning a result set of 11 million records within 44 minutes. I am currently psyched that BaseX even supports this kind of data volume, but I am no expert here.
What are your views on these performance statistics?
- Mansi
My assumption is that it basically boils down to a sequential scan of most of the elements in the database (so buying faster SSDs will probably be the safest way to speed up your queries). 310 GB is a lot, so 44 minutes is probably not that bad. Speaking for myself, though, I was sometimes surprised that other NoSQL systems I tried were not really faster than BaseX when the data structures are hierarchical and large amounts of data need to be post-processed.
However, as your queries look pretty simple, you could also have a look at e.g. MongoDB or RethinkDB (provided that the data can be converted to JSON). Those systems give you convenient Big Data features such as distribution/sharding and replication.
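As a minimal sketch of such a conversion (the element and key names are hypothetical), XQuery 3.1 can already serialize maps as JSON:

  (: Hypothetical names: turn an XML record into JSON via an XQuery 3.1
     map, e.g. for export to a JSON store such as MongoDB. :)
  let $record := <c name="widget" value="xyz"/>
  return serialize(
    map {
      'name' : string($record/@name),
      'value': string($record/@value)
    },
    map { 'method': 'json' }
  )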
But I'm also interested in what others have to say about this.
Christian
I do send across the simplest of queries, but they can be deeply nested as well... though not extremely complicated.
Our data sources and tools can be configured to return JSON, text, and CSV as well, since we didn't know which technology we were going to use for mining... Currently BaseX is working great for me, and I am absolutely biased.
However, I avoided other document-oriented databases because I couldn't really find any mature JSON querying libraries/technologies like XPath and XQuery... I tried 'jq'; it was good, but only for very basic queries on relatively flat data. Most importantly, I felt that with JSON, and even with jq, we needed to know the depth of the data in advance, and it would have to be the same throughout each document, which is never going to happen for our data.
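That depth-agnosticism is what the descendant axis gives you for free; a minimal sketch (names hypothetical):

  (: Hypothetical names: // finds every c element regardless of how
     deeply each individual document nests it. :)
  for $doc in db:open('archive-db')
  return $doc//c[@name = 'xyz']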
I have no expertise but a lot of interest in this field, so please correct me if any of my above views are flawed or not well researched.
- Mansi
As part of preparing to present at XML Prague, I am working on a slide showing statistics. From the earlier comments, I started wondering: would it be best to show time taken against the size of the DB, or against the number of nodes? What do you all think? If I frame it in terms of number of nodes, would that make for a better comparison with other tools? For example:
1 million records in a SQL database ~= 1 million nodes in BaseX, making for a closer apples-to-apples comparison of time taken.
We are currently battling with this at work too. There are a few different approaches to data mining for different data sources. I talk in terms of GBs of data in the database, and the SQL fans talk in terms of millions of records. It's hard to make any progress and push for NXDs.
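If it helps, a minimal sketch for getting such node counts out of BaseX (database names come from db:list()):

  (: Element count per database, to put next to SQL record counts. :)
  for $db in db:list()
  return $db || ': ' || count(db:open($db)//*)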
- Mansi
Hi Mansi,
I agree it is tempting to compare SQL and XML databases, but I would recommend that you refrain from doing so. My impression is that the systems are simply too different, so it will always be a comparison of apples and pears, oranges, or even toothpicks.
If your data is regular enough to be stored in a simple table (no hierarchies, all values with fixed types, etc.), you will never come close to a relational database, no matter which system you try. If you have hierarchies, things look different; but if you store such data relationally, the interesting question becomes which data model you go for. The pure number of datasets or even nodes does not really matter that much; it only begins to matter once all the other factors have been clarified such that the results can be correctly interpreted.
Well, benchmarking is a huge field in itself. I clearly see that there is a need to compare the two. I believe a practical approach is often more useful than a generic one:
* If you have real-life data, store this data in the systems of your choice
* Formulate real-life queries that give you similar results
* Compare the measurements to get an impression of the overall performance
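A minimal sketch of the measuring step, assuming BaseX's Profiling Module (prof:time prints the evaluation time to standard error; the names are hypothetical):

  (: Wrap a candidate query in prof:time() to log how long it takes. :)
  prof:time(
    count(db:open('archive-db')//c[contains(@value, 'xyz')])
  )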
Next, try to find bottlenecks, and optimize your data and queries for the specific systems. If you summarize the results, be careful about the presentation: "X is better than Y" sounds suspicious. Better to say "given these datasets, these queries, these xyz, X seems to be...". I once read an enthusiastic report about the performance of BaseX, claiming that it was much faster than X and Y. Obviously I was pleased to read that at first, but as the objective of that report was to find one system that is better than all the others, it was not helpful at all. It is perfectly fine, though, to say that system X proved, or feels, to be the better choice for a given scenario.
I hope my comments were not too lecturing...
Christian