Christian,

Thanks for all your responses. They truly help a lot.

re: Importing data into databases: For the scope of this POC, I realized I can just count the number of documents in each database (currently capped at 50) and keep creating new databases. The structure of the data is the same across documents, but it is nested in nature: a folder can contain a folder, which can contain a file, and so on. Usually it won't be more than 4 levels deep. Estimating the number of nodes from the byte size is a good tip. For the time being, I will move on and just store 50 documents per database.
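A minimal sketch of the fixed-size rollover described above, in Python. The database name prefix `poc_db_` and the function name are hypothetical, chosen only for illustration; only the limit of 50 documents per database comes from this mail. The sketch only computes which database each document would go to and does not talk to BaseX.

```python
from typing import Iterable, Iterator, Tuple

DOCS_PER_DB = 50  # limit currently used in the POC


def assign_databases(
    doc_names: Iterable[str],
    prefix: str = "poc_db_",  # hypothetical naming scheme
    docs_per_db: int = DOCS_PER_DB,
) -> Iterator[Tuple[str, str]]:
    """Yield (database_name, document_name) pairs.

    A new database name is started after every `docs_per_db` documents.
    """
    for index, doc_name in enumerate(doc_names):
        db_number = index // docs_per_db
        yield f"{prefix}{db_number:04d}", doc_name


if __name__ == "__main__":
    docs = [f"doc{i}.xml" for i in range(120)]
    for db_name, doc_name in assign_databases(docs):
        print(db_name, doc_name)
```

With 50 documents per database, the 2050 documents from the preliminary statistics below would be spread over 41 databases.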

re: Terabytes of data: I am planning to use ~6 months' worth of data for any analysis and to discard data older than that (leaving it in backups). I would obviously go the cloud route for such resources; we will see how much budget I can manage to get :) I am very positive about this. So no, as far as I can see it is not only a theoretical assumption.

re: Querying: I am currently looking into querying these databases and am exploring REST for it. From the documentation, it seems the only option for supporting these queries on the server side is XQuery or RESTXQ, not Java/Python? I am well versed in XPath and XSLT, and am gearing up for XQuery now. But it would be a little easier (just my personal preference :)) to manipulate the data in Java/Python before serving it back to the client. Is there any such facility? Something like:

"http://localhost:8984/rest?run=getData.java"

and similarly for Python?
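One possible alternative, sketched here under several assumptions: keep the XQuery small, send it through the REST `query` parameter, and do the manipulation in Python on the client or in a middle tier before serving the result. The database name `poc_db_0000`, the element names `folder` and `file`, the credentials, and the use of HTTP Basic authentication are all assumptions for illustration and are not taken from the BaseX setup discussed here. The query wraps its results in a single `<results>` element so that the response can be parsed as one XML document.

```python
import base64
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

REST_ROOT = "http://localhost:8984/rest"  # host and port as in the URL above


def build_query_url(database: str, xquery: str, rest_root: str = REST_ROOT) -> str:
    """Build a REST URL that passes the XQuery in the `query` parameter."""
    params = urllib.parse.urlencode({"query": xquery})
    return f"{rest_root}/{urllib.parse.quote(database)}?{params}"


def run_query(database: str, xquery: str, user: str, password: str) -> str:
    """Send the query to the server and return the response body as text.

    Assumes the server accepts HTTP Basic authentication.
    """
    request = urllib.request.Request(build_query_url(database, xquery))
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def count_by_tag(xml_text: str) -> dict:
    """Post-process the result in Python: count direct children per tag name."""
    root = ET.fromstring(xml_text)
    counts = {}
    for child in root:
        counts[child.tag] = counts.get(child.tag, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical database, element names, and credentials.
    xquery = "<results>{ //folder | //file }</results>"
    body = run_query("poc_db_0000", xquery, "admin", "admin")
    print(count_by_tag(body))
```

The network call is kept separate from the URL building and the post-processing, so the latter two can be tried without a running server.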

- Mansi

Some preliminary statistics: Imported 2050 XML documents in 22 min (including indexing on attributes).

On Sun, Oct 19, 2014 at 6:14 PM, Christian Grün <christian.gruen@gmail.com> wrote:

Hi Mansi,

> Is there some book/resource you can point me to, which helps better visualize NXD ?

Sorry for keeping you waiting. If you want to know more about native XML
databases, I recommend having a closer look at various articles
in our Wiki (e.g. [1,2]). It will also be helpful if you get into the
basics of XQuery [3].

Have you tried to realize some of the hints I gave in my previous mails?

> I am trying to distribute data across multiple databases. I can't distribute
> based on day, as there could very well be a situation where a single day's data
> could exceed the capacity of a BaseX DB.

If 2 billion XML nodes per day are not enough, you will probably need
to create more than one database per day. Via the "info db" command,
you see how many nodes are currently stored in a database, but there
is no cheap solution to find out the number of nodes of an incoming
document, because XML documents can be very heterogeneous. Some
questions back:

* Do you have some more information on the data you want to store?
* Are all documents similar or do they vary greatly? If the documents
are somewhat similar, you can usually estimate the number of nodes by
looking at the byte size.
* Do you know that you will really need to store lots of terabytes of
XML data, or is it more like a theoretical assumption?
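The byte-size estimate from the second question could be sketched roughly as follows, in Python. This is an illustration under assumptions, not how BaseX counts nodes internally: the node count here covers the document node, elements, attributes, and non-whitespace text, and ignores comments and processing instructions. The limit of 2,000,000,000 is the "2 billion" figure mentioned above; all function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

MAX_NODES_PER_DB = 2_000_000_000  # "2 billion" figure from the text above


def count_nodes(xml_text: str) -> int:
    """Approximate node count: document node, elements, attributes, text."""
    root = ET.fromstring(xml_text)
    total = 1  # the document node
    for element in root.iter():
        total += 1 + len(element.attrib)
        if element.text and element.text.strip():
            total += 1
        if element.tail and element.tail.strip():
            total += 1
    return total


def nodes_per_byte(samples: Iterable[str]) -> float:
    """Derive an average nodes-per-byte ratio from a few sample documents."""
    total_nodes = 0
    total_bytes = 0
    for sample in samples:
        total_nodes += count_nodes(sample)
        total_bytes += len(sample.encode("utf-8"))
    if total_bytes == 0:
        raise ValueError("at least one non-empty sample is required")
    return total_nodes / total_bytes


def estimate_nodes(byte_size: int, ratio: float) -> int:
    """Estimate the node count of an incoming document from its byte size."""
    return round(byte_size * ratio)


def fits(current_nodes: int, incoming_bytes: int, ratio: float,
         limit: int = MAX_NODES_PER_DB) -> bool:
    """Check whether the incoming document would still fit under the limit."""
    return current_nodes + estimate_nodes(incoming_bytes, ratio) <= limit
```

The estimate is only as good as the similarity of the documents: the ratio would be measured once on a representative sample and then applied to the byte size of each incoming file.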

Christian

[1] http://docs.basex.org/wiki/Database
[2] http://docs.basex.org/wiki/Table_of_Contents
[3] http://docs.basex.org/wiki/Xquery
