Christian,
Thanks for all your responses. They truly help a lot.
re: Importing data into databases: I realized that, for the scope of this POC, I will just count the number of docs in each database (currently programmed to be 50) and keep creating new databases. The structure of the data is the same, but it's nested in nature: a folder can contain a folder, which can contain a file, etc. Usually, it won't be more than 4 levels deep. That's a good tip, to guess the number of nodes based on byte size. For the time being, I will move on with just storing 50 docs per DB.
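The batching scheme above can be sketched in a few lines of Python. This is only an illustration: the database name prefix `poc_db_` and both helper names are hypothetical, and the actual import step is left out.

```python
from collections import defaultdict

DOCS_PER_DB = 50  # batch size mentioned above


def db_name_for(doc_index, prefix="poc_db_"):
    """Return the (hypothetical) database name for the doc at a 0-based index."""
    return f"{prefix}{doc_index // DOCS_PER_DB:04d}"


def assign_to_databases(doc_paths):
    """Group document paths into databases of at most DOCS_PER_DB docs each."""
    batches = defaultdict(list)
    for i, path in enumerate(doc_paths):
        batches[db_name_for(i)].append(path)
    return dict(batches)
```

Deriving the database name from the running document index avoids keeping a separate per-database counter.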
re: Terabytes of data: Well, I am planning on using ~6 months' worth of data for any analysis and discarding data prior to that (leaving it around in backups). Obviously, I would be going some cloud route for such resources; will see how much budget I can manage to get :) I am very positive about this. So no, it's not only a theoretical assumption as far as I can see.
re: Querying: Currently, I am looking into querying these databases, and I am exploring REST for it. From the documentation, it seems our only option for supporting these queries (on the server side) is XQuery or RESTXQ, with no Java/Python? I am well versed in XPath and XSLT, and gearing up towards XQuery now. But it would be a little easier (just my personal preference :)) to manipulate data in Java/Python before serving it back to the client. Is there any such facility? Something like:
Similarly for Python?
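As an illustration for the Python side, here is a minimal sketch of fetching a query result over REST and reshaping it on the client. The endpoint layout (a `query` URL parameter with HTTP basic auth), the element names `folder` and `file`, and the `name` attribute are all assumptions, not something confirmed by the documentation.

```python
import base64
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def fetch_xml(base_url, db, xquery, user, password):
    """Run a query over HTTP and return the response body as text.

    Assumes the server exposes GET <base_url>/<db>?query=... with HTTP
    basic auth; adjust to whatever the REST interface actually offers.
    """
    params = urllib.parse.urlencode({"query": xquery})
    url = f"{base_url}/{urllib.parse.quote(db)}?{params}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def list_file_paths(xml_text):
    """Flatten nested <folder>/<file> elements into slash-separated paths."""
    paths = []

    def walk(elem, prefix):
        for child in elem:
            name = child.get("name", "")
            if child.tag == "folder":
                walk(child, prefix + [name])
            elif child.tag == "file":
                paths.append("/".join(prefix + [name]))

    root = ET.fromstring(xml_text)
    # Include the root folder's own name in the path, if the root is a folder.
    walk(root, [root.get("name", "")] if root.tag == "folder" else [])
    return paths
```

The query itself would still run as XQuery on the server; only the reshaping of the result happens in Python.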
- Mansi
Some preliminary statistics: Imported 2050 XML documents in 22 min (including indexing on attributes).
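For reference, the throughput implied by these numbers can be worked out as follows. This is a back-of-the-envelope sketch that assumes all 2050 documents are spread over databases of 50 docs each, as described earlier in this message.

```python
DOCS = 2050        # documents imported
MINUTES = 22       # total import time, including attribute indexing
DOCS_PER_DB = 50   # batch size per database

docs_per_minute = DOCS / MINUTES            # about 93 docs per minute
seconds_per_doc = MINUTES * 60 / DOCS       # about 0.64 s per document
databases_needed = -(-DOCS // DOCS_PER_DB)  # ceiling division: 41 databases
```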