Hi Mansi,
it's nice to hear that you have been successfully scaling your database instances so far.
> I love using BaseX and the powers of BaseX. Currently I am able to query ~60GB of XML files in under 2.5 mins. I still have a few more optimizations to try. I also see this data increasing to a couple of TB shortly.
> I would love to see if this kind of processing can be almost real time (within a min). So my question is: are there any discussions around supporting distributed processing, clusters of nodes, etc.?
Yes, distributed processing is a frequently discussed topic. One of our major questions is which challenge to solve first. As you surely know, there are many different NoSQL stores out there, and all of them tackle different problems. Up to now, we have spent most of our time on replication, but this would not give you better performance.
So I would be interested to hear what kind of distribution techniques you believe would give you better performance. Do you think that a map/reduce approach would be helpful, or do you simply have lots of data that somehow needs to be sent to a client as quickly as possible? In other words, how large are your result sets? Do you really need the complete results, or would you rather draw some conclusions from the scanned data?
Back to the current technology… Maybe you could do some Java profiling (using e.g. -Xrunhprof:cpu=samples) in order to find out where the current bottleneck is.
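As a rough sketch, a profiled standalone run could look like the call below. The JAR path and the query file name are placeholders for your own setup, and the depth value is just a suggestion. Note that the HPROF agent is only available on Java 8 and earlier.

```shell
# Placeholders (assumptions): path/to/BaseX.jar and query.xq stand for your own files.
# cpu=samples : sample the running threads periodically instead of instrumenting every call
# depth=10    : record up to 10 stack frames per sample (the default is lower)
java -Xrunhprof:cpu=samples,depth=10 \
     -cp path/to/BaseX.jar org.basex.BaseX query.xq

# When the JVM exits, HPROF writes its report to java.hprof.txt in the
# working directory; the ranked "CPU SAMPLES" table is at the end of that file.
```

The methods at the top of that table are the ones in which most samples were taken, so they are a good first hint at where the time goes.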
Best, Christian