Thanks Christian.
Re: size of data, I am hoping some days will be quieter than discussed below. But yes, it's going to be a lot of data.
I just created a single database from ~190 XML files totalling 8.5 GB, with indexes activated. Creating the database with basexgui took close to an hour, and running a simple XQuery took ~3 min. The database was created on an external USB 3.0 HDD. I will obviously be creating new databases across drives to scale it (and if this POC is successful, I will surely move to the cloud).
For the time being, any and all tips to optimize performance are welcome.
Maybe I will soon contribute to the statistics page :)
- Mansi
On Tue, Oct 7, 2014 at 5:35 AM, Christian Grün christian.gruen@gmail.com wrote:
Dear Mansi,
- I have 1000s of XML files (each between 50MB-400MB) and this is going to grow exponentially (~200 / per day). So, my question is how scalable is BaseX ? Can I configure it to use data from my external HDD, in my initial prototype ?
So this means you want to add approx. 40 GB of XML files per day, right, amounting to roughly 14 TB/year? That is quite a lot indeed. You can have a look at our statistics page [1]; it gives you some insight into the current limits of BaseX.
However, all limits are per single database. You can distribute your data in multiple databases and address multiple databases with a single XPath/XQuery request. For example, you could create a new database every day and run a query over all these databases:
for $db in db:list() return db:open($db)/path/to/your/data
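If the databases follow a naming scheme, the same pattern can be narrowed to a subset of them. The following is only a sketch: the "logs-" prefix and the date format are hypothetical and not something BaseX prescribes, and the path is the same placeholder as above.

```xquery
(: Assumed naming scheme: one database per day, e.g. "logs-2014-10-07". :)
(: Only databases from October 2014 are opened and queried. :)
for $name in db:list()
where starts-with($name, 'logs-2014-10')
return db:open($name)/path/to/your/data
```

Restricting the set of databases this way keeps a query from touching older data that it does not need.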
- I plan to heavily use XPATH, for data retrieval. Does BaseX, use any multi-processing, multi-threading to speed up search ? Any concurrent processing ?
Read-only requests will automatically be multithreaded. If a single query leads to heavy I/O requests, it may be that single-threaded processing will give you better results (because hard drives are often not very good at reading data in parallel).
- Can I do some post-processing on searched and retrieved data ? Like sorting, unique elements etc ?
With XQuery (3.0), you can do virtually anything with your data. In most of our data-driven scenarios, all data processing is completely done in BaseX. Some plain examples can be found in our Wiki [2].
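As a rough illustration of sorting and unique values, something along these lines should work. The database name 'mydb' and the element names record, id and timestamp are placeholders, since the structure of your files is not known here.

```xquery
(: 'mydb', record, id and timestamp are assumed names for illustration. :)
let $records := db:open('mydb')//record
return (
  (: unique values of all id elements :)
  distinct-values($records/id),
  (: all records, newest first; timestamps are compared as strings :)
  for $r in $records
  order by $r/timestamp descending
  return $r
)
```

String comparison only gives a chronological order if the timestamps use a sortable format such as ISO 8601; otherwise they would need to be cast first, e.g. with xs:dateTime().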
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Statistics
[2] http://docs.basex.org/wiki/XQuery_3.0