BaseX Capacity


Rajabrata Chaudhuri

26 Mar 2013

4:22 p.m.

Hello,

First I'd like to thank you guys for all your great work on BaseX. I am fairly familiar with XML DBs and have done a significant amount of development on top of MarkLogic. I would like to ask some questions about capacity and scalability. I have reviewed the documentation and see that the biggest store is for SDMX, at approximately 8000 GB. I am trying to understand better what this means and would appreciate your expert advice on the questions below:

1. Is the expectation that you can query against 8 TB of XML data efficiently?
2. My requirement will be to query across probably 24 TB of XML data. Do you guys feel this is possible?
3. What is the method to scale horizontally and vertically? I.e., would I be adding more servers, or starting more instances, etc.?
4. How does high availability work? I.e., can I have multiple active-active nodes, or should it be active-passive, etc.?

Any help anyone can render is greatly appreciated.

Thanks Raj


Fabrice Etanchaud

27 Mar

3:56 a.m.

Hi Raj,

...

From what I read at http://docs.basex.org/wiki/Statistics, the SDMX collection size is 8,008 MB, about 8 GB, not TB.

Best, Fabrice Questel-Orbit.


Dirk Kirsten

28 Mar

5:27 a.m.

Hello Raj,

thanks for your interest in BaseX.

You can see the current upper limits of BaseX at [1]. As you can see there, the current upper file size limit is 512 GiB per database. However, you can always distribute your data across several databases, as databases in BaseX are a fairly lightweight concept and you can access multiple databases within one XQuery expression. So, theoretically, you can store terabytes of data.
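To make the arithmetic behind this limit concrete, here is a minimal Python sketch that estimates the smallest number of databases needed for a given data volume. It assumes the 512 GiB per-database figure quoted above and, for simplicity, reads the "TB" figures from the question as binary tebibytes; the function name `min_databases` is purely illustrative and not part of BaseX.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

# Per-database limit quoted in this thread (assumption: taken at face value).
DB_LIMIT_BYTES = 512 * GIB


def min_databases(total_bytes: int, limit_bytes: int = DB_LIMIT_BYTES) -> int:
    """Return the smallest number of databases so that none exceeds the limit."""
    if total_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-total_bytes // limit_bytes)


if __name__ == "__main__":
    for label, size in (("8 TiB", 8 * TIB), ("24 TiB", 24 * TIB)):
        print(label, "needs at least", min_databases(size), "databases")
```

This is only a lower bound: it says nothing about query performance, and a real layout would likely use more, smaller databases.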

However, whether query execution against such a large database will be efficient is very difficult to tell. It depends heavily on the type of query you want to run, but personally I would not expect blazing performance. But again, this is very hard to tell.

Scaling out and replication are currently not supported by BaseX. Of course you can always use some kind of distributed file system to physically distribute your data, but BaseX itself is not doing this for you. You could also start several BaseX servers and store certain data on specific servers, but there will be no synchronization of any kind. However, we would love to change this, and this is actually my current project.

I gave a short talk about our plans at our user meet-up at XML Prague. You can see the slides at [2] (hopefully the videos will be there as well soon). So, we are interested in scaling out and replication, and I am therefore also very interested in real-world use cases. I would appreciate it if you could tell me more about your specific requirements (either by private mail or on the mailing list), so that in the end we will have a solution that is usable in the real world.

Cheers, Dirk

[1] http://docs.basex.org/wiki/Statistics
[2] http://files.basex.org/xmlprague2013/

BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

--
Dirk Kirsten, BaseX GmbH, http://basex.org
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22

Rajabrata Chaudhuri

2:19 p.m.

Hi Dirk,

Thanks for responding to challenges. Just to clarify: when you say upper file size limit, are you referring to the individual files? I only ask because I saw a DB limit of "Unlimited", so I was uncertain of the distinction, but thought it probably meant there is no hard limit on the overall DB size. In my case, the individual files themselves are fairly small, but my total DB size will grow to about 24 TB. Do you see any issues with this in terms of capacity and being able to query fairly quickly across the whole set, assuming of course that my XQuery is tuned? If 512 GB is the DB size limit, I would be curious to learn what dictates that limit, and how I could help.

In terms of scaling, it sounds like you are saying I can just go to a shared file system and have several BaseX instances pointing to that file system. Then, as requests come in, I would direct them to specific instances. Would this not be a problem for write updates? I.e., is there write locking that will prevent two threads from simultaneously updating a document with the same GUID (I am assuming there is a universal ID for each document)? Perhaps that is part of your current project?

Give me a couple of days and I will write you a detailed brief on my real-world use case. Thanks for all your advice and help!

Thanks Raj


Rajabrata Chaudhuri

6 May

2:09 a.m.

Hey Dirk,

I thought I would get back to you on my use case today. In itself, the use case is not really different from any other HA requirement, i.e. a solution that supports 100% uptime, which to me is only possible by ensuring that multiple instances of everything can point to the same data. Then, when an instance of anything (server, virtual machine, network, etc.) goes down, the end user is not affected.

As for my real-world requirements, I am unsure how much detail you'd like me to go into, but here is a summary of my use case. I would like to use BaseX to aggregate different XML documents from different sources and be able to query across them for analytic data. The use case is somewhat MDM in nature. As an example, I would like to put in sales lead documents submitted from a website, analytic usage documents from a different website, and product information from an internal system, and run XQuery across the various collections to determine whether a particular product was viewed more effectively from one website than from the other. Does that make sense?

From an HA standpoint, I would like to have multiple instances of BaseX (perhaps up to 5) and have them share data. If one goes down, the others should not feel any effect. In short, a true cluster where each instance is aware of the others and all share the same store. Is the best way to do this to just put the documents on super fast shared storage that every instance can access? I wonder if queuing should be a consideration here.

One other quick question: do you think tuned queries will even work across 8 to 10 TB of data? Please tell me if you think this is a viable solution. I need to store 3 years of data. Each year is approximately 8 TB. First of all, do you think I can even store 8 TB? I was thinking I could separate each year into a different store. That way, in the rarer cases where a previous year's information is required, a slower query can take its time to run across the multiple databases and instances. What do you think of this as a possibility?

Any ideas you have are greatly appreciated.

Thanks Raj


Dirk Kirsten

5:16 a.m.

Hello Raj,

sorry for not responding earlier, I guess your email before simply slipped my mind. I am NOT getting any younger, it seems ;)

The 512 GB you mention is the aggregated file size of the original input documents. The storage layout is of course compressed, but you will still be limited by the file size. However, this applies per database and you are free to create more databases. Using XQuery you can also query them within a single query, so it may be easier to think of a 'database' in BaseX as something more like a 'collection'. It is pretty lightweight. So if you want to store 10 TB, you should definitely split that up into several databases. Splitting up by year seems very reasonable, but as you mention 3 years and 10 TB, it might be better to split it up at an even finer granularity, i.e. by month. I guess this will also speed up your queries, as you probably don't use all the data in each query but only certain time periods (e.g. how many views from a website for a certain product in the last three months: with one database per month you would have to access just three databases, each with far less data in it).
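As a rough illustration of the per-month split, the following Python sketch works out which monthly databases a "last N months" query would need to touch. The naming scheme (`sales_YYYY_MM`), the `sales` prefix, and both function names are hypothetical choices made for this example; nothing here is prescribed by BaseX.

```python
from datetime import date
from typing import List


def month_db_name(year: int, month: int, prefix: str = "sales") -> str:
    """Build a database name for one month (hypothetical naming scheme)."""
    return f"{prefix}_{year:04d}_{month:02d}"


def databases_for_last_months(today: date, months: int,
                              prefix: str = "sales") -> List[str]:
    """List the monthly databases covering the last `months` months,
    oldest first, including the month that contains `today`."""
    names = []
    year, month = today.year, today.month
    for _ in range(months):
        names.append(month_db_name(year, month, prefix))
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return list(reversed(names))


if __name__ == "__main__":
    # A three-month report only needs three of the monthly databases.
    print(databases_for_last_months(date(2013, 5, 6), 3))
```

The resulting list of names could then be passed to whatever query layer opens the databases, so that a three-month report never reads the other 33 months of a three-year archive.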

Your use case of using BaseX to aggregate data from different sources seems like a perfect fit. BaseX basically doesn't care where the documents come from. Once you put them into the database (using REST, RESTXQ or any other API), you can easily query them later.

However, the crucial point here is your need for HA. Using a distributed file system (which handles write locking at the file system level) you can work around this to some extent, but the performance will certainly not be very good. BaseX currently does not support high availability or any sort of failover management on its own. Using a NAS, you can start up several BaseX instances on the same data, but I would personally avoid writing through different instances; I am not sure if or how this will work. But in the case of failure (your first server goes down) you can then use another instance. This is not optimal, as it of course lacks the features of true HA and replication. So my project at the moment is in fact distribution management (i.e. having several nodes running in a cluster: you send your data to one of them and it will automagically distribute the data to be stored on a certain node, without you having to worry about it) and replication (meaning that there is no single point of failure). But this is still in the making, so please bear with me.
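The manual failover described here (use another instance when the first one goes down) could be sketched on the client side roughly as follows. This is a generic Python illustration, not a BaseX feature: `first_available`, the instance addresses, and the `run` callable (which stands for whatever client call sends a read-only query to one instance) are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_available(instances: Sequence[str], run: Callable[[str], T]) -> T:
    """Try each instance address in order and return the first result.

    `run` is a caller-supplied function that sends a read-only request to
    one instance. Only ConnectionError triggers a switch to the next
    instance; any other exception is passed through unchanged.
    """
    last_error: Optional[ConnectionError] = None
    for address in instances:
        try:
            return run(address)
        except ConnectionError as error:
            last_error = error  # remember the failure, try the next instance
    raise ConnectionError(f"no instance reachable: {list(instances)}") from last_error


if __name__ == "__main__":
    def fake_query(address: str) -> str:
        # Stand-in for a real client call; the first server is "down".
        if address == "server-a:1984":
            raise ConnectionError("server-a is down")
        return f"result from {address}"

    print(first_available(["server-a:1984", "server-b:1984"], fake_query))
```

In line with the caution above about writing through different instances, a sketch like this is only reasonable for reads; it provides none of the synchronization or replication that true HA would require.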

Cheers, Dirk

