Hello,
I have evaluated XML databases for an important project, and I'm very happy with BaseX, which has great functionality and is very easy to use.
I'm now about to deploy my project to production, so I have some new questions. My database contains 400,000 XML resource files totaling 2.4 GB, so it should work fine (http://docs.basex.org/wiki/Statistics).
1) What CPU/RAM amounts are recommended for running BaseX in production?
2) Is there a benchmark I could run to estimate the required CPU/RAM amounts from a BaseX point of view?
3) What about the JVM options? In particular, should I change the default "-Xmx512m" value? What would be the best value?
4) With my database, on a t2.micro EC2 instance (1 vCPU + 1 GB RAM), BaseX is unusable. On a powerful shared server (16 vCPU + 16 GB RAM) it works fine (until I exceed my 512 MB personal hosting limitation). On my shared server, a request like "curl 'http://mydomain/rest/database/file.xml?query=/a/direct/path'" takes 1.5 seconds. That seems like a lot to me, and it should be faster, as I give BaseX the exact resource file and the complete path, so there is no computation to do. What do you think?
5) How many simultaneous requests can BaseXHTTP handle?
Best regards
Florent
Salut Florent,
Welcome to our list.
- What CPU/RAM amounts are recommended for running BaseX in production?
...
I'm giving you a single answer to your first three questions: it's difficult to give general advice here, as it mostly depends on what you plan to do with the stored documents. If your data is static, you should be fine with a small amount of memory assigned to the JVM (even 64m can be OK), as most of the caching will be done by the OS anyway.
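Regarding -Xmx specifically: a minimal sketch, assuming you use the standard BaseX startup scripts, which honor the BASEX_JVM environment variable for passing extra JVM options:

```shell
# Assumption: the basexhttp startup script appends $BASEX_JVM to the JVM options.
export BASEX_JVM="-Xmx512m"   # raise (e.g. -Xmx2g) only if queries run out of memory
basexhttp                     # start the HTTP server with that heap limit
```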
- With my database, on a t2.micro EC2 instance (1 vCPU + 1 GB RAM),
BaseX is unusable.
That's a good hint. What does "unusable" mean? Did you encounter problems creating the database, or are your queries running out of memory?
On my shared server, a request like "curl 'http://mydomain/rest/database/file.xml?query=/a/direct/path'" takes 1.5 seconds.
If you know that there will only be one "direct" or/and "path" step, you could try the following query (or similar ones):
/a/direct[1]/path[1]
Does your measurement include the serialization (the output) of the result? If yes, do you think that the result size could be an issue?
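If you pass such a query through the REST interface, remember that the square brackets must be URL-encoded; a small shell sketch (the host and database names are the placeholders from this thread, not real endpoints):

```shell
# Encode the positional predicates for use in a REST query parameter.
query='/a/direct[1]/path[1]'
encoded=$(printf '%s' "$query" | sed 's/\[/%5B/g; s/\]/%5D/g')
echo "http://mydomain/rest/database/file.xml?query=$encoded"
# The printed URL can then be fetched with curl as before.
```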
- How many simultaneous requests can handle BaseXHTTP ?
BaseXHTTP uses Jetty as its web server. I'm actually not sure what limit Jetty imposes, but it most probably exceeds BaseX's own maximum.
By default, BaseX itself supports 8 simultaneous transactions [1]. The default can be changed but, in practice, smaller values yield better results, because they reduce the amount of random I/O operations on disk. The main reason for providing concurrent requests at all is not performance, but to allow other operations to be executed while slow operations are being answered.
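Assuming [1] refers to the PARALLEL option (the option name is taken from the BaseX documentation, not from this thread), the limit can be adjusted server-side in the .basex configuration file:

```
# .basex server configuration (sketch)
# Maximum number of transactions executed in parallel (default: 8)
PARALLEL = 8
```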
Hope this helps, Christian
On Thu, Dec 18, 2014 at 2:14 AM, Christian Grün christian.gruen@gmail.com wrote:
- With my database, on a t2.micro EC2 instance (1 vCPU + 1 GB RAM),
BaseX is unusable.
That's a good hint. What does "unusable" mean? Did you encounter problems creating the database, or are your queries running out of memory?
"unusable" means "incredibly slow", the same request which takes 1.5 sec on my shared server takes 30 sec on the micro instance.
To get more speed, I tried partitioning my database into multiple small databases, and my results for the same request are surprising:
database size | shared CentOS    | t2.micro EC2 CoreOS
2.4 GB        | 1.5 sec 160% CPU | 30 sec
254 MB        | 0.35 sec 35% CPU | 0.21 sec
224 KB        | 0.24 sec 2% CPU  | 0.14 sec
It seems that the database size has a huge impact on BaseX performance. Is there a summary of BaseX best practices somewhere (database size and number of resource files)?
Best regards
Florent
database size | shared CentOS    | t2.micro EC2 CoreOS
2.4 GB        | 1.5 sec 160% CPU | 30 sec 3% CPU ??!!
254 MB        | 0.35 sec 35% CPU | 0.21 sec 30% CPU
224 KB        | 0.24 sec 2% CPU  | 0.14 sec 3% CPU
The CPU usage on the t2.micro is not what it should be for the 2.4 GB database, and it explains the extreme slowness of the request. But why??
Best regards
Florent
Florent:
We run BaseX on EC2 micro and small instances without significant issues (though AWS servers, particularly micro/small, are not known for their high performance). Have you tried a different instance type? I'm not sure how heavily loaded your instance is, but understand that the t2 instance type works well for servers that are not under constant pressure (they're Burstable Performance Instances). See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html. You essentially accumulate CPU "credits" while your instance is idle, which then allows your CPU to burst to 100% when the server gets hit. If you're out of credits, your CPU may be running at as low as 10% capacity.
Now, you're also working with a 2.5 GB database on a server with 1 GB of memory, so disk access almost inevitably comes into play. I assume you have an SSD EBS volume under the hood since you're on t2, but you'll be swapping, and this is more likely the problem. How much Java memory do you allocate to BaseX (-Xmx)? So I would say that in your case a t2.small or preferably a t2.medium would be a better choice (of course more pricey).
Talking about swap, be aware also that Amazon Linux instances don't come configured with a disk swap by default, so the 1 GB is all you get. I always add a 1 GB swap file to our micro instances using:
sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
sudo mkswap /swapfile
sudo swapon /swapfile
which you can verify with sudo swapon -s or free -k. Also add the line below to the /etc/fstab file (so it survives a reboot):
/swapfile swap swap defaults 0 0
Hope this helps, P
On 12/18/14, 2:43 PM, Florent Gallaire wrote:
database size | shared CentOS    | t2.micro EC2 CoreOS
2.4 GB        | 1.5 sec 160% CPU | 30 sec 3% CPU ??!!
254 MB        | 0.35 sec 35% CPU | 0.21 sec 30% CPU
224 KB        | 0.24 sec 2% CPU  | 0.14 sec 3% CPU
The CPU usage on the t2.micro is not what it should be for the 2.4 GB database, and it explains the extreme slowness of the request. But why??
Best regards
Florent
-- FLOSS Engineer & Lawyer
On Thu, Dec 18, 2014 at 3:35 PM, Pascal Heus pascal.heus@gmail.com wrote:
We run BaseX on EC2 micro and small instances without significant issues (though AWS servers, particularly micro/small, are not known for their high performance). Have you tried a different instance type? I'm not sure how heavily loaded your instance is, but understand that the t2 instance type works well for servers that are not under constant pressure (they're Burstable Performance Instances). See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html. You essentially accumulate CPU "credits" while your instance is idle, which then allows your CPU to burst to 100% when the server gets hit. If you're out of credits, your CPU may be running at as low as 10% capacity.
Thanks a lot Pascal, I think you have the explanation I was searching for! I have only tested the EC2 micro instance because there is a 1-year free offer. And the problem is CPU bound (at the beginning I was thinking of a RAM/swap problem).
Now, you're also working with a 2.5 GB database on a server with 1 GB of memory, so disk access almost inevitably comes into play. I assume you have an SSD EBS volume under the hood since you're on t2, but you'll be swapping, and this is more likely the problem. How much Java memory do you allocate to BaseX (-Xmx)? So I would say that in your case a t2.small or preferably a t2.medium would be a better choice (of course more pricey).
Yes, I use a 30 GB SSD EBS volume (the max for the free offer); Amazon says it performs 90 IOPS. I use the -Xmx default of 512 MB, and I run all that in a Docker container (CoreOS powered).
Talking about swap, be aware also that Amazon Linux instances don't come configured with a disk swap by default, so the 1 GB is all you get. I always add a 1 GB swap file to our micro instances using:
sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
sudo mkswap /swapfile
sudo swapon /swapfile
which you can verify with sudo swapon -s or free -k. Also add the line below to the /etc/fstab file (so it survives a reboot):
/swapfile swap swap defaults 0 0
Thanks for the tip!
Florent
Very good news! Now that I know how to write my request with XPath instead of using a direct path to the file, performance is excellent!
database size | direct path      | XPath
2.4 GB        | 1.5 sec 160% CPU | 0.25 sec 6% CPU
224 KB        | 0.24 sec 2% CPU  | 0.24 sec 2% CPU
It looks like path queries perform poorly when there is a huge number of files in the database, which seems fair to me since BaseX is optimized for XML-ish uses.
Best regards
Florent