Hi,
For the purposes of European Water Framework Directive reporting, I compared the performance of the Saxon and BaseX XQuery engines. I observe a performance gap of a factor of 100 to 200 depending on the use case (see functions test_xquery_monitoring() and test_xquery_multischema_2022() in scripts test_saxoncee.py and test_basex.sh, available at https://outil-transferts.ofb.fr/?107ae461a144d0b). Can you please help me understand the reasons for such gaps?
Thanks in advance,
Antonio Andrade
Data engineer
Hi Antonio,
my experience is very different: quite comparable performance, except for very specific cases, e.g. massive use of fn:idref(). Furthermore, the performance of BaseX is often so stupendous that an improvement by an order of magnitude (not to mention two) appears to me very difficult to imagine.
It makes me suspicious that one of your scripts is .py, the other .sh. I believe the scripts used for comparing should be absolutely analogous.
Kind regards,
Hans-Jürgen
Thanks for your feedback. I haven't found a Python API around the BaseX client. For convenience, I carried out my first tests with a bash script. In the meantime, I carried out other tests by creating Java processes from a Python script. I observe roughly identical performance differences. The Python/bash difference for the calling script does not seem to explain the observed performance differences.
On 19.04.2024 at 10:45, ANDRADE Antonio wrote:
Hi,
For the purposes of European Water Framework Directive reporting, I compared the performance of the Saxon and BaseX XQuery engines. I observe a performance gap of a factor of 100 to 200 depending on the use case.
I haven't tried to look at your files either, but I would add that SaxonC called from Python is usually faster than Saxon Java run from a shell script. Some of the difference you see may simply be the advantage of the AOT-compiled SaxonC over a classic Java application launched from a shell script, where JVM start-up and warm-up make a single run of code always seem relatively slow.
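One way to check how much of a measurement is plain process start-up is to time a trivial launch several times and take the median. The sketch below is only an illustration: the helper name median_wall_time is made up, and it times an empty Python interpreter as a stand-in, since the exact Java command lines used in the tests are not shown in this thread.

```python
import statistics
import subprocess
import sys
import time


def median_wall_time(cmd, runs=5):
    """Return the median wall-clock time (seconds) of launching cmd `runs` times."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time of the whole process matters.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in command (assumption): an interpreter that starts and exits at once.
    baseline = median_wall_time([sys.executable, "-c", "pass"])
    print(f"median start-up cost: {baseline:.3f} s")
```

Timing a trivial launch of the actual engine in the same way, and subtracting it from a full run, gives a rough idea of how much of the total is start-up rather than query evaluation.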
On Fri, 2024-04-19 at 10:45 +0200, ANDRADE Antonio wrote:
Hi,
For the purposes of European Water Framework Directive reporting, I compared the performance of the Saxon and BaseX XQuery engines.
First, you should consider (as I think Martin said) the Java runtime startup time, typically a second or so.
Second, BaseX is a database. If you will process the same document many times, first load it into a database and then use the Python BaseX client. This will avoid startup time, and, more importantly, will allow BaseX to make use of database indexes.
If you will only process any given document once, then Saxon may well be the appropriate tool.
liam
@Liam R. E. Quin: Thanks for your feedback. The processing time is between 2 minutes and more than 11 hours (see table below). Thus, the loading time of the Java virtual machine has little impact. The main XQuery script loads the XML document once at the start of processing. The document is then queried several times as part of more or less complex quality controls. At the moment, the XML document is not intended to be stored. This is why it is not loaded into a database before processing.
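| Check | Saxon start | Saxon stop | Saxon elapsed time | BaseX start | BaseX stop | BaseX elapsed time |
|---|---|---|---|---|---|---|
| Check Monitoring 2022 FRH | 06:16:54 | 06:19:30 | 00:02:36 | 06:44:06 | 10:05:21 | 03:21:15 |
| Check Multi-schema 2022 FRH | 06:25:46 | 06:31:47 | 00:06:01 | 10:05:55 | 11:39:07 | 01:33:12 |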
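For reference, the elapsed times and the BaseX/Saxon ratio can be recomputed from the start and stop timestamps in the table above. This is a small sketch; the short keys "monitoring" and "multischema" are labels chosen here for the two checks, and the runs are assumed to start and stop on the same day.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Start/stop timestamps copied from the table (assumed to be same-day runs).
RUNS = {
    "monitoring": {
        "Saxon": ("06:16:54", "06:19:30"),
        "BaseX": ("06:44:06", "10:05:21"),
    },
    "multischema": {
        "Saxon": ("06:25:46", "06:31:47"),
        "BaseX": ("10:05:55", "11:39:07"),
    },
}


def elapsed_seconds(start, stop):
    """Seconds between two HH:MM:SS timestamps on the same day."""
    delta = datetime.strptime(stop, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


def slowdown(check):
    """Ratio of BaseX elapsed time to Saxon elapsed time for one check."""
    saxon = elapsed_seconds(*RUNS[check]["Saxon"])
    basex = elapsed_seconds(*RUNS[check]["BaseX"])
    return basex / saxon


for check in RUNS:
    print(f"{check}: BaseX/Saxon = {slowdown(check):.1f}")
```

For these two runs the ratios come out at roughly 77 and 15.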
On Mon, 2024-04-22 at 08:54 +0200, ANDRADE Antonio wrote:
At this moment, the XML document is not intended to be stored. This is why it is not loaded into a database before processing.
BaseX is designed to operate primarily on documents in the database, which is why I suggest trying that.
Otherwise it would be like comparing Excel and Oracle Database in a benchmark that loaded a CSV file for each query, and concluding Excel was faster :-)
liam
Hi Antonio,
As Liam indicated, you may get better performance when adding your documents to a database.
In general, though, the runtimes of BaseX and Saxon have aligned pretty much over the years, and I assume there’ll be a trivial reason behind the drastic difference in the runtime.
Your test setup is probably too complex for us readers to spend more time with it. Could you possibly share a more basic example with us, ideally with a single document and query file?
Thanks in advance, Christian
Hi again,
I had a quick look into the monitoring code, and I noticed two things:
1. It looks to me (correct me if I’m wrong) as if the code of the project was initially written for Saxon and then ported to BaseX. If you are interested in using BaseX, you could focus on the slow functions, try alternative formulations and (if you want to run the code with both processors in the future) ensure that Saxon still delivers good performance.
2. Some functions can be noticeably sped up (for both BaseX and Saxon) if you use XQuery 3.1 features such as maps or group by. For example, the runtime of #131014 could possibly be reduced with something similar to…
    for $ms in $Monitoring/*:MonitoringSite
    let $emsc := $ms/*:euMonitoringSiteCode
    for $ceqm in $ms/*:ChemicalEcologicalQuantitativeMonitoring
    let $V_rech := $ceqm/*:parameterCode || '/' || $ceqm/*:parameterOther || '/' || $ceqm/*:chemicalMatrix
    group by $group := $emsc || ': ' || $V_rech
    where count($ceqm) > 1
    return $V_rech
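The gain from grouping comes from replacing repeated scans of the same data with a single pass. The following Python sketch is only an analogue of that idea, not the XQuery itself: the records are made-up sample data, and the "nested" variant is a guess at what an ungrouped duplicate check might look like, not the project's actual code.

```python
from collections import defaultdict

# Hypothetical sample records:
# (site code, parameterCode, parameterOther, chemicalMatrix)
RECORDS = [
    ("FRH001", "P1", "", "W"),
    ("FRH001", "P1", "", "W"),
    ("FRH001", "P2", "", "W"),
    ("FRH002", "P1", "", "W"),
]


def duplicates_nested(records):
    """Quadratic variant: for each record, rescan all records for equal keys."""
    found = set()
    for rec in records:
        if sum(1 for other in records if other == rec) > 1:
            found.add(rec)
    return found


def duplicates_grouped(records):
    """Single pass: count records per key, then keep keys seen more than once."""
    counts = defaultdict(int)
    for rec in records:
        counts[rec] += 1
    return {key for key, n in counts.items() if n > 1}


print(duplicates_grouped(RECORDS))
```

Both functions return the same set; the grouped variant touches each record once, which is what makes the difference on large inputs.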
If BaseX turns out to be the way to go, it’s definitely worth taking advantage of the database aspect. In BaseX, databases are fairly light-weight, which means you can simply create them before running the queries (e.g., with a single 'CREATE DB poc /path/to/poc_rapportage_controle-main/xml' command) and use db:get('poc', 'your-doc.xml') in the queries to access a document (or even stick with doc('your-doc.xml') if you enable DEFAULTDB [1]).
Hope this helps, Christian
[1] https://docs.basex.org/wiki/Options#DEFAULTDB
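Since the tests are already driven by creating processes from a Python script, the database step could be added the same way. This is a minimal sketch under several assumptions: the basex launcher is on the PATH, its -c option is used to execute the CREATE DB command, and checks.xq is a hypothetical query file that reads its input with db:get('poc', ...). The helper names are made up for illustration.

```python
import subprocess


def basex_commands(db_name, xml_dir, query_file):
    """Build two command lines: create the database once, then run the query file."""
    create = ["basex", "-c", f"CREATE DB {db_name} {xml_dir}"]
    run_query = ["basex", query_file]
    return create, run_query


def run_checks(db_name, xml_dir, query_file):
    """Create the database, run the query file, and return its standard output."""
    create, run_query = basex_commands(db_name, xml_dir, query_file)
    subprocess.run(create, check=True)
    result = subprocess.run(run_query, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # 'checks.xq' is a placeholder name (assumption), not a file from the project.
    create, run_query = basex_commands(
        "poc", "/path/to/poc_rapportage_controle-main/xml", "checks.xq")
    print(create)
    print(run_query)
```

The database only needs to be created once per input; later runs of the queries can reuse it.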
Hi Christian,
You're right: historically, the XQuery code was executed with the Saxon engine. This is no longer possible without paying for a license. In addition to the cost, this limits the replicability of the processing. This is why we are evaluating the BaseX solution.
I don't see how to profile XQuery code. I will carry out tests with a database. I will also improve the syntax of some queries. I will keep you informed of the results.
Thanks a lot,
Antonio