Hi,
in a recent update of REx parser generator (v5.37), some type specifications were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit type check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).
Now it was reported that this change causes performance problems with BaseX (see https://twitter.com/apb1704/status/714219874441146368 ).
I reproduced the problem as follows:
- download the XQuery 3.1 grammar from http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a parser from it on http://bottlecaps.de/rex/ using command line options:
-xquery -tree -main
- run the generated parser, using this command line
basex -binput={42} CR-xquery-31-20151217.xquery
This works OK, but it takes about 80 seconds. Some analysis showed that the time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
declare function p:map2($c as xs:integer, $lo as xs:integer, $hi as xs:integer) as xs:integer
This variant completes in less than 3 seconds. But even when declaring a return type of
as xs:integer?
which might be the type that can be inferred statically, it completes fast.
Is this possibly a problem with the optimizer?
Which variant of generated code would be preferable for BaseX - typed or untyped?
Thanks, Gunther
Hi Gunther,
Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).
Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?
All the best, Christian
in a recent update of REx parser generator (v5.37), some type specifications were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit type check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).
Now it was reported that this change causes performance problems with BaseX (see https://twitter.com/apb1704/status/714219874441146368 ).
I reproduced the problem as follows:
- download the XQuery 3.1 grammar from
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a parser from it on http://bottlecaps.de/rex/ using command
line options:
-xquery -tree -main
run the generated parser, using this command line
basex -binput={42} CR-xquery-31-20151217.xquery
This works OK, but it takes about 80 seconds. Some analysis showed that the time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
declare function p:map2($c as xs:integer, $lo as xs:integer, $hi as
xs:integer) as xs:integer
This variant completes in less than 3 seconds. But even when declaring a return type of
as xs:integer?
which might be the type that can be inferred statically, it completes fast.
Is this possibly a problem with the optimizer?
Which variant of generated code would be preferable for BaseX - typed or untyped?
Thanks, Gunther
Did you hear sth. about the performance of MarkLogic
My understanding is that on ML the presence typing blocks some tail-recursion optimizations leading to stack overflow for large texts.
Stack overflow is also an issue for my use of REx with BaseX, I would like to be able to parse sources like the FunctX library without having to set a large Java stack size. [1] (My report that the issue was fixed was incorrect). I would like to investigate if BaseX is using tail-recursion optimizations here and/or where/why this overflow happens - but it all looks very complicated :-).
I suspect the users of XQuery REx are capable of hand tweaking the REx o/p either to add or remove type declarations once the issues are well documented.
/Andy
[1] http://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg07212.ht...
On 29 March 2016 at 16:24, Christian Grün christian.gruen@gmail.com wrote:
Hi Gunther,
Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).
Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?
All the best, Christian
in a recent update of REx parser generator (v5.37), some type
specifications
were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit
type
check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).
Now it was reported that this change causes performance problems with
BaseX
(see https://twitter.com/apb1704/status/714219874441146368 ).
I reproduced the problem as follows:
- download the XQuery 3.1 grammar from
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a parser from it on http://bottlecaps.de/rex/ using command
line options:
-xquery -tree -main
run the generated parser, using this command line
basex -binput={42} CR-xquery-31-20151217.xquery
This works OK, but it takes about 80 seconds. Some analysis showed that
the
time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
declare function p:map2($c as xs:integer, $lo as xs:integer, $hi
as
xs:integer) as xs:integer
This variant completes in less than 3 seconds. But even when declaring a return type of
as xs:integer?
which might be the type that can be inferred statically, it completes
fast.
Is this possibly a problem with the optimizer?
Which variant of generated code would be preferable for BaseX - typed or untyped?
Thanks, Gunther
Looking again at Florent's message http://markmail.org/message/gxi26da4crk2v5ge... After testing it seems is the length of the XQuery comments that trigger the stack overflow in BaseX as well. /Andy
On 29 March 2016 at 16:54, Andy Bunce bunce.andy@gmail.com wrote:
Did you hear sth. about the performance of MarkLogic
My understanding is that on ML the presence typing blocks some tail-recursion optimizations leading to stack overflow for large texts.
Stack overflow is also an issue for my use of REx with BaseX, I would like to be able to parse sources like the FunctX library without having to set a large Java stack size. [1] (My report that the issue was fixed was incorrect). I would like to investigate if BaseX is using tail-recursion optimizations here and/or where/why this overflow happens - but it all looks very complicated :-).
I suspect the users of XQuery REx are capable of hand tweaking the REx o/p either to add or remove type declarations once the issues are well documented.
/Andy
[1] http://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg07212.ht...
On 29 March 2016 at 16:24, Christian Grün christian.gruen@gmail.com wrote:
Hi Gunther,
Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).
Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?
All the best, Christian
in a recent update of REx parser generator (v5.37), some type
specifications
were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit
type
check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).
Now it was reported that this change causes performance problems with
BaseX
(see https://twitter.com/apb1704/status/714219874441146368 ).
I reproduced the problem as follows:
- download the XQuery 3.1 grammar from
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a parser from it on http://bottlecaps.de/rex/ using
command
line options:
-xquery -tree -main
run the generated parser, using this command line
basex -binput={42} CR-xquery-31-20151217.xquery
This works OK, but it takes about 80 seconds. Some analysis showed that
the
time can be influenced by putting back the type specification to
function
p:map2, i.e. declaring 'as xs:integer' as the result type:
declare function p:map2($c as xs:integer, $lo as xs:integer, $hi
as
xs:integer) as xs:integer
This variant completes in less than 3 seconds. But even when declaring a return type of
as xs:integer?
which might be the type that can be inferred statically, it completes
fast.
Is this possibly a problem with the optimizer?
Which variant of generated code would be preferable for BaseX - typed or untyped?
Thanks, Gunther
So we may probably need to find out if the query can be optimized at all…
On Tue, Mar 29, 2016 at 7:01 PM, Andy Bunce bunce.andy@gmail.com wrote:
Looking again at Florent's message http://markmail.org/message/gxi26da4crk2v5ge... After testing it seems is the length of the XQuery comments that trigger the stack overflow in BaseX as well. /Andy
On 29 March 2016 at 16:54, Andy Bunce bunce.andy@gmail.com wrote:
Did you hear sth. about the performance of MarkLogic
My understanding is that on ML the presence typing blocks some tail-recursion optimizations leading to stack overflow for large texts.
Stack overflow is also an issue for my use of REx with BaseX, I would like to be able to parse sources like the FunctX library without having to set a large Java stack size. [1] (My report that the issue was fixed was incorrect). I would like to investigate if BaseX is using tail-recursion optimizations here and/or where/why this overflow happens - but it all looks very complicated :-).
I suspect the users of XQuery REx are capable of hand tweaking the REx o/p either to add or remove type declarations once the issues are well documented.
/Andy
[1] http://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg07212.ht...
On 29 March 2016 at 16:24, Christian Grün christian.gruen@gmail.com wrote:
Hi Gunther,
Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).
Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?
All the best, Christian
in a recent update of REx parser generator (v5.37), some type specifications were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit type check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).
Now it was reported that this change causes performance problems with BaseX (see https://twitter.com/apb1704/status/714219874441146368 ).
I reproduced the problem as follows:
- download the XQuery 3.1 grammar from
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a parser from it on http://bottlecaps.de/rex/ using
command line options:
-xquery -tree -main
run the generated parser, using this command line
basex -binput={42} CR-xquery-31-20151217.xquery
This works OK, but it takes about 80 seconds. Some analysis showed that the time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
declare function p:map2($c as xs:integer, $lo as xs:integer, $hi
as xs:integer) as xs:integer
This variant completes in less than 3 seconds. But even when declaring a return type of
as xs:integer?
which might be the type that can be inferred statically, it completes fast.
Is this possibly a problem with the optimizer?
Which variant of generated code would be preferable for BaseX - typed or untyped?
Thanks, Gunther
Hi Gunther, Do I under understand correctly from: https://twitter.com/__Gunther__/status/709744679361912832
Given the way the EBNF for XQuery comments is defined - If (REx) p:transition is not optimized via tail-recursion then long comments will use excessive stack? /Andy
On 29 March 2016 at 18:05, Christian Grün christian.gruen@gmail.com wrote:
So we may probably need to find out if the query can be optimized at all…
On Tue, Mar 29, 2016 at 7:01 PM, Andy Bunce bunce.andy@gmail.com wrote:
Looking again at Florent's message http://markmail.org/message/gxi26da4crk2v5ge... After testing it seems is the length of the XQuery comments that trigger
the
stack overflow in BaseX as well. /Andy
On 29 March 2016 at 16:54, Andy Bunce bunce.andy@gmail.com wrote:
Did you hear sth. about the performance of MarkLogic
My understanding is that on ML the presence typing blocks some tail-recursion optimizations leading to stack overflow for large texts.
Stack overflow is also an issue for my use of REx with BaseX, I would like to be able to parse sources like the FunctX library without
having to
set a large Java stack size. [1] (My report that the issue was fixed was incorrect). I would like to investigate if BaseX is using tail-recursion optimizations here and/or where/why this overflow happens - but it all
looks
very complicated :-).
I suspect the users of XQuery REx are capable of hand tweaking the REx
o/p
either to add or remove type declarations once the issues are well documented.
/Andy
[1]
http://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg07212.ht...
On 29 March 2016 at 16:24, Christian Grün christian.gruen@gmail.com wrote:
Hi Gunther,
Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).
Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?
All the best, Christian
in a recent update of REx parser generator (v5.37), some type specifications were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit type check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).
Now it was reported that this change causes performance problems with BaseX (see https://twitter.com/apb1704/status/714219874441146368 ).
I reproduced the problem as follows:
- download the XQuery 3.1 grammar from
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a parser from it on http://bottlecaps.de/rex/ using
command line options:
-xquery -tree -main
run the generated parser, using this command line
basex -binput={42} CR-xquery-31-20151217.xquery
This works OK, but it takes about 80 seconds. Some analysis showed
that
the time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
declare function p:map2($c as xs:integer, $lo as xs:integer,
$hi
as xs:integer) as xs:integer
This variant completes in less than 3 seconds. But even when
declaring
a return type of
as xs:integer?
which might be the type that can be inferred statically, it completes fast.
Is this possibly a problem with the optimizer?
Which variant of generated code would be preferable for BaseX - typed or untyped?
Thanks, Gunther
Hi Andy,
that's correct. p:transition ought to be tail-call optimized to be prepared for tokens of arbitrary size.
Best regards Gunther
On 29.03.2016 19:17, Andy Bunce wrote:
Hi Gunther, Do I under understand correctly from: https://twitter.com/__Gunther__/status/709744679361912832
Given the way the EBNF for XQuery comments is defined - If (REx) p:transition is not optimized via tail-recursion then long comments will use excessive stack? /Andy
On 29 March 2016 at 18:05, Christian Grün <christian.gruen@gmail.com mailto:christian.gruen@gmail.com> wrote:
So we may probably need to find out if the query can be optimized at all… On Tue, Mar 29, 2016 at 7:01 PM, Andy Bunce <bunce.andy@gmail.com <mailto:bunce.andy@gmail.com>> wrote: > Looking again at Florent's message > http://markmail.org/message/gxi26da4crk2v5ge... > After testing it seems is the length of the XQuery comments that trigger the > stack overflow in BaseX as well. > /Andy > > > On 29 March 2016 at 16:54, Andy Bunce <bunce.andy@gmail.com <mailto:bunce.andy@gmail.com>> wrote: >> >> >Did you hear sth. about the performance of MarkLogic >> My understanding is that on ML the presence typing blocks some >> tail-recursion optimizations leading to stack overflow for large texts. >> >> Stack overflow is also an issue for my use of REx with BaseX, I would >> like to be able to parse sources like the FunctX library without having to >> set a large Java stack size. [1] (My report that the issue was fixed was >> incorrect). I would like to investigate if BaseX is using tail-recursion >> optimizations here and/or where/why this overflow happens - but it all looks >> very complicated :-). >> >> I suspect the users of XQuery REx are capable of hand tweaking the REx o/p >> either to add or remove type declarations once the issues are well >> documented. >> >> /Andy >> >> >> [1] >> http://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg07212.html >> >> On 29 March 2016 at 16:24, Christian Grün <christian.gruen@gmail.com <mailto:christian.gruen@gmail.com>> >> wrote: >>> >>> Hi Gunther, >>> >>> Thanks for your mail. After some research, I don’t see a quick way to >>> statically infer the type in the function you mentioned, mostly >>> because it’s called recursively (while statically inferring the type >>> of the function, the return type of map2 function is requested, and it >>> will be the type of the function signature, because the inferred type >>> is not available yet…). >>> >>> Personally, I would love to see the return types readded, not only >>> because it speeds up BaseX, but also because I think it’s a general >>> advantage of XQuery to have typed functions, and . Do you think there >>> is any chance to revert the change? Did you hear sth. about the >>> performance of MarkLogic – is the version without argument types >>> comparable to BaseX in terms of execution time, or much faster? >>> >>> All the best, >>> Christian >>> >>> >>> >>> > in a recent update of REx parser generator (v5.37), some type >>> > specifications >>> > were removed from generated XQuery and XSLT functions. This affects >>> > tail-recursive functions, and it was done because it turned out that >>> > MarkLogic fails to optimize tail calls, presumably due to an explicit >>> > type >>> > check caused by the type specification (see >>> > http://markmail.org/message/gxi26da4crk2v5ge ). >>> > >>> > Now it was reported that this change causes performance problems with >>> > BaseX >>> > (see https://twitter.com/apb1704/status/714219874441146368 ). >>> > >>> > I reproduced the problem as follows: >>> > >>> > - download the XQuery 3.1 grammar from >>> > http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf >>> > >>> > - generate a parser from it on http://bottlecaps.de/rex/ using >>> > command >>> > line options: >>> > >>> > -xquery -tree -main >>> > >>> > - run the generated parser, using this command line >>> > >>> > basex -binput={42} CR-xquery-31-20151217.xquery >>> > >>> > This works OK, but it takes about 80 seconds. Some analysis showed that >>> > the >>> > time can be influenced by putting back the type specification to >>> > function >>> > p:map2, i.e. declaring 'as xs:integer' as the result type: >>> > >>> > declare function p:map2($c as xs:integer, $lo as xs:integer, $hi >>> > as >>> > xs:integer) as xs:integer >>> > >>> > This variant completes in less than 3 seconds. But even when declaring >>> > a >>> > return type of >>> > >>> > as xs:integer? >>> > >>> > which might be the type that can be inferred statically, it completes >>> > fast. >>> > >>> > Is this possibly a problem with the optimizer? >>> > >>> > Which variant of generated code would be preferable for BaseX - typed >>> > or >>> > untyped? >>> > >>> > Thanks, >>> > Gunther >> >> >
Hi Andy, hi Gunther,
Did you hear sth. about the performance of MarkLogic
My understanding is that on ML the presence typing blocks some tail-recursion optimizations leading to stack overflow for large texts.
Yes, this is how I got it, too. However, I was just wondering what’s the runtime for MarkLogic for the version without return types.
I found out it seems to be generally difficult to infer the static type of recursive functions as the one we discussed. However, I found another way to speed up the parser code:
The problem was that the positional access to sequences ($p:TRANSITION[$i0 ...]) was slow if it was not possible to statically assess that $i0 will be numeric. I have now added another implementation of the filter expression for filters with simple, deterministic predicates, which checks the predicate result type at runtime.
If the return type is omitted from p:map2(), the query is now appr. 15% slower, but this might still be ok. I would still love to see the types readded; but after all, this is not my code ;) A new snapshot is available [1].
Stack overflow is also an issue for my use of REx with BaseX, I would like to be able to parse sources like the FunctX library without having to set a large Java stack size. [1] (My report that the issue was fixed was incorrect). I would like to investigate if BaseX is using tail-recursion optimizations here and/or where/why this overflow happens - but it all looks very complicated :-).
If you find out more, please let us know!
Christian
Hi Christian,
thank you for your analysis. Here is my view of this:
- generally I would second your preference for typed code. It feels cleaner to me, and that's why the type specifications were there originally.
- tail call optimization is crucial for REx generated code. It is necessary to iteratively assemble result sequences while tokenizing/parsing, where these iterations require maintenance of the tokenizer/parser's state. The concrete tasks that use it are
1) assembling token value (p:transition), e.g. comment contents 2) handling EBNF iteration (ZeroOrMore, OneOrMore) - long lists 3) skipping over multiple whitespace items (p:matchW) 3) creating error token lists (p:token) - limited recursion 4) binary search (as in p:map2) - very limited recursion
The first three can hardly go without TCO, the others are rather limited by the number of tokens, or log2 of the number of character classes.
- REx code generation for XQuery was originally developed on Saxon, and I am pretty sure that all of the above are in favor of tail call optimization when run on Saxon, with or without the type spec.
- I have never worked with MarkLogic myself, so no experience here.
- when I ran some tests with BaseX, my impression was that tail call optimization can be achieved for p:transition. Not sure, though.
- when I dropped the type spec, I was hoping not to create other problems with it. Apparently I did, and that is reason enough to rollback this soon. I may create a new "ML" option to address their optimizer's behavior - though I'd prefer to find a common solution.
- to be honest, I am not convinced that XQuery is the ideal language for coding parsers. I was just exploring whether it could be done, when I started on this, and I found that once you have a parse tree in XDM, you can do a lot of interesting tasks on it.
- if anybody was interested, I could generate Java (or other) extension functions for some XQuery processor, like I do for Saxon. That would combine high performance with ease of use. I would love to see Adam Retter's proposal for portable Expath extension functions ( http://www.adamretter.org.uk/presentations/implementation-of-portable-expath... ) go forward. REx is prepared for that, a Haxe code generator for is already available now. Just needs to be fitted to the completed Expath Haxe API, whenever this happens.
Sorry for any inconvenience that this may have caused.
Best regards Gunther
On 29.03.2016 17:24, Christian Grün wrote:
Hi Gunther,
Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).
Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?
All the best, Christian
in a recent update of REx parser generator (v5.37), some type specifications were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit type check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).
Now it was reported that this change causes performance problems with BaseX (see https://twitter.com/apb1704/status/714219874441146368 ).
I reproduced the problem as follows:
- download the XQuery 3.1 grammar from
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a parser from it on http://bottlecaps.de/rex/ using command
line options:
-xquery -tree -main
run the generated parser, using this command line
basex -binput={42} CR-xquery-31-20151217.xquery
This works OK, but it takes about 80 seconds. Some analysis showed that the time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
declare function p:map2($c as xs:integer, $lo as xs:integer, $hi as
xs:integer) as xs:integer
This variant completes in less than 3 seconds. But even when declaring a return type of
as xs:integer?
which might be the type that can be inferred statically, it completes fast.
Is this possibly a problem with the optimizer?
Which variant of generated code would be preferable for BaseX - typed or untyped?
Thanks, Gunther
Hi Gunther,
Thanks for your comprehensive feedback!
- when I ran some tests with BaseX, my impression was that tail call optimization can be achieved for p:transition. Not sure, though.
Out of interest, I checked our query plan (using -x on command-line). As you already guessed, p:transition is tail-call optimized, as well as p:transition, p:matchW and p:token (and various other functions), no matter if return types are used or not.
- to be honest, I am not convinced that XQuery is the ideal language for coding parsers. I was just exploring whether it could be done, when I started on this, and I found that once you have a parse tree in XDM, you can do a lot of interesting tasks on it.
Yes, it works very well, and it’s interesting to see how many people use it for all kinds of things. In practice, it’s mostly the parsing time that takes time, so I hope we’ll soon manage to keep pre-compiled XQuery modules in main-memory. We are working on that.
- if anybody was interested, I could generate Java (or other) extension functions for some XQuery processor, like I do for Saxon.
This sounds interesting indeed! Could you provide us with a link to the Saxon extension functions?
That would combine high performance with ease of use. I would love to see Adam Retter's proposal for portable Expath extension functions (
http://www.adamretter.org.uk/presentations/implementation-of-portable-expath... ) go forward.
Me too. I’m just not sure how much time it would cost us to make it happen. I’ll put it on my mental queue…
Sorry for any inconvenience that this may have caused.
No reason to be sorry. Issue like that are a perfect chance to further improve our processor.
Christian
Thanks, this works (I overlooked the "configure" checkbox on the web page); I could successfully compile and run your code.
The numerous imports in the Java source file seem to indicate that it won’t be straightforward to do something similar for BaseX, right? But maybe I’m wrong – if you have time to give it a try, all your questions will be welcome!
Christian
On Wed, Mar 30, 2016 at 5:57 PM, Gunther Rademacher grd@gmx.net wrote:
Hi Christian,
This sounds interesting indeed! Could you provide us with a link to the Saxon extension functions?
it's an option when generating a parser in Java on the REx website, http://bottlecaps.de/rex/
For parsing XQuery from XQuery via a Saxon extension function,
- download the XQuery grammar, e.g.
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
generate a Java-coded parser from it, using these command line options
-java -tree -saxon
compile the result
javac CR_xquery_31_20151217.java
register the extension function, e.g. using the generated Saxon
initializer class:
java net.sf.saxon.Query
-init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery('42')" !indent=yes
The Saxon API that is used for building the parse tree has changed recently, so be sure to use Saxon 9.7.0.4 or later.
Best regards Gunther
Hi Gunther, hi all,
here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:
1. download the XQuery grammar, e.g. http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
2. generate a Java-coded parser from it, using these command line options -java -tree -main
3. compile the result javac CR_xquery_31_20151217.java
4. run the attached XQuery files with BaseX or Saxon EE, and with the compiled parser classes in the classpath, e.g.:
java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq
(The semicolon must be replaced with a colon on Unix/Linux-based systems).
In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?
Hope this helps, feedback is welcome, Christian
Hi Christian,
great - and fairly simple.
My proposal would have been to implement a DOM tree builder on the EventHandler interface (where you use XmlSerializer), and that would not require much effort either. It need not necessarily be done in the generated code, but could as well be implemented on top.
I am busy right now, but will be able to present some code tonight.
Is there a different tree model than DOM, that you would prefer for BaseX?
By the way, the generated Saxon imports serve two purposes:
- adapting to the extension function API (necessary when using Saxon-HE) - using Saxon's native tree builder.
Best regards Gunther
Hi Gunther,
I am busy right now, but will be able to present some code tonight.
Thanks! Take your time.
Is there a different tree model than DOM, that you would prefer for BaseX?
I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):
import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }
Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.
So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):
import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }
But I’m wondering who’ll eventually care about the difference ;)
Christian
By the way, the generated Saxon imports serve two purposes:
- adapting to the extension function API (necessary when using Saxon-HE)
- using Saxon's native tree builder.
Best regards Gunther --
Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all,
here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:
- download the XQuery grammar, e.g.
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a Java-coded parser from it, using these command line options
-java -tree -main
- compile the result
javac CR_xquery_31_20151217.java
- run the attached XQuery files with BaseX or Saxon EE, and with the
compiled parser classes in the classpath, e.g.:
java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq
(The semicolon must be replaced with a colon on Unix/Linux-based systems).
In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?
Hope this helps, feedback is welcome, Christian
Very nice. I have hacked a ex-path packaged version as a POC...
Install
" https://github.com/expkg-zone58/ex-xparse/releases/download/v0.2.1/xq-xparse... " !repo:install(.)
Run
import module namespace xq="test.org/ns/xq-parse";
"http://www.xqueryfunctions.com/xq/functx-1.0-doc-2007-01.xq" !unparsed-text(.) !xq:parse(.,map{"flatten":true()})
The parsing time is about 250ms for me and no stack overflow. /Andy
On 31 March 2016 at 14:01, Christian Grün christian.gruen@gmail.com wrote:
Hi Gunther,
I am busy right now, but will be able to present some code tonight.
Thanks! Take your time.
Is there a different tree model than DOM, that you would prefer for
BaseX?
I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):
import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }
Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.
So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):
import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }
But I’m wondering who’ll eventually care about the difference ;)
Christian
By the way, the generated Saxon imports serve two purposes:
- adapting to the extension function API (necessary when using
Saxon-HE)
- using Saxon's native tree builder.
Best regards Gunther --
Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on
REx-generated parser
Hi Gunther, hi all,
here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:
- download the XQuery grammar, e.g.
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a Java-coded parser from it, using these command line options
-java -tree -main
- compile the result
javac CR_xquery_31_20151217.java
- run the attached XQuery files with BaseX or Saxon EE, and with the
compiled parser classes in the classpath, e.g.:
java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq
(The semicolon must be replaced with a colon on Unix/Linux-based
systems).
In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?
Hope this helps, feedback is welcome, Christian
Hi Christian,
please find my code attached. I have tested it along with an XQuery 3.1 parser, that was generated using command line options:
-tree -main -java -saxon It contains the DOM tree builder, as well as your approach using XmlSerializer followed by XML parsing, both for BaseX and for Saxon.
In my tests I have parsed the XQuery code for the same grammar, roughly 1 MB, and counted nodes of the parse tree.
These are the commands that I have used:
java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDBNode(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -qs:"declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:XQueryParser$SaxonInitializer -qs:"declare namespace p='XQueryParser'; p:parseXQueryToNodeInfo(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
And here are the results (best runtime in seconds out of several executions):
| BaseX | SaxonEE ---------------+-----------+------------ DOM builder | 4.48 | 2.98 parseXml | 3.57 | 3.24 native builder | - | 2.36
As you expected, using DOM seems not to be advantageous for BaseX. However the Saxon results suggest that a native tree builder API can do better than parsing XML.
Best regards Gunther
Gesendet: Donnerstag, 31. März 2016 um 15:01 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,
I am busy right now, but will be able to present some code tonight.
Thanks! Take your time.
Is there a different tree model than DOM, that you would prefer for BaseX?
I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):
import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }
Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.
So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):
import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }
But I’m wondering who’ll eventually care about the difference ;)
Christian
By the way, the generated Saxon imports serve two purposes:
- adapting to the extension function API (necessary when using Saxon-HE)
- using Saxon's native tree builder.
Best regards Gunther --
Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all,
here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:
- download the XQuery grammar, e.g.
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a Java-coded parser from it, using these command line options
-java -tree -main
- compile the result
javac CR_xquery_31_20151217.java
- run the attached XQuery files with BaseX or Saxon EE, and with the
compiled parser classes in the classpath, e.g.:
java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq
(The semicolon must be replaced with a colon on Unix/Linux-based systems).
In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?
Hope this helps, feedback is welcome, Christian
Hi Gunther,
Thanks again! Thanks to your examples, which create 38 MB of serialized XML, I now see why it is in fact beneficial to use a tree builder ;)
I finally looked at your Saxon code a bit closer, and I rewrote it a bit to work with BaseX:
* I added a parse(String query) function, which basically does what ExtensionFunctionCall.call does * I renamed SaxonTreeBuilder to BaseXTreeBuilder, which now calls the appropriate BaseX builder functions * The TopDownTreeBuilder stays unchanged
I have attached the resulting code; it seems to be much faster indeed. Does it make any sense to you? Do you think it would make sense to provide both a Saxon and BaseX option on your parser page?
Christian
On Fri, Apr 1, 2016 at 12:32 AM, Gunther Rademacher grd@gmx.net wrote:
Hi Christian,
please find my code attached. I have tested it along with an XQuery 3.1 parser, that was generated using command line options:
-tree -main -java -saxon
It contains the DOM tree builder, as well as your approach using XmlSerializer followed by XML parsing, both for BaseX and for Saxon.
In my tests I have parsed the XQuery code for the same grammar, roughly 1 MB, and counted nodes of the parse tree.
These are the commands that I have used:
java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDBNode(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -qs:"declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:XQueryParser$SaxonInitializer -qs:"declare namespace p='XQueryParser'; p:parseXQueryToNodeInfo(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
And here are the results (best runtime in seconds out of several executions):
| BaseX | SaxonEE ---------------+-----------+------------ DOM builder | 4.48 | 2.98 parseXml | 3.57 | 3.24 native builder | - | 2.36
As you expected, using DOM seems not to be advantageous for BaseX. However the Saxon results suggest that a native tree builder API can do better than parsing XML.
Best regards Gunther
Gesendet: Donnerstag, 31. März 2016 um 15:01 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,
I am busy right now, but will be able to present some code tonight.
Thanks! Take your time.
Is there a different tree model than DOM, that you would prefer for BaseX?
I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):
import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }
Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.
So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):
import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }
But I’m wondering who’ll eventually care about the difference ;)
Christian
By the way, the generated Saxon imports serve two purposes:
- adapting to the extension function API (necessary when using Saxon-HE)
- using Saxon's native tree builder.
Best regards Gunther --
Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all,
here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:
- download the XQuery grammar, e.g.
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a Java-coded parser from it, using these command line options
-java -tree -main
- compile the result
javac CR_xquery_31_20151217.java
- run the attached XQuery files with BaseX or Saxon EE, and with the
compiled parser classes in the classpath, e.g.:
java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq
(The semicolon must be replaced with a colon on Unix/Linux-based systems).
In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?
Hope this helps, feedback is welcome, Christian
Hi Christian,
thank you for the tree builder proposal, it works fine indeed.
I have slightly modified the extension function such that it behaves the same as generated XQuery code, so can be used to replace it without further adaptations of the code that calls it.
Also I have used Str rather than String, in order to create a unique signature identifying a BaseX extension function.
Finally, the call of the parser's parse_x method was isolated in order to prepare for multiple extension functions in a single class. This occurs when there are multiple start symbols in a grammar.
The modified code is attached to this mail. It is stripped down to what would be added to REx-generated code for '-basex'.
Best regards Gunther
Gesendet: Freitag, 01. April 2016 um 17:57 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,
Thanks again! Thanks to your examples, which create 38 MB of serialized XML, I now see why it is in fact beneficial to use a tree builder ;)
I finally looked at your Saxon code a bit closer, and I rewrote it a bit to work with BaseX:
* I added a parse(String query) function, which basically does what ExtensionFunctionCall.call does * I renamed SaxonTreeBuilder to BaseXTreeBuilder, which now calls the appropriate BaseX builder functions * The TopDownTreeBuilder stays unchanged
I have attached the resulting code; it seems to be much faster indeed. Does it make any sense to you? Do you think it would make sense to provide both a Saxon and BaseX option on your parser page?
Christian
On Fri, Apr 1, 2016 at 12:32 AM, Gunther Rademacher grd@gmx.net wrote:
Hi Christian,
please find my code attached. I have tested it along with an XQuery 3.1 parser, that was generated using command line options:
-tree -main -java -saxon
It contains the DOM tree builder, as well as your approach using XmlSerializer followed by XML parsing, both for BaseX and for Saxon.
In my tests I have parsed the XQuery code for the same grammar, roughly 1 MB, and counted nodes of the parse tree.
These are the commands that I have used:
java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDBNode(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -qs:"declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:XQueryParser$SaxonInitializer -qs:"declare namespace p='XQueryParser'; p:parseXQueryToNodeInfo(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
And here are the results (best runtime in seconds out of several executions):
| BaseX | SaxonEE ---------------+-----------+------------ DOM builder | 4.48 | 2.98 parseXml | 3.57 | 3.24 native builder | - | 2.36
As you expected, using DOM seems not to be advantageous for BaseX. However the Saxon results suggest that a native tree builder API can do better than parsing XML.
Best regards Gunther
Gesendet: Donnerstag, 31. März 2016 um 15:01 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,
I am busy right now, but will be able to present some code tonight.
Thanks! Take your time.
Is there a different tree model than DOM, that you would prefer for BaseX?
I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):
import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }
Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.
So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):
import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }
But I’m wondering who’ll eventually care about the difference ;)
Christian
By the way, the generated Saxon imports serve two purposes:
- adapting to the extension function API (necessary when using Saxon-HE)
- using Saxon's native tree builder.
Best regards Gunther --
Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all,
here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:
- download the XQuery grammar, e.g.
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a Java-coded parser from it, using these command line options
-java -tree -main
- compile the result
javac CR_xquery_31_20151217.java
- run the attached XQuery files with BaseX or Saxon EE, and with the
compiled parser classes in the classpath, e.g.:
java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq
(The semicolon must be replaced with a colon on Unix/Linux-based systems).
In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?
Hope this helps, feedback is welcome, Christian
Hi Gunther,
Your code looks excellent. And I was amazed to see you’ve even found out how to build the error node via our MemBuilder code (which is not really well-documented, compared to real APIs)… I’ll be happy to play around with it and give more feedback once the -basex option will be available.
Thanks! Christian
On Mon, Apr 4, 2016 at 8:07 AM, Gunther Rademacher grd@gmx.net wrote:
Hi Christian,
thank you for the tree builder proposal, it works fine indeed.
I have slightly modified the extension function such that it behaves the same as generated XQuery code, so can be used to replace it without further adaptations of the code that calls it.
Also I have used Str rather than String, in order to create a unique signature identifying a BaseX extension function.
Finally, the call of the parser's parse_x method was isolated in order to prepare for multiple extension functions in a single class. This occurs when there are multiple start symbols in a grammar.
The modified code is attached to this mail. It is stripped down to what would be added to REx-generated code for '-basex'.
Best regards Gunther
Gesendet: Freitag, 01. April 2016 um 17:57 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,
Thanks again! Thanks to your examples, which create 38 MB of serialized XML, I now see why it is in fact beneficial to use a tree builder ;)
I finally looked at your Saxon code a bit closer, and I rewrote it a bit to work with BaseX:
- I added a parse(String query) function, which basically does what
ExtensionFunctionCall.call does
- I renamed SaxonTreeBuilder to BaseXTreeBuilder, which now calls the
appropriate BaseX builder functions
- The TopDownTreeBuilder stays unchanged
I have attached the resulting code; it seems to be much faster indeed. Does it make any sense to you? Do you think it would make sense to provide both a Saxon and BaseX option on your parser page?
Christian
On Fri, Apr 1, 2016 at 12:32 AM, Gunther Rademacher grd@gmx.net wrote:
Hi Christian,
please find my code attached. I have tested it along with an XQuery 3.1 parser, that was generated using command line options:
-tree -main -java -saxon
It contains the DOM tree builder, as well as your approach using XmlSerializer followed by XML parsing, both for BaseX and for Saxon.
In my tests I have parsed the XQuery code for the same grammar, roughly 1 MB, and counted nodes of the parse tree.
These are the commands that I have used:
java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDBNode(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -qs:"declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:XQueryParser$SaxonInitializer -qs:"declare namespace p='XQueryParser'; p:parseXQueryToNodeInfo(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
And here are the results (best runtime in seconds out of several executions):
| BaseX | SaxonEE ---------------+-----------+------------ DOM builder | 4.48 | 2.98 parseXml | 3.57 | 3.24 native builder | - | 2.36
As you expected, using DOM seems not to be advantageous for BaseX. However the Saxon results suggest that a native tree builder API can do better than parsing XML.
Best regards Gunther
Gesendet: Donnerstag, 31. März 2016 um 15:01 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,
I am busy right now, but will be able to present some code tonight.
Thanks! Take your time.
Is there a different tree model than DOM, that you would prefer for BaseX?
I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):
import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }
Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.
So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):
import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;
static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }
But I’m wondering who’ll eventually care about the difference ;)
Christian
By the way, the generated Saxon imports serve two purposes:
- adapting to the extension function API (necessary when using Saxon-HE)
- using Saxon's native tree builder.
Best regards Gunther --
Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all,
here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:
- download the XQuery grammar, e.g.
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a Java-coded parser from it, using these command line options
-java -tree -main
- compile the result
javac CR_xquery_31_20151217.java
- run the attached XQuery files with BaseX or Saxon EE, and with the
compiled parser classes in the classpath, e.g.:
java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq
(The semicolon must be replaced with a colon on Unix/Linux-based systems).
In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?
Hope this helps, feedback is welcome, Christian
Hi Christian,
thanks for reviewing the code!
REx v5.38 is now online and it includes the -basex option. Use it as follows:
- create some grammar G.ebnf, e.g. by using a sample grammar from http://bottlecaps.de/rex/ - create a parser by generating it on http://bottlecaps.de/rex/ using command line options: -tree -java -basex - compile G.java and make class files available to BaseX classpath
- invoke parser from XQuery, e.g.
declare namespace p="G"; p:parse-S($input)
where S is a start symbol of the grammar.
- for an XQuery coded parser, use REx options: -tree -xquery
- for just a syntax checker, with no parse tree, omit option: -tree
If you encounter any problems with this, please let me know.
Also please note that I have restored the type of p:match2 in generated XQuery code, which cause bad performance problem with earlier BaseX versions (no longer with 8.4.2). p:transition and some others still come without a type spec, but they are not known to cause any problem and this saves me adding another case distinction to code generation.
Best regards Gunther
Gesendet: Montag, 04. April 2016 um 19:36 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,
Your code looks excellent. And I was amazed to see you’ve even found out how to build the error node via our MemBuilder code (which is not really well-documented, compared to real APIs)… I’ll be happy to play around with it and give more feedback once the -basex option will be available.
Thanks! Christian
…works fine!
On Mon, Apr 4, 2016 at 8:57 PM, Gunther Rademacher grd@gmx.net wrote:
Hi Christian,
thanks for reviewing the code!
REx v5.38 is now online and it includes the -basex option. Use it as follows:
create some grammar G.ebnf, e.g. by using a sample grammar from http://bottlecaps.de/rex/
create a parser by generating it on http://bottlecaps.de/rex/ using command line options: -tree -java -basex
compile G.java and make class files available to BaseX classpath
invoke parser from XQuery, e.g.
declare namespace p="G"; p:parse-S($input)
where S is a start symbol of the grammar.
for an XQuery coded parser, use REx options: -tree -xquery
for just a syntax checker, with no parse tree, omit option: -tree
If you encounter any problems with this, please let me know.
Also please note that I have restored the type of p:match2 in generated XQuery code, which cause bad performance problem with earlier BaseX versions (no longer with 8.4.2). p:transition and some others still come without a type spec, but they are not known to cause any problem and this saves me adding another case distinction to code generation.
Best regards Gunther
Gesendet: Montag, 04. April 2016 um 19:36 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,
Your code looks excellent. And I was amazed to see you’ve even found out how to build the error node via our MemBuilder code (which is not really well-documented, compared to real APIs)… I’ll be happy to play around with it and give more feedback once the -basex option will be available.
Thanks! Christian
basex-talk@mailman.uni-konstanz.de