Re: [basex-talk] BaseX optimizer performance on REx-generated parser

29 Mar 2016

      Hi Christian,
thank you for your analysis. Here is my view of this:
- generally I would second your preference for typed code. It feels
     cleaner to me, and that's why the type specifications were there
     originally.
- tail call optimization is crucial for REx generated code. It is
     necessary to iteratively assemble result sequences while
     tokenizing/parsing, where these iterations require maintenance
     of the tokenizer/parser's state. The concrete tasks that use it
     are
1) assembling token value (p:transition), e.g. comment contents
       2) handling EBNF iteration (ZeroOrMore, OneOrMore) - long lists
       3) skipping over multiple whitespace items (p:matchW)
       3) creating error token lists (p:token) - limited recursion
       4) binary search (as in p:map2) - very limited recursion
The first three can hardly go without TCO, the others are rather
     limited by the number of tokens, or log2 of the number of character
     classes.
- REx code generation for XQuery was originally developed on Saxon,
     and I am pretty sure that all of the above are in favor of tail
     call optimization when run on Saxon, with or without the type
     spec.
- I have never worked with MarkLogic myself, so no experience here.
- when I ran some tests with BaseX, my impression was that tail call
     optimization can be achieved for p:transition. Not sure, though.
- when I dropped the type spec, I was hoping not to create other
     problems with it. Apparently I did, and that is reason enough
     to rollback this soon. I may create a new "ML" option to address
     their optimizer's behavior - though I'd prefer to find a common
     solution.
- to be honest, I am not convinced that XQuery is the ideal language
     for coding parsers. I was just exploring whether it could be done,
     when I started on this, and I found that once you have a parse
     tree in XDM, you can do a lot of interesting tasks on it.
- if anybody was interested, I could generate Java (or other)
     extension functions for some XQuery processor, like I do for Saxon.
     That would combine high performance with ease of use. I would love
     to see Adam Retter's proposal for portable Expath extension
     functions
     ( 
http://www.adamretter.org.uk/presentations/implementation-of-portable-expath... 
)
     go forward. REx is prepared for that, a Haxe code generator
     for is already available now. Just needs to be fitted to
     the completed Expath Haxe API, whenever this happens.
Sorry for any inconvenience that this may have caused.
Best regards
Gunther
On 29.03.2016 17:24, Christian Grün wrote:
...
Hi Gunther,
Thanks for your mail. After some research, I don’t see a quick way to
statically infer the type in the function you mentioned, mostly
because it’s called recursively (while statically inferring the type
of the function, the return type of map2 function is requested, and it
will be the type of the function signature, because the inferred type
is not available yet…).
Personally, I would love to see the return types readded, not only
because it speeds up BaseX, but also because I think it’s a general
advantage of XQuery to have typed functions, and . Do you think there
is any chance to revert the change? Did you hear sth. about the
performance of MarkLogic – is the version without argument types
comparable to BaseX in terms of execution time, or much faster?
All the best,
Christian
...
in a recent update of REx parser generator (v5.37), some type specifications
were removed from generated XQuery and XSLT functions. This affects
tail-recursive functions, and it was done because it turned out that
MarkLogic fails to optimize tail calls, presumably due to an explicit type
check caused by the type specification (see
http://markmail.org/message/gxi26da4crk2v5ge ).
Now it was reported that this change causes performance problems with BaseX
(see https://twitter.com/apb1704/status/714219874441146368 ).
I reproduced the problem as follows:

download the XQuery 3.1 grammar from

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a parser from it on http://bottlecaps.de/rex/ using command

line options:
    -xquery -tree -main

run the generated parser, using this command line
basex -binput={42} CR-xquery-31-20151217.xquery

This works OK, but it takes about 80 seconds. Some analysis showed that the
time can be influenced by putting back the type specification to function
p:map2, i.e. declaring 'as xs:integer' as the result type:
    declare function p:map2($c as xs:integer, $lo as xs:integer, $hi as

xs:integer) as xs:integer
This variant completes in less than 3 seconds. But even when declaring a
return type of
       as xs:integer?

which might be the type that can be inferred statically, it completes fast.
Is this possibly a problem with the optimizer?
Which variant of generated code would be preferable for BaseX - typed or
untyped?
Thanks,
Gunther

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: [basex-talk] BaseX optimizer performance on REx-generated parser