BaseX optimizer performance on REx-generated parser

List overview All Threads
Download

newer

older

XQuery Introduction in French

Problem with module-import via...

Gunther Rademacher

28 Mar 2016 28 Mar '16

3:16 p.m.

Hi,

in a recent update of REx parser generator (v5.37), some type specifications were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit type check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).

Now it was reported that this change causes performance problems with BaseX (see https://twitter.com/apb1704/status/714219874441146368 ).

I reproduced the problem as follows:

- download the XQuery 3.1 grammar from http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

- generate a parser from it on http://bottlecaps.de/rex/ using command line options:

-xquery -tree -main

- run the generated parser, using this command line

basex -binput={42} CR-xquery-31-20151217.xquery

This works OK, but it takes about 80 seconds. Some analysis showed that the time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:

declare function p:map2($c as xs:integer, $lo as xs:integer, $hi as xs:integer) as xs:integer

This variant completes in less than 3 seconds. But even when declaring a return type of

as xs:integer?

which might be the type that can be inferred statically, it completes fast.

Is this possibly a problem with the optimizer?

Which variant of generated code would be preferable for BaseX - typed or untyped?

Thanks, Gunther

Show replies by date

Christian Grün

29 Mar 29 Mar

11:24 a.m.

Hi Gunther,

Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).

Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?

All the best, Christian

...

in a recent update of REx parser generator (v5.37), some type specifications were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit type check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).

Now it was reported that this change causes performance problems with BaseX (see https://twitter.com/apb1704/status/714219874441146368 ).

I reproduced the problem as follows:

download the XQuery 3.1 grammar from

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a parser from it on http://bottlecaps.de/rex/ using command

line options:
   -xquery -tree -main
run the generated parser, using this command line

basex -binput={42} CR-xquery-31-20151217.xquery

This works OK, but it takes about 80 seconds. Some analysis showed that the time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
   declare function p:map2($c as xs:integer, $lo as xs:integer, $hi as
xs:integer) as xs:integer

This variant completes in less than 3 seconds. But even when declaring a return type of
      as xs:integer?
which might be the type that can be inferred statically, it completes fast.

Is this possibly a problem with the optimizer?

Which variant of generated code would be preferable for BaseX - typed or untyped?

Thanks, Gunther

Andy Bunce

11:54 a.m.

...

Did you hear sth. about the performance of MarkLogic

My understanding is that on ML the presence typing blocks some tail-recursion optimizations leading to stack overflow for large texts.

Stack overflow is also an issue for my use of REx with BaseX, I would like to be able to parse sources like the FunctX library without having to set a large Java stack size. [1] (My report that the issue was fixed was incorrect). I would like to investigate if BaseX is using tail-recursion optimizations here and/or where/why this overflow happens - but it all looks very complicated :-).

I suspect the users of XQuery REx are capable of hand tweaking the REx o/p either to add or remove type declarations once the issues are well documented.

/Andy

[1] http://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg07212.ht...

On 29 March 2016 at 16:24, Christian Grün christian.gruen@gmail.com wrote:

...

Hi Gunther,

Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).

Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?

All the best, Christian

...
in a recent update of REx parser generator (v5.37), some type

specifications

...
were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit

type

...
check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).

Now it was reported that this change causes performance problems with

BaseX

...
(see https://twitter.com/apb1704/status/714219874441146368 ).

I reproduced the problem as follows:

download the XQuery 3.1 grammar from

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a parser from it on http://bottlecaps.de/rex/ using command

line options:
   -xquery -tree -main
run the generated parser, using this command line

basex -binput={42} CR-xquery-31-20151217.xquery

This works OK, but it takes about 80 seconds. Some analysis showed that
the

...
time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
   declare function p:map2($c as xs:integer, $lo as xs:integer, $hi
as

...
xs:integer) as xs:integer

This variant completes in less than 3 seconds. But even when declaring a return type of
      as xs:integer?
which might be the type that can be inferred statically, it completes
fast.

...
Is this possibly a problem with the optimizer?

Which variant of generated code would be preferable for BaseX - typed or untyped?

Thanks, Gunther

Andy Bunce

1:01 p.m.

Looking again at Florent's message http://markmail.org/message/gxi26da4crk2v5ge... After testing it seems is the length of the XQuery comments that trigger the stack overflow in BaseX as well. /Andy

On 29 March 2016 at 16:54, Andy Bunce bunce.andy@gmail.com wrote:

...

...
Did you hear sth. about the performance of MarkLogic

My understanding is that on ML the presence typing blocks some tail-recursion optimizations leading to stack overflow for large texts.

Stack overflow is also an issue for my use of REx with BaseX, I would like to be able to parse sources like the FunctX library without having to set a large Java stack size. [1] (My report that the issue was fixed was incorrect). I would like to investigate if BaseX is using tail-recursion optimizations here and/or where/why this overflow happens - but it all looks very complicated :-).

I suspect the users of XQuery REx are capable of hand tweaking the REx o/p either to add or remove type declarations once the issues are well documented.

/Andy

[1] http://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg07212.ht...

On 29 March 2016 at 16:24, Christian Grün christian.gruen@gmail.com wrote:

...
Hi Gunther,

Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).

Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?

All the best, Christian

...
in a recent update of REx parser generator (v5.37), some type

specifications

...
were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit

type

...
check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).

Now it was reported that this change causes performance problems with

BaseX

...
(see https://twitter.com/apb1704/status/714219874441146368 ).

I reproduced the problem as follows:

download the XQuery 3.1 grammar from

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a parser from it on http://bottlecaps.de/rex/ using

command

...
line options:
   -xquery -tree -main
run the generated parser, using this command line

basex -binput={42} CR-xquery-31-20151217.xquery

This works OK, but it takes about 80 seconds. Some analysis showed that
the

...
time can be influenced by putting back the type specification to

function

...
p:map2, i.e. declaring 'as xs:integer' as the result type:
   declare function p:map2($c as xs:integer, $lo as xs:integer, $hi
as

...
xs:integer) as xs:integer

This variant completes in less than 3 seconds. But even when declaring a return type of
      as xs:integer?
which might be the type that can be inferred statically, it completes
fast.

...
Is this possibly a problem with the optimizer?

Which variant of generated code would be preferable for BaseX - typed or untyped?

Thanks, Gunther

Christian Grün

1:05 p.m.

So we may probably need to find out if the query can be optimized at all…

On Tue, Mar 29, 2016 at 7:01 PM, Andy Bunce bunce.andy@gmail.com wrote:

...

Looking again at Florent's message http://markmail.org/message/gxi26da4crk2v5ge... After testing it seems is the length of the XQuery comments that trigger the stack overflow in BaseX as well. /Andy

On 29 March 2016 at 16:54, Andy Bunce bunce.andy@gmail.com wrote:

...
...
Did you hear sth. about the performance of MarkLogic

My understanding is that on ML the presence typing blocks some tail-recursion optimizations leading to stack overflow for large texts.

Stack overflow is also an issue for my use of REx with BaseX, I would like to be able to parse sources like the FunctX library without having to set a large Java stack size. [1] (My report that the issue was fixed was incorrect). I would like to investigate if BaseX is using tail-recursion optimizations here and/or where/why this overflow happens - but it all looks very complicated :-).

I suspect the users of XQuery REx are capable of hand tweaking the REx o/p either to add or remove type declarations once the issues are well documented.

/Andy

[1] http://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg07212.ht...

On 29 March 2016 at 16:24, Christian Grün christian.gruen@gmail.com wrote:

...
Hi Gunther,

Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).

Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?

All the best, Christian

...
in a recent update of REx parser generator (v5.37), some type specifications were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit type check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).

Now it was reported that this change causes performance problems with BaseX (see https://twitter.com/apb1704/status/714219874441146368 ).

I reproduced the problem as follows:

download the XQuery 3.1 grammar from

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a parser from it on http://bottlecaps.de/rex/ using

command line options:
   -xquery -tree -main
run the generated parser, using this command line

basex -binput={42} CR-xquery-31-20151217.xquery

This works OK, but it takes about 80 seconds. Some analysis showed that the time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
   declare function p:map2($c as xs:integer, $lo as xs:integer, $hi
as xs:integer) as xs:integer

This variant completes in less than 3 seconds. But even when declaring a return type of
      as xs:integer?
which might be the type that can be inferred statically, it completes fast.

Is this possibly a problem with the optimizer?

Which variant of generated code would be preferable for BaseX - typed or untyped?

Thanks, Gunther

Andy Bunce

1:17 p.m.

Hi Gunther, Do I under understand correctly from: https://twitter.com/__Gunther__/status/709744679361912832

Given the way the EBNF for XQuery comments is defined - If (REx) p:transition is not optimized via tail-recursion then long comments will use excessive stack? /Andy

On 29 March 2016 at 18:05, Christian Grün christian.gruen@gmail.com wrote:

...

So we may probably need to find out if the query can be optimized at all…

On Tue, Mar 29, 2016 at 7:01 PM, Andy Bunce bunce.andy@gmail.com wrote:

...
Looking again at Florent's message http://markmail.org/message/gxi26da4crk2v5ge... After testing it seems is the length of the XQuery comments that trigger

the

...
stack overflow in BaseX as well. /Andy

On 29 March 2016 at 16:54, Andy Bunce bunce.andy@gmail.com wrote:

...
...
Did you hear sth. about the performance of MarkLogic

My understanding is that on ML the presence typing blocks some tail-recursion optimizations leading to stack overflow for large texts.

Stack overflow is also an issue for my use of REx with BaseX, I would like to be able to parse sources like the FunctX library without

having to

...
...
set a large Java stack size. [1] (My report that the issue was fixed was incorrect). I would like to investigate if BaseX is using tail-recursion optimizations here and/or where/why this overflow happens - but it all

looks

...
...
very complicated :-).

I suspect the users of XQuery REx are capable of hand tweaking the REx

o/p

...
...
either to add or remove type declarations once the issues are well documented.

/Andy

[1]

http://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg07212.ht...

...
...
On 29 March 2016 at 16:24, Christian Grün christian.gruen@gmail.com wrote:

...
Hi Gunther,

Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).

Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?

All the best, Christian

...
in a recent update of REx parser generator (v5.37), some type specifications were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit type check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).

Now it was reported that this change causes performance problems with BaseX (see https://twitter.com/apb1704/status/714219874441146368 ).

I reproduced the problem as follows:

download the XQuery 3.1 grammar from

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a parser from it on http://bottlecaps.de/rex/ using

command line options:
   -xquery -tree -main
run the generated parser, using this command line

basex -binput={42} CR-xquery-31-20151217.xquery

This works OK, but it takes about 80 seconds. Some analysis showed
that

...
...
...
...
the time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
   declare function p:map2($c as xs:integer, $lo as xs:integer,
$hi

...
...
...
...
as xs:integer) as xs:integer

This variant completes in less than 3 seconds. But even when

declaring

...
...
...
...
a return type of
      as xs:integer?
which might be the type that can be inferred statically, it completes fast.

Is this possibly a problem with the optimizer?

Which variant of generated code would be preferable for BaseX - typed or untyped?

Thanks, Gunther

Gunther Rademacher

5:08 p.m.

Hi Andy,

that's correct. p:transition ought to be tail-call optimized to be prepared for tokens of arbitrary size.

Best regards Gunther

On 29.03.2016 19:17, Andy Bunce wrote:

...

Hi Gunther, Do I under understand correctly from: https://twitter.com/__Gunther__/status/709744679361912832

Given the way the EBNF for XQuery comments is defined - If (REx) p:transition is not optimized via tail-recursion then long comments will use excessive stack? /Andy

On 29 March 2016 at 18:05, Christian Grün <christian.gruen@gmail.com mailto:christian.gruen@gmail.com> wrote:

So we may probably need to find out if the query can be optimized at
all…


On Tue, Mar 29, 2016 at 7:01 PM, Andy Bunce <bunce.andy@gmail.com
<mailto:bunce.andy@gmail.com>> wrote:
 > Looking again at Florent's message
 > http://markmail.org/message/gxi26da4crk2v5ge...
 > After testing it seems is the length of the XQuery comments that
trigger the
 > stack overflow in BaseX as well.
 > /Andy
 >
 >
 > On 29 March 2016 at 16:54, Andy Bunce <bunce.andy@gmail.com
<mailto:bunce.andy@gmail.com>> wrote:
 >>
 >> >Did you hear sth. about the performance of MarkLogic
 >> My understanding is that on ML the presence typing blocks some
 >> tail-recursion optimizations leading to stack overflow for large
texts.
 >>
 >> Stack overflow is also an issue for my use of REx with BaseX,  I
would
 >> like to be able to parse sources like the  FunctX library
without having to
 >> set a large Java stack size. [1] (My report that the issue was
fixed was
 >> incorrect). I would like to investigate if BaseX is using
tail-recursion
 >> optimizations here and/or where/why this overflow happens - but
it all looks
 >> very complicated :-).
 >>
 >> I suspect the users of XQuery REx are capable of hand tweaking
the REx o/p
 >> either to add or remove type declarations once the issues are well
 >> documented.
 >>
 >> /Andy
 >>
 >>
 >> [1]
 >>
http://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg07212.html
 >>
 >> On 29 March 2016 at 16:24, Christian Grün
<christian.gruen@gmail.com <mailto:christian.gruen@gmail.com>>
 >> wrote:
 >>>
 >>> Hi Gunther,
 >>>
 >>> Thanks for your mail. After some research, I don’t see a quick
way to
 >>> statically infer the type in the function you mentioned, mostly
 >>> because it’s called recursively (while statically inferring the
type
 >>> of the function, the return type of map2 function is requested,
and it
 >>> will be the type of the function signature, because the
inferred type
 >>> is not available yet…).
 >>>
 >>> Personally, I would love to see the return types readded, not only
 >>> because it speeds up BaseX, but also because I think it’s a general
 >>> advantage of XQuery to have typed functions, and . Do you think
there
 >>> is any chance to revert the change? Did you hear sth. about the
 >>> performance of MarkLogic – is the version without argument types
 >>> comparable to BaseX in terms of execution time, or much faster?
 >>>
 >>> All the best,
 >>> Christian
 >>>
 >>>
 >>>
 >>> > in a recent update of REx parser generator (v5.37), some type
 >>> > specifications
 >>> > were removed from generated XQuery and XSLT functions. This
affects
 >>> > tail-recursive functions, and it was done because it turned
out that
 >>> > MarkLogic fails to optimize tail calls, presumably due to an
explicit
 >>> > type
 >>> > check caused by the type specification (see
 >>> > http://markmail.org/message/gxi26da4crk2v5ge ).
 >>> >
 >>> > Now it was reported that this change causes performance
problems with
 >>> > BaseX
 >>> > (see https://twitter.com/apb1704/status/714219874441146368 ).
 >>> >
 >>> > I reproduced the problem as follows:
 >>> >
 >>> >   - download the XQuery 3.1 grammar from
 >>> > http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
 >>> >
 >>> >   - generate a parser from it on http://bottlecaps.de/rex/ using
 >>> > command
 >>> > line options:
 >>> >
 >>> >        -xquery -tree -main
 >>> >
 >>> >   - run the generated parser, using this command line
 >>> >
 >>> >        basex -binput={42} CR-xquery-31-20151217.xquery
 >>> >
 >>> > This works OK, but it takes about 80 seconds. Some analysis
showed that
 >>> > the
 >>> > time can be influenced by putting back the type specification to
 >>> > function
 >>> > p:map2, i.e. declaring 'as xs:integer' as the result type:
 >>> >
 >>> >        declare function p:map2($c as xs:integer, $lo as
xs:integer, $hi
 >>> > as
 >>> > xs:integer) as xs:integer
 >>> >
 >>> > This variant completes in less than 3 seconds. But even when
declaring
 >>> > a
 >>> > return type of
 >>> >
 >>> >           as xs:integer?
 >>> >
 >>> > which might be the type that can be inferred statically, it
completes
 >>> > fast.
 >>> >
 >>> > Is this possibly a problem with the optimizer?
 >>> >
 >>> > Which variant of generated code would be preferable for BaseX
- typed
 >>> > or
 >>> > untyped?
 >>> >
 >>> > Thanks,
 >>> > Gunther
 >>
 >>
 >

Christian Grün

1:04 p.m.

Hi Andy, hi Gunther,

...

...
Did you hear sth. about the performance of MarkLogic

My understanding is that on ML the presence typing blocks some tail-recursion optimizations leading to stack overflow for large texts.

Yes, this is how I got it, too. However, I was just wondering what’s the runtime for MarkLogic for the version without return types.

I found out it seems to be generally difficult to infer the static type of recursive functions as the one we discussed. However, I found another way to speed up the parser code:

The problem was that the positional access to sequences ($p:TRANSITION[$i0 ...]) was slow if it was not possible to statically assess that $i0 will be numeric. I have now added another implementation of the filter expression for filters with simple, deterministic predicates, which checks the predicate result type at runtime.

If the return type is omitted from p:map2(), the query is now appr. 15% slower, but this might still be ok. I would still love to see the types readded; but after all, this is not my code ;) A new snapshot is available [1].

...

Stack overflow is also an issue for my use of REx with BaseX, I would like to be able to parse sources like the FunctX library without having to set a large Java stack size. [1] (My report that the issue was fixed was incorrect). I would like to investigate if BaseX is using tail-recursion optimizations here and/or where/why this overflow happens - but it all looks very complicated :-).

If you find out more, please let us know!

Christian

[1] http://files.basex.org/releases/latest/

Gunther Rademacher

5:14 p.m.

Hi Christian,

thank you for your analysis. Here is my view of this:

- generally I would second your preference for typed code. It feels cleaner to me, and that's why the type specifications were there originally.

- tail call optimization is crucial for REx generated code. It is necessary to iteratively assemble result sequences while tokenizing/parsing, where these iterations require maintenance of the tokenizer/parser's state. The concrete tasks that use it are

1) assembling token value (p:transition), e.g. comment contents 2) handling EBNF iteration (ZeroOrMore, OneOrMore) - long lists 3) skipping over multiple whitespace items (p:matchW) 3) creating error token lists (p:token) - limited recursion 4) binary search (as in p:map2) - very limited recursion

The first three can hardly go without TCO, the others are rather limited by the number of tokens, or log2 of the number of character classes.

- REx code generation for XQuery was originally developed on Saxon, and I am pretty sure that all of the above are in favor of tail call optimization when run on Saxon, with or without the type spec.

- I have never worked with MarkLogic myself, so no experience here.

- when I ran some tests with BaseX, my impression was that tail call optimization can be achieved for p:transition. Not sure, though.

- when I dropped the type spec, I was hoping not to create other problems with it. Apparently I did, and that is reason enough to rollback this soon. I may create a new "ML" option to address their optimizer's behavior - though I'd prefer to find a common solution.

- to be honest, I am not convinced that XQuery is the ideal language for coding parsers. I was just exploring whether it could be done, when I started on this, and I found that once you have a parse tree in XDM, you can do a lot of interesting tasks on it.

- if anybody was interested, I could generate Java (or other) extension functions for some XQuery processor, like I do for Saxon. That would combine high performance with ease of use. I would love to see Adam Retter's proposal for portable Expath extension functions ( http://www.adamretter.org.uk/presentations/implementation-of-portable-expath... ) go forward. REx is prepared for that, a Haxe code generator for is already available now. Just needs to be fitted to the completed Expath Haxe API, whenever this happens.

Sorry for any inconvenience that this may have caused.

Best regards Gunther

On 29.03.2016 17:24, Christian Grün wrote:

...

Hi Gunther,

Thanks for your mail. After some research, I don’t see a quick way to statically infer the type in the function you mentioned, mostly because it’s called recursively (while statically inferring the type of the function, the return type of map2 function is requested, and it will be the type of the function signature, because the inferred type is not available yet…).

Personally, I would love to see the return types readded, not only because it speeds up BaseX, but also because I think it’s a general advantage of XQuery to have typed functions, and . Do you think there is any chance to revert the change? Did you hear sth. about the performance of MarkLogic – is the version without argument types comparable to BaseX in terms of execution time, or much faster?

All the best, Christian

...
in a recent update of REx parser generator (v5.37), some type specifications were removed from generated XQuery and XSLT functions. This affects tail-recursive functions, and it was done because it turned out that MarkLogic fails to optimize tail calls, presumably due to an explicit type check caused by the type specification (see http://markmail.org/message/gxi26da4crk2v5ge ).

Now it was reported that this change causes performance problems with BaseX (see https://twitter.com/apb1704/status/714219874441146368 ).

I reproduced the problem as follows:

download the XQuery 3.1 grammar from

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a parser from it on http://bottlecaps.de/rex/ using command

line options:
    -xquery -tree -main
run the generated parser, using this command line

basex -binput={42} CR-xquery-31-20151217.xquery

This works OK, but it takes about 80 seconds. Some analysis showed that the time can be influenced by putting back the type specification to function p:map2, i.e. declaring 'as xs:integer' as the result type:
    declare function p:map2($c as xs:integer, $lo as xs:integer, $hi as
xs:integer) as xs:integer

This variant completes in less than 3 seconds. But even when declaring a return type of
       as xs:integer?
which might be the type that can be inferred statically, it completes fast.

Is this possibly a problem with the optimizer?

Which variant of generated code would be preferable for BaseX - typed or untyped?

Thanks, Gunther

Christian Grün

30 Mar 30 Mar

4:29 a.m.

Hi Gunther,

Thanks for your comprehensive feedback!

...

when I ran some tests with BaseX, my impression was that tail call optimization can be achieved for p:transition. Not sure, though.

Out of interest, I checked our query plan (using -x on command-line). As you already guessed, p:transition is tail-call optimized, as well as p:transition, p:matchW and p:token (and various other functions), no matter if return types are used or not.

...

to be honest, I am not convinced that XQuery is the ideal language for coding parsers. I was just exploring whether it could be done, when I started on this, and I found that once you have a parse tree in XDM, you can do a lot of interesting tasks on it.

Yes, it works very well, and it’s interesting to see how many people use it for all kinds of things. In practice, it’s mostly the parsing time that takes time, so I hope we’ll soon manage to keep pre-compiled XQuery modules in main-memory. We are working on that.

...

if anybody was interested, I could generate Java (or other) extension functions for some XQuery processor, like I do for Saxon.

This sounds interesting indeed! Could you provide us with a link to the Saxon extension functions?

...

That would combine high performance with ease of use. I would love
to see Adam Retter's proposal for portable Expath extension
functions
(
http://www.adamretter.org.uk/presentations/implementation-of-portable-expath... ) go forward.

Me too. I’m just not sure how much time it would cost us to make it happen. I’ll put it on my mental queue…

...

Sorry for any inconvenience that this may have caused.

No reason to be sorry. Issue like that are a perfect chance to further improve our processor.

Christian

Gunther Rademacher

11:57 a.m.

Christian Grün

12:07 p.m.

Thanks, this works (I overlooked the "configure" checkbox on the web page); I could successfully compile and run your code.

The numerous imports in the Java source file seem to indicate that it won’t be straightforward to do something similar for BaseX, right? But maybe I’m wrong – if you have time to give it a try, all your questions will be welcome!

Christian

On Wed, Mar 30, 2016 at 5:57 PM, Gunther Rademacher grd@gmx.net wrote:

...

Hi Christian,

...
This sounds interesting indeed! Could you provide us with a link to the Saxon extension functions?

it's an option when generating a parser in Java on the REx website, http://bottlecaps.de/rex/

For parsing XQuery from XQuery via a Saxon extension function,

download the XQuery grammar, e.g.

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
generate a Java-coded parser from it, using these command line options
-java -tree -saxon
compile the result
javac CR_xquery_31_20151217.java
register the extension function, e.g. using the generated Saxon
initializer class:
     java net.sf.saxon.Query
-init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery('42')" !indent=yes

The Saxon API that is used for building the parse tree has changed recently, so be sure to use Saxon 9.7.0.4 or later.

Best regards Gunther

Christian Grün

31 Mar 31 Mar

5:14 a.m.

Hi Gunther, hi all,

here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:

1. download the XQuery grammar, e.g. http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

2. generate a Java-coded parser from it, using these command line options -java -tree -main

3. compile the result javac CR_xquery_31_20151217.java

4. run the attached XQuery files with BaseX or Saxon EE, and with the compiled parser classes in the classpath, e.g.:

java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq

(The semicolon must be replaced with a colon on Unix/Linux-based systems).

In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?

Hope this helps, feedback is welcome, Christian

Gunther Rademacher

7:57 a.m.

Hi Christian,

great - and fairly simple.

My proposal would have been to implement a DOM tree builder on the EventHandler interface (where you use XmlSerializer), and that would not require much effort either. It need not necessarily be done in the generated code, but could as well be implemented on top.

I am busy right now, but will be able to present some code tonight.

Is there a different tree model than DOM, that you would prefer for BaseX?

By the way, the generated Saxon imports serve two purposes:

- adapting to the extension function API (necessary when using Saxon-HE) - using Saxon's native tree builder.

Best regards Gunther

-- Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all, here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery: 1. download the XQuery grammar, e.g. http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf 2. generate a Java-coded parser from it, using these command line options -java -tree -main 3. compile the result javac CR_xquery_31_20151217.java 4. run the attached XQuery files with BaseX or Saxon EE, and with the compiled parser classes in the classpath, e.g.: java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq (The semicolon must be replaced with a colon on Unix/Linux-based systems). In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be? Hope this helps, feedback is welcome, Christian

Christian Grün

9:01 a.m.

Hi Gunther,

...

I am busy right now, but will be able to present some code tonight.

Thanks! Take your time.

...

Is there a different tree model than DOM, that you would prefer for BaseX?

I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):

import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }

Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.

So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):

import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }

But I’m wondering who’ll eventually care about the difference ;)

Christian

...

By the way, the generated Saxon imports serve two purposes:

adapting to the extension function API (necessary when using Saxon-HE)

using Saxon's native tree builder.

Best regards Gunther --

Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all,

here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:

download the XQuery grammar, e.g.

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a Java-coded parser from it, using these command line options

-java -tree -main

compile the result

javac CR_xquery_31_20151217.java

run the attached XQuery files with BaseX or Saxon EE, and with the

compiled parser classes in the classpath, e.g.:

java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq

(The semicolon must be replaced with a colon on Unix/Linux-based systems).

In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?

Hope this helps, feedback is welcome, Christian

Andy Bunce

11:13 a.m.

Very nice. I have hacked a ex-path packaged version as a POC...

Install

" https://github.com/expkg-zone58/ex-xparse/releases/download/v0.2.1/xq-xparse... " !repo:install(.)

Run

import module namespace xq="test.org/ns/xq-parse";

"http://www.xqueryfunctions.com/xq/functx-1.0-doc-2007-01.xq" !unparsed-text(.) !xq:parse(.,map{"flatten":true()})

The parsing time is about 250ms for me and no stack overflow. /Andy

On 31 March 2016 at 14:01, Christian Grün christian.gruen@gmail.com wrote:

...

Hi Gunther,

...
I am busy right now, but will be able to present some code tonight.

Thanks! Take your time.

...
Is there a different tree model than DOM, that you would prefer for

BaseX?

I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):

import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }

Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.

So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):

import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }

But I’m wondering who’ll eventually care about the difference ;)

Christian

...
By the way, the generated Saxon imports serve two purposes:

adapting to the extension function API (necessary when using

Saxon-HE)

...

using Saxon's native tree builder.

Best regards Gunther --

Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on

REx-generated parser

...
Hi Gunther, hi all,

here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:

download the XQuery grammar, e.g.

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a Java-coded parser from it, using these command line options

-java -tree -main

compile the result

javac CR_xquery_31_20151217.java

run the attached XQuery files with BaseX or Saxon EE, and with the

compiled parser classes in the classpath, e.g.:

java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq

(The semicolon must be replaced with a colon on Unix/Linux-based

systems).

...
In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?

Hope this helps, feedback is welcome, Christian

Gunther Rademacher

6:32 p.m.

Hi Christian,

please find my code attached. I have tested it along with an XQuery 3.1 parser, that was generated using command line options:

-tree -main -java -saxon It contains the DOM tree builder, as well as your approach using XmlSerializer followed by XML parsing, both for BaseX and for Saxon.

In my tests I have parsed the XQuery code for the same grammar, roughly 1 MB, and counted nodes of the parse tree.

These are the commands that I have used:

java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDBNode(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -qs:"declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:XQueryParser$SaxonInitializer -qs:"declare namespace p='XQueryParser'; p:parseXQueryToNodeInfo(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"

And here are the results (best runtime in seconds out of several executions):

| BaseX | SaxonEE ---------------+-----------+------------ DOM builder | 4.48 | 2.98 parseXml | 3.57 | 3.24 native builder | - | 2.36

As you expected, using DOM seems not to be advantageous for BaseX. However the Saxon results suggest that a native tree builder API can do better than parsing XML.

Best regards Gunther

Gesendet: Donnerstag, 31. März 2016 um 15:01 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,

...

I am busy right now, but will be able to present some code tonight.

Thanks! Take your time.

...

Is there a different tree model than DOM, that you would prefer for BaseX?

import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }

So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):

import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }

But I’m wondering who’ll eventually care about the difference ;)

Christian

...

By the way, the generated Saxon imports serve two purposes:

adapting to the extension function API (necessary when using Saxon-HE)

using Saxon's native tree builder.

Best regards Gunther --

Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all,

here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:

download the XQuery grammar, e.g.

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a Java-coded parser from it, using these command line options

-java -tree -main

compile the result

javac CR_xquery_31_20151217.java

run the attached XQuery files with BaseX or Saxon EE, and with the

compiled parser classes in the classpath, e.g.:

java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq

(The semicolon must be replaced with a colon on Unix/Linux-based systems).

In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?

Hope this helps, feedback is welcome, Christian

Christian Grün

1 Apr 1 Apr

11:57 a.m.

Hi Gunther,

Thanks again! Thanks to your examples, which create 38 MB of serialized XML, I now see why it is in fact beneficial to use a tree builder ;)

I finally looked at your Saxon code a bit closer, and I rewrote it a bit to work with BaseX:

* I added a parse(String query) function, which basically does what ExtensionFunctionCall.call does * I renamed SaxonTreeBuilder to BaseXTreeBuilder, which now calls the appropriate BaseX builder functions * The TopDownTreeBuilder stays unchanged

I have attached the resulting code; it seems to be much faster indeed. Does it make any sense to you? Do you think it would make sense to provide both a Saxon and BaseX option on your parser page?

Christian

On Fri, Apr 1, 2016 at 12:32 AM, Gunther Rademacher grd@gmx.net wrote:

...

Hi Christian,

please find my code attached. I have tested it along with an XQuery 3.1 parser, that was generated using command line options:

-tree -main -java -saxon

It contains the DOM tree builder, as well as your approach using XmlSerializer followed by XML parsing, both for BaseX and for Saxon.

In my tests I have parsed the XQuery code for the same grammar, roughly 1 MB, and counted nodes of the parse tree.

These are the commands that I have used:
java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDBNode(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
java net.sf.saxon.Query -qs:"declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
java net.sf.saxon.Query -init:XQueryParser$SaxonInitializer -qs:"declare namespace p='XQueryParser'; p:parseXQueryToNodeInfo(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
java net.sf.saxon.Query -init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
And here are the results (best runtime in seconds out of several executions):
                |   BaseX   |   SaxonEE
 ---------------+-----------+------------
 DOM builder    |    4.48   |      2.98
 parseXml       |    3.57   |      3.24
 native builder |       -   |      2.36
As you expected, using DOM seems not to be advantageous for BaseX. However the Saxon results suggest that a native tree builder API can do better than parsing XML.

Best regards Gunther

Gesendet: Donnerstag, 31. März 2016 um 15:01 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,

...
I am busy right now, but will be able to present some code tonight.

Thanks! Take your time.

...
Is there a different tree model than DOM, that you would prefer for BaseX?

I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):

import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }

Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.

So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):

import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }

But I’m wondering who’ll eventually care about the difference ;)

Christian

...
By the way, the generated Saxon imports serve two purposes:

adapting to the extension function API (necessary when using Saxon-HE)

using Saxon's native tree builder.

Best regards Gunther --

Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all,

here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:

download the XQuery grammar, e.g.

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a Java-coded parser from it, using these command line options

-java -tree -main

compile the result

javac CR_xquery_31_20151217.java

run the attached XQuery files with BaseX or Saxon EE, and with the

compiled parser classes in the classpath, e.g.:

java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq

(The semicolon must be replaced with a colon on Unix/Linux-based systems).

In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?

Hope this helps, feedback is welcome, Christian

Gunther Rademacher

4 Apr 4 Apr

2:07 a.m.

Hi Christian,

thank you for the tree builder proposal, it works fine indeed.

I have slightly modified the extension function such that it behaves the same as generated XQuery code, so can be used to replace it without further adaptations of the code that calls it.

Also I have used Str rather than String, in order to create a unique signature identifying a BaseX extension function.

Finally, the call of the parser's parse_x method was isolated in order to prepare for multiple extension functions in a single class. This occurs when there are multiple start symbols in a grammar.

The modified code is attached to this mail. It is stripped down to what would be added to REx-generated code for '-basex'.

Best regards Gunther

Gesendet: Freitag, 01. April 2016 um 17:57 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,

Thanks again! Thanks to your examples, which create 38 MB of serialized XML, I now see why it is in fact beneficial to use a tree builder ;)

I finally looked at your Saxon code a bit closer, and I rewrote it a bit to work with BaseX:

I have attached the resulting code; it seems to be much faster indeed. Does it make any sense to you? Do you think it would make sense to provide both a Saxon and BaseX option on your parser page?

Christian

On Fri, Apr 1, 2016 at 12:32 AM, Gunther Rademacher grd@gmx.net wrote:

...

Hi Christian,

please find my code attached. I have tested it along with an XQuery 3.1 parser, that was generated using command line options:

-tree -main -java -saxon

It contains the DOM tree builder, as well as your approach using XmlSerializer followed by XML parsing, both for BaseX and for Saxon.

In my tests I have parsed the XQuery code for the same grammar, roughly 1 MB, and counted nodes of the parse tree.

These are the commands that I have used:

java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDBNode(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -qs:"declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:XQueryParser$SaxonInitializer -qs:"declare namespace p='XQueryParser'; p:parseXQueryToNodeInfo(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"

And here are the results (best runtime in seconds out of several executions):

| BaseX | SaxonEE ---------------+-----------+------------ DOM builder | 4.48 | 2.98 parseXml | 3.57 | 3.24 native builder | - | 2.36

As you expected, using DOM seems not to be advantageous for BaseX. However the Saxon results suggest that a native tree builder API can do better than parsing XML.

Best regards Gunther

Gesendet: Donnerstag, 31. März 2016 um 15:01 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,

...
I am busy right now, but will be able to present some code tonight.

Thanks! Take your time.

...
Is there a different tree model than DOM, that you would prefer for BaseX?

I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):

import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }

Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.

So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):

import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }

But I’m wondering who’ll eventually care about the difference ;)

Christian

...
By the way, the generated Saxon imports serve two purposes:

adapting to the extension function API (necessary when using Saxon-HE)

using Saxon's native tree builder.

Best regards Gunther --

Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all,

here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:

download the XQuery grammar, e.g.

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a Java-coded parser from it, using these command line options

-java -tree -main

compile the result

javac CR_xquery_31_20151217.java

run the attached XQuery files with BaseX or Saxon EE, and with the

compiled parser classes in the classpath, e.g.:

java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq

(The semicolon must be replaced with a colon on Unix/Linux-based systems).

In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?

Hope this helps, feedback is welcome, Christian

Christian Grün

1:36 p.m.

Hi Gunther,

Your code looks excellent. And I was amazed to see you’ve even found out how to build the error node via our MemBuilder code (which is not really well-documented, compared to real APIs)… I’ll be happy to play around with it and give more feedback once the -basex option will be available.

Thanks! Christian

On Mon, Apr 4, 2016 at 8:07 AM, Gunther Rademacher grd@gmx.net wrote:

...

Hi Christian,

thank you for the tree builder proposal, it works fine indeed.

I have slightly modified the extension function such that it behaves the same as generated XQuery code, so can be used to replace it without further adaptations of the code that calls it.

Also I have used Str rather than String, in order to create a unique signature identifying a BaseX extension function.

Finally, the call of the parser's parse_x method was isolated in order to prepare for multiple extension functions in a single class. This occurs when there are multiple start symbols in a grammar.

The modified code is attached to this mail. It is stripped down to what would be added to REx-generated code for '-basex'.

Best regards Gunther

Gesendet: Freitag, 01. April 2016 um 17:57 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,

Thanks again! Thanks to your examples, which create 38 MB of serialized XML, I now see why it is in fact beneficial to use a tree builder ;)

I finally looked at your Saxon code a bit closer, and I rewrote it a bit to work with BaseX:

I added a parse(String query) function, which basically does what

ExtensionFunctionCall.call does

I renamed SaxonTreeBuilder to BaseXTreeBuilder, which now calls the

appropriate BaseX builder functions

The TopDownTreeBuilder stays unchanged

I have attached the resulting code; it seems to be much faster indeed. Does it make any sense to you? Do you think it would make sense to provide both a Saxon and BaseX option on your parser page?

Christian

On Fri, Apr 1, 2016 at 12:32 AM, Gunther Rademacher grd@gmx.net wrote:

...
Hi Christian,

please find my code attached. I have tested it along with an XQuery 3.1 parser, that was generated using command line options:

-tree -main -java -saxon

It contains the DOM tree builder, as well as your approach using XmlSerializer followed by XML parsing, both for BaseX and for Saxon.

In my tests I have parsed the XQuery code for the same grammar, roughly 1 MB, and counted nodes of the parse tree.

These are the commands that I have used:

java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDBNode(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -qs:"declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:XQueryParser$SaxonInitializer -qs:"declare namespace p='XQueryParser'; p:parseXQueryToNodeInfo(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())" java net.sf.saxon.Query -init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"

And here are the results (best runtime in seconds out of several executions):

| BaseX | SaxonEE ---------------+-----------+------------ DOM builder | 4.48 | 2.98 parseXml | 3.57 | 3.24 native builder | - | 2.36

As you expected, using DOM seems not to be advantageous for BaseX. However the Saxon results suggest that a native tree builder API can do better than parsing XML.

Best regards Gunther

Gesendet: Donnerstag, 31. März 2016 um 15:01 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,

...
I am busy right now, but will be able to present some code tonight.

Thanks! Take your time.

...
Is there a different tree model than DOM, that you would prefer for BaseX?

I assume that the difference between DOM and String inputs will be marginal. If the method will be called from XQuery, one the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):

import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<xml/>"; return new DBNode(IO.get(input)); }

Thinking about this, I noticed that my previous parse-xquery.xq example will be executed faster (from 5ms to 2ms if executed repeatedly) if fn:parse-xml is replaced with with fn:parse-xml-fragment. This is why our internal XML parser instead of Java’s default XML parser is used for the second function.

So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):

import org.basex.build.xml.XMLParser; import org.basex.core.MainOptions; import org.basex.io.IO; import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception { String input = "<x/>"; XMLParser parser = new XMLParser(IO.get(input), MainOptions.get()); return new DBNode(parser); }

But I’m wondering who’ll eventually care about the difference ;)

Christian

...
By the way, the generated Saxon imports serve two purposes:

adapting to the extension function API (necessary when using Saxon-HE)

using Saxon's native tree builder.

Best regards Gunther --

Gesendet: Donnerstag, 31. März 2016 um 11:14 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther, hi all,

here is a straightforward (yet I somewhat hacky) way to invoke the ReX Java parser code from XQuery:

download the XQuery grammar, e.g.

http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf

generate a Java-coded parser from it, using these command line options

-java -tree -main

compile the result

javac CR_xquery_31_20151217.java

run the attached XQuery files with BaseX or Saxon EE, and with the

compiled parser classes in the classpath, e.g.:

java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq

(The semicolon must be replaced with a colon on Unix/Linux-based systems).

In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster when embedding some native BaseX code in the ReX Parser Generator; but I don’t know how much effort this will be?

Hope this helps, feedback is welcome, Christian

Gunther Rademacher

2:57 p.m.

Hi Christian,

thanks for reviewing the code!

REx v5.38 is now online and it includes the -basex option. Use it as follows:

- create some grammar G.ebnf, e.g. by using a sample grammar from http://bottlecaps.de/rex/ - create a parser by generating it on http://bottlecaps.de/rex/ using command line options: -tree -java -basex - compile G.java and make class files available to BaseX classpath

- invoke parser from XQuery, e.g.

declare namespace p="G"; p:parse-S($input)

where S is a start symbol of the grammar.

- for an XQuery coded parser, use REx options: -tree -xquery

- for just a syntax checker, with no parse tree, omit option: -tree

If you encounter any problems with this, please let me know.

Also please note that I have restored the type of p:match2 in generated XQuery code, which cause bad performance problem with earlier BaseX versions (no longer with 8.4.2). p:transition and some others still come without a type spec, but they are not known to cause any problem and this saves me adding another case distinction to code generation.

Best regards Gunther

Gesendet: Montag, 04. April 2016 um 19:36 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,

Thanks! Christian

Christian Grün

5 Apr 5 Apr

6:21 a.m.

…works fine!

On Mon, Apr 4, 2016 at 8:57 PM, Gunther Rademacher grd@gmx.net wrote:

...

Hi Christian,

thanks for reviewing the code!

REx v5.38 is now online and it includes the -basex option. Use it as follows:
create some grammar G.ebnf, e.g. by using a sample grammar from http://bottlecaps.de/rex/

create a parser by generating it on http://bottlecaps.de/rex/ using command line options: -tree -java -basex

compile G.java and make class files available to BaseX classpath
invoke parser from XQuery, e.g.
declare namespace p="G";
p:parse-S($input)
where S is a start symbol of the grammar.
for an XQuery coded parser, use REx options: -tree -xquery

for just a syntax checker, with no parse tree, omit option: -tree
If you encounter any problems with this, please let me know.

Also please note that I have restored the type of p:match2 in generated XQuery code, which cause bad performance problem with earlier BaseX versions (no longer with 8.4.2). p:transition and some others still come without a type spec, but they are not known to cause any problem and this saves me adding another case distinction to code generation.

Best regards Gunther

Gesendet: Montag, 04. April 2016 um 19:36 Uhr Von: "Christian Grün" christian.gruen@gmail.com An: "Gunther Rademacher" grd@gmx.net Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] BaseX optimizer performance on REx-generated parser Hi Gunther,

Your code looks excellent. And I was amazed to see you’ve even found out how to build the error node via our MemBuilder code (which is not really well-documented, compared to real APIs)… I’ll be happy to play around with it and give more feedback once the -basex option will be available.

Thanks! Christian

3390

Age (days ago)

3398

Last active (days ago)

basex-talk@mailman.uni-konstanz.de

21 comments

3 participants

tags (0)

participants (3)

Andy Bunce
Christian Grün
Gunther Rademacher