Hi Christian,
thank you for the tree builder proposal; it works fine indeed.
I have slightly modified the extension function so that it behaves the same as the generated XQuery code and can replace it without further changes to the code that calls it.
Also I have used Str rather than String, in order to create a unique signature identifying a BaseX extension function.
Finally, the call of the parser's parse_x method was isolated in order to prepare for multiple extension functions in a single class. This occurs when there are multiple start symbols in a grammar.
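A minimal sketch of that arrangement, not the attached code: one public entry point per start symbol, all delegating to a single shared helper that receives the parse call as a parameter. SketchParser, its two parse_ methods and SketchExtension are hypothetical stand-ins, and plain String is used here only so the sketch runs without BaseX on the classpath.

```java
import java.util.function.Consumer;

// Hypothetical stand-in for a generated parser with two start symbols.
class SketchParser {
  private final String input;
  private final StringBuilder out = new StringBuilder();

  SketchParser(String input) { this.input = input; }

  // Stand-ins for the generated parse_<start symbol> methods.
  // They only wrap the input in a label; no real parsing or escaping happens.
  void parse_XQuery() { out.append("<XQuery>").append(input).append("</XQuery>"); }
  void parse_Expr() { out.append("<Expr>").append(input).append("</Expr>"); }

  String result() { return out.toString(); }
}

class SketchExtension {
  // One extension function per start symbol.
  static String parseXQuery(String input) { return run(input, SketchParser::parse_XQuery); }
  static String parseExpr(String input) { return run(input, SketchParser::parse_Expr); }

  // Shared code path: only the isolated parse call differs between functions.
  private static String run(String input, Consumer<SketchParser> start) {
    SketchParser parser = new SketchParser(input);
    start.accept(parser);
    return parser.result();
  }
}
```

Adding a third start symbol would then need only one more one-line entry point.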
The modified code is attached to this mail. It is stripped down to what would be added to REx-generated code for '-basex'.
Best regards Gunther
Sent: Friday, 01 April 2016 at 17:57
From: "Christian Grün" christian.gruen@gmail.com
To: "Gunther Rademacher" grd@gmx.net
Cc: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] BaseX optimizer performance on REx-generated parser
Hi Gunther,
Thanks again! Thanks to your examples, which create 38 MB of serialized XML, I now see why it is in fact beneficial to use a tree builder ;)
I finally looked at your Saxon code a bit closer, and I rewrote it a bit to work with BaseX:
* I added a parse(String query) function, which basically does what ExtensionFunctionCall.call does
* I renamed SaxonTreeBuilder to BaseXTreeBuilder, which now calls the appropriate BaseX builder functions
* The TopDownTreeBuilder stays unchanged
I have attached the resulting code; it seems to be much faster indeed. Does it make any sense to you? Do you think it would make sense to provide both a Saxon and BaseX option on your parser page?
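To illustrate the general idea of swapping the builder behind a fixed set of callbacks, here is a rough sketch that uses only JDK classes. The TreeEvents interface and DomTreeEvents class are hypothetical names invented for this example; they are neither the attached BaseXTreeBuilder nor the callbacks REx actually generates, and the target here is a plain DOM rather than a BaseX or Saxon tree.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical callback interface; a parser would call these while parsing.
interface TreeEvents {
  void startElement(String name);
  void text(String content);
  void endElement();
}

// One possible implementation: build a DOM tree directly from the events,
// without serializing to a string and parsing it again.
class DomTreeEvents implements TreeEvents {
  private final Document doc;
  private Node current;

  DomTreeEvents() {
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException("no DOM implementation available", e);
    }
    current = doc; // new nodes are appended below the document node first
  }

  @Override public void startElement(String name) {
    // appendChild returns the appended node, which becomes the new parent.
    current = current.appendChild(doc.createElement(name));
  }

  @Override public void text(String content) {
    current.appendChild(doc.createTextNode(content));
  }

  @Override public void endElement() {
    current = current.getParentNode();
  }

  Document document() { return doc; }
}
```

A second implementation of the same interface could target another tree model while the code that fires the events stays unchanged.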
Christian
On Fri, Apr 1, 2016 at 12:32 AM, Gunther Rademacher grd@gmx.net wrote:
Hi Christian,
please find my code attached. I have tested it with an XQuery 3.1 parser that was generated using these command line options:
-tree -main -java -saxon
It contains the DOM tree builder, as well as your approach using XmlSerializer followed by XML parsing, both for BaseX and for Saxon.
In my tests I have parsed the XQuery code for the same grammar, roughly 1 MB, and counted nodes of the parse tree.
These are the commands that I have used:
java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
java org.basex.BaseX -q "declare namespace p='java:XQueryParser'; p:parseXQueryToDBNode(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
java net.sf.saxon.Query -qs:"declare namespace p='java:XQueryParser'; p:parseXQueryToDOM(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
java net.sf.saxon.Query -init:XQueryParser$SaxonInitializer -qs:"declare namespace p='XQueryParser'; p:parseXQueryToNodeInfo(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
java net.sf.saxon.Query -init:CR_xquery_31_20151217$SaxonInitializer -qs:"declare namespace p='CR_xquery_31_20151217'; p:parse-XQuery(unparsed-text('file:///C:/temp/CR-xquery-31-20151217.xquery'))/count(descendant-or-self::node())"
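For readers who want to check the measure used in these commands outside an XQuery processor, the following sketch reproduces count(descendant-or-self::node()) on a plain JDK DOM. NodeCountSketch and its two methods are hypothetical helpers written for this illustration; they are not part of the attached code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeCountSketch {
  // Parses an XML string with the JDK's default parser.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse XML", e);
    }
  }

  // Counts the node itself plus all descendants. Attributes are not child
  // nodes in DOM, which matches the descendant-or-self axis.
  static int countNodes(Node node) {
    int count = 1;
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      count += countNodes(child);
    }
    return count;
  }
}
```

Called on the document node of `<a><b>t</b></a>`, countNodes returns 4: the document node, the two elements and the text node.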
And here are the results (best runtime in seconds out of several executions):
               | BaseX     | SaxonEE
---------------+-----------+------------
DOM builder    | 4.48      | 2.98
parseXml       | 3.57      | 3.24
native builder | -         | 2.36
As you expected, using DOM seems not to be advantageous for BaseX. However the Saxon results suggest that a native tree builder API can do better than parsing XML.
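The mail does not say how the timings were taken beyond "best runtime out of several executions"; the commands above were presumably timed as whole processes. As an in-process approximation of the same best-of-N idea, a sketch might look like this (BestOf and bestSeconds are hypothetical names):

```java
class BestOf {
  // Runs the task `runs` times and returns the shortest wall-clock time in seconds.
  static double bestSeconds(int runs, Runnable task) {
    if (runs < 1) {
      throw new IllegalArgumentException("runs must be >= 1");
    }
    long best = Long.MAX_VALUE;
    for (int i = 0; i < runs; i++) {
      long start = System.nanoTime();
      task.run();
      best = Math.min(best, System.nanoTime() - start);
    }
    return best / 1e9; // nanoseconds to seconds
  }
}
```

Taking the minimum rather than the mean reduces the influence of JIT warm-up and other one-off delays, although it does not capture JVM startup cost the way a command-line measurement would.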
Best regards Gunther
Sent: Thursday, 31 March 2016 at 15:01
From: "Christian Grün" christian.gruen@gmail.com
To: "Gunther Rademacher" grd@gmx.net
Cc: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] BaseX optimizer performance on REx-generated parser
Hi Gunther,
> I am busy right now, but will be able to present some code tonight.
Thanks! Take your time.
> Is there a different tree model than DOM, that you would prefer for BaseX?
I assume that the difference between DOM and String inputs will be marginal. If the method is called from XQuery, one of the fastest solutions is probably to write everything to a temporary string or byte array and create an XQuery node representation (which is an instance of DBNode in BaseX):
import org.basex.io.IO;
import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception {
  String input = "<xml/>";
  return new DBNode(IO.get(input));
}
Thinking about this, I noticed that my previous parse-xquery.xq example runs faster (2 ms instead of 5 ms if executed repeatedly) if fn:parse-xml is replaced with fn:parse-xml-fragment. This is because the second function uses our internal XML parser instead of Java’s default XML parser.
So this version is probably the best (it is more than 10 times faster than version 1 for small XML documents):
import org.basex.build.xml.XMLParser;
import org.basex.core.MainOptions;
import org.basex.io.IO;
import org.basex.query.value.node.DBNode;

static DBNode parseXml() throws Exception {
  String input = "<x/>";
  XMLParser parser = new XMLParser(IO.get(input), MainOptions.get());
  return new DBNode(parser);
}
But I’m wondering who’ll eventually care about the difference ;)
Christian
By the way, the generated Saxon imports serve two purposes:
- adapting to the extension function API (necessary when using Saxon-HE)
- using Saxon's native tree builder.
Best regards Gunther
Sent: Thursday, 31 March 2016 at 11:14
From: "Christian Grün" christian.gruen@gmail.com
To: "Gunther Rademacher" grd@gmx.net
Cc: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: Re: [basex-talk] BaseX optimizer performance on REx-generated parser
Hi Gunther, hi all,
here is a straightforward (yet somewhat hacky) way to invoke the REx Java parser code from XQuery:
- download the XQuery grammar, e.g.
http://bottlecaps.de/rex/CR-xquery-31-20151217.ebnf
- generate a Java-coded parser from it, using these command line options
-java -tree -main
- compile the result
javac CR_xquery_31_20151217.java
- run the attached XQuery files with BaseX or Saxon EE, and with the compiled parser classes in the classpath, e.g.:
java -cp BaseX.jar;. org.basex.BaseX parse-xquery.xq
java -cp saxon9ee.jar;. net.sf.saxon.Query parse-xquery.xq
(The semicolon must be replaced with a colon on Unix/Linux-based systems).
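If the classpath is assembled from Java code rather than typed by hand, the platform difference can be avoided with the JDK constant File.pathSeparator. A small sketch, with ClasspathSketch as a hypothetical helper name:

```java
import java.io.File;

class ClasspathSketch {
  // Joins classpath entries with the separator of the current platform
  // (";" on Windows, ":" on Unix/Linux-based systems).
  static String classpath(String... entries) {
    return String.join(File.pathSeparator, entries);
  }
}
```

For example, classpath("BaseX.jar", ".") yields the value passed to -cp in the first command above on whichever system the code runs.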
In BaseX, for simple inputs, the compiled tree will be available in 5-10 ms. I assume it could be even faster if some native BaseX code were embedded in the REx Parser Generator, but I don’t know how much effort this would be.
Hope this helps, feedback is welcome, Christian