We are testing basex to use with a huge database, and we have a performance problem.
At the moment we have two small databases with thousands of documents to test:
· Authors, 2.000 documents, 2MBs aprox.
· Books (3 per author) with 6.000 documents, 5MBs aprox.
We need join and group results, and this takes a long time:
Query: <result>{for $Author in (db:open('Authors')//businessEvent), $Book in (db:open('Books')//businessEvent) for $income in distinct-values($Author//author/country) where $Author//author/idauthor=$Book//book/idauthor group by $income return <line><label>{ $income }</label><data>count($Book )</data></line>}</result>
Compiling:
- rewriting where clause to predicate(s)
- pre-evaluating db:open("Authors")
- simplifying descendant-or-self step(s)
- pre-evaluating db:open("Books")
- simplifying descendant-or-self step(s)
- simplifying descendant-or-self step(s)
- swapping operands: descendant::*:book/*:idauthor = $Author/descendant::author/idauthor
- simplifying descendant-or-self step(s)
- simplifying descendant-or-self step(s)
Result: element result { for $Author in (document-node { "1" }, document-node { "2" }, ...)/descendant::*:businessEvent for $Book in (document-node { "1-1" }, document-node { "1-2" }, ...)/descendant::*:businessEvent[descendant::*:book/*:idauthor = $Author/descendant::author/idauthor] for $income in fn:distinct-values($Author/descendant::author/country) group by $income := $income return element line { element label { $income }, element data { "count($Book )" } } }
Timing:
- Parsing: 1.01 ms
- Compiling: 3.99 ms
- Evaluating: 100725.38 ms
- Printing: 0.37 ms
- Total Time: 100730.77 ms
We have no idea why it’s taking so long time to show the results.
If we don’t group results it takes a long time too.
Regards,
Nuria
Hi Nuria,
Am 18.02.2013 12:10, schrieb Nuria Garcia:
Compiling:
- rewriting where clause to predicate(s)
- pre-evaluating db:open("Authors")
- simplifying descendant-or-self step(s)
- pre-evaluating db:open("Books")
- simplifying descendant-or-self step(s)
- simplifying descendant-or-self step(s)
- swapping operands: descendant::*:book/*:idauthor = $Author/descendant::author/idauthor
- simplifying descendant-or-self step(s)
- simplifying descendant-or-self step(s)
It seems that the predicate into which the `where` clause is rewritten is not converted into an index access. Nothing in the query should cause this, so the problem seems to be (one of) the databases.
Are you sure that the text index of the "Books" database is up-to-date? You can check that in the GUI under "Database -> Open & Manage..." or with the command `INFO INDEX TEXT` [1]. If it is outdated, you have to update it via "Database -> Properties... -> Optimize..." in the GUI or the `OPTIMIZE` command [2]. You can also recreate the database with incremental updates activated via the `OPDINDEX` option [3], but this only affects the text and attribute indices.
Hope this helps, cheers, Leo
[1] http://docs.basex.org/wiki/Commands#INFO_INDEX [2] http://docs.basex.org/wiki/Commands#OPTIMIZE [3] http://docs.basex.org/wiki/Options#UPDINDEX
Thanks for your quick answer. Query execution improves its timing a lot.
One more question, in a similar query but with attributes, it looks like the index doesn´t get applied (more similar to our real situation):
XML is similar to:
<businessEvent> <inputs> <input name="data"> <input element="229388252" name="id" type="String"/> <input element="2751 XTE" name="matriculation" type="String"/> <input element="XTETVTTJKC" name="clientId" type="String"/> <input name="driver"> <input element="driver name" name="name" type="String"/> <input element="driver surname" name="surname" type="String"/> <input element="21973887X" name="nif" type="String"/> <input element="line 1" name="line1" type="String"/> <input element="Line 2" name="line2" type="String"/> <input element="City name" name="city" type="String"/> <input element="28197" name="cp" type="String"/> <input element="Region7" name="region" type="String"/> <input element="Country" name="country" type="String"/> </input> <input name="accident"> <input name="date"> <input element="2012" name="year" type="String"/> <input element="13" name="month" type="String"/> <input element="22" name="day" type="String"/> <input element="2" name="hour" type="String"/> <input element="53" name="minute" type="String"/> </input> <input name="address"> <input element="Avenue 5" name="line1" type="String"/> <input element="Line2" name="line2" type="String"/> <input element="Madrid" name="city" type="String"/> <input element="28224" name="cp" type="String"/> <input element="Region3" name="region" type="String"/> <input element="Spain" name="country" type="String"/> </input> <input element="Lorem ipsum dolor …." name="declaration" type="String"/> </input> </input> </inputs> </businessEvent>
Execution:
Query: <result>{ for $col1 in (db:open('bbdd_1')//businessEvent ), $col2 in (db:open('bbdd2')//businessEvent) for $income in distinct-values($col2//inputs/input[@name='data']/input[@name='accident']/input[@name='address']/input[@name='region']/data(@element)) where $col1//inputs/input[@name='data']/input[@name='id']/data(@element)=$col2//inputs/input[@name='data']/input[@name='id']/data(@element) group by $income return <line><label>{ $income }</label><data>{ count($col2) }</data></line>}</result> Compiling: - rewriting where clause to predicate(s) - pre-evaluating db:open("bbdd_1") - simplifying descendant-or-self step(s) - converting descendant::*:businessEvent to child steps - pre-evaluating db:open("bbdd_2") - simplifying descendant-or-self step(s) - simplifying descendant-or-self step(s) - swapping operands: descendant::*:inputs/*:input[@*:name = "data"]/*:input[@*:name = "id"]/fn:data(@*:element) = $col1/descendant::inputs/input[@name = "data"]/input[@name = "id"]/fn:data(@element) - simplifying descendant-or-self step(s) - converting descendant::*:businessEvent[descendant::*:inputs/*:input[@*:name = "data"]/*:input[@*:name = "id"]/fn:data(@*:element) = $col1/descendant::inputs/input[@name = "data"]/input[@name = "id"]/fn:data(@element)] to child steps - simplifying descendant-or-self step(s) Result: element result { for $col1 in (document-node { "1361200546116" }, ...)/*:businessEvent for $col2 in (document-node { "1361200600857" }, ...)/*:businessEvent[descendant::*:inputs/*:input[@*:name = "data"]/*:input[@*:name = "id"]/fn:data(@*:element) = $col1/descendant::inputs/input[@name = "data"]/input[@name = "id"]/fn:data(@element)] for $income in fn:distinct-values($col2/descendant::inputs/input[@name = "data"]/input[@name = "accident"]/input[@name = "address"]/input[@name = "region"]/fn:data(@element)) group by $income := $income return element line { element label { $income }, element data { fn:count($col2) } } }
Timing: - Parsing: 1.24 ms - Compiling: 7.44 ms - Evaluating: 36090.05 ms - Printing: 0.64 ms - Total Time: 36099.37 ms
Is it possible that the index doesn’t get applied?, how could I improve this query?
Date: Mon, 18 Feb 2013 13:35:56 +0100 From: lw@basex.org To: narayola@hotmail.com CC: basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Performance problem
Hi Nuria,
Am 18.02.2013 12:10, schrieb Nuria Garcia:
Compiling:
- rewriting where clause to predicate(s)
- pre-evaluating db:open("Authors")
- simplifying descendant-or-self step(s)
- pre-evaluating db:open("Books")
- simplifying descendant-or-self step(s)
- simplifying descendant-or-self step(s)
- swapping operands: descendant::*:book/*:idauthor = $Author/descendant::author/idauthor
- simplifying descendant-or-self step(s)
- simplifying descendant-or-self step(s)
It seems that the predicate into which the `where` clause is rewritten is not converted into an index access. Nothing in the query should cause this, so the problem seems to be (one of) the databases.
Are you sure that the text index of the "Books" database is up-to-date? You can check that in the GUI under "Database -> Open & Manage..." or with the command `INFO INDEX TEXT` [1]. If it is outdated, you have to update it via "Database -> Properties... -> Optimize..." in the GUI or the `OPTIMIZE` command [2]. You can also recreate the database with incremental updates activated via the `OPDINDEX` option [3], but this only affects the text and attribute indices.
Hope this helps, cheers, Leo
[1] http://docs.basex.org/wiki/Commands#INFO_INDEX [2] http://docs.basex.org/wiki/Commands#OPTIMIZE [3] http://docs.basex.org/wiki/Options#UPDINDEX
Dear Nuria,
Am 18.02.2013 16:39, schrieb Nuria Garcia:
Thanks for your quick answer.
sorry for taking so long this time...
One more question, in a similar query but with attributes, it looks like the index doesn´t get applied (more similar to our real situation):
Compiling:
- rewriting where clause to predicate(s)
- pre-evaluating db:open("bbdd_1")
- simplifying descendant-or-self step(s)
- converting descendant::*:businessEvent to child steps
- pre-evaluating db:open("bbdd_2")
- simplifying descendant-or-self step(s)
- simplifying descendant-or-self step(s)
- swapping operands: descendant::*:inputs/.../fn:data(@*:element) = $col1/descendant::inputs/.../fn:data(@element)
- simplifying descendant-or-self step(s)
- converting descendant::*:businessEvent[...] to child steps
- simplifying descendant-or-self step(s)
Is it possible that the index doesn’t get applied?, how could I improve this query?
Yes, the index is not applied, otherwise there would be the line "applying attribute index" in the compilation info. That is because of the calls to `fn:data()` around the attribute steps in the `where` clause. These are unnecessary and hide the attribute comparisons from the optimizer. Removing them triggers the optimization in my tests.
Hope that helps, cheers, Leo
BTW: The XML example you posted looks quite "enterprisey" and ignores quite some XML best-practices. Its structure could easily confuse the query optimizer because of all the attribute/text comparisons. If you have any influence on the structure of your data, it would probably be beneficial to encode the `@name` attributes as element names, which are currently all the same anyways. The result could look something like this:
<data> <id>229388252</id> <matriculation>2751 XTE</matriculation> <clientId>XTETVTTJKC</clientId> <driver> <name>driver name</name> <surname>driver surname</surname> <nif>21973887X</nif> <line1>line 1</line1> <line2>Line 2</line2> <city>City name</city> <cp>28197</cp> <region>Region7</region> <country>Country</country> </driver> <accident> <date> <year>2012</year> <month>13</month> <day>22</day> <hour>2</hour> <minute>53</minute> </date> <address> <line1>Avenue 5</line1> <line2>Line2</line2> <city>Madrid</city> <cp>28224</cp> <region>Region3</region> <country>Spain</country> </address> <declaration>Lorem ipsum dolor ....</declaration> </accident> </data>
basex-talk@mailman.uni-konstanz.de