Hi,
I'm trying to understand why my BaseX application is slow.
I have a database looking like this:
<collection>
  <entry time="2012-03-04T17:43:29">
    <node>4119300</node>
    <query>[text() contains text ('Bank' ftand 'fallen') using stemming using language "de" distance at most 6 words ordered]</query>
    <person>marcel</person>
    <phraseme>Ad0032</phraseme>
    <selected>no</selected>
  </entry>
  <entry time="2012-03-04T17:43:29">
    <node>11150403</node>
    <query>[text() contains text ('Bank' ftand 'fallen') using stemming using language "de" distance at most 6 words ordered]</query>
    <person>marcel</person>
    <phraseme>Ad0032</phraseme>
    <selected>no</selected>
  </entry>
  <entry time="2012-03-04T17:43:29">
    <node>17335179</node>
    <query>[text() contains text ('Bank' ftand 'fallen') using stemming using language "de" distance at most 6 words ordered]</query>
    <person>marcel</person>
    <phraseme>Ad0032</phraseme>
    <selected>yes</selected>
  </entry>
</collection>
It consists of 97,500 entries stored in the database "collect"; one third have "yes" as the value of <selected>, the other two thirds have "no". The number of entries will probably double over time.
I use a CGI script to produce an HTML page that first lists the total number of "yes" entries and the total number of distinct phrasemes, and then lists all entries where <selected> is "yes", sorted by <phraseme>, in a table with 4 columns (phraseme, distinct persons, number of entries with this phraseme, link to another CGI Perl script). Additionally, after the table I show the last timestamp for an entry with <selected> "yes".
I use this for monitoring purposes, to track how the actual BaseX search application is being used.
I put the relevant CGI code at the bottom. It is not that complex, but it takes 80 to 90 seconds, which is much too slow! Skipping the timestamp information does not improve the speed.
Do you have an idea how to improve this? Is the slow processing due to badly constructed XQueries, due to rendering as HTML table, due to server issues (I have a virtual server, but I don't know who else is using it for what)?
my $session = Session->new("localhost", 1984, "admin", "admin");
$session->execute("open collect");

# total number of entries with <selected>yes</selected>
# (\$ keeps Perl from interpolating the XQuery variable)
my $evidencecount = $session->execute("xquery let \$results := //selected[text() = 'yes'] return <b>{count(\$results)}</b>");

# distinct phraseme IDs, split into a Perl array in order to count them
my @phrasemes = sort split(/\s+/, $session->execute("xquery distinct-values(//entry/phraseme/text())"));
$session->close;
my $phrasemecount = $#phrasemes + 1;

print "<p> <b>$phrasemecount</b> accessed phrasemes with a total of $evidencecount hits</p>";
print "<table>";
print "<tr><th>Phraseme-ID</th> <th>Person</th><th>Count</th></tr>";

# one table row per phraseme
# (single-quoted heredoc, so Perl leaves $ and @ in the XQuery untouched)
my $query = <<'EOF';
for $phraseme in distinct-values(//entry/phraseme)
let $nodes := //phraseme[text() = $phraseme]
let $count := count($nodes[../selected[text() = "yes"]])
let $person := distinct-values($nodes/../person)
order by $phraseme
return
<tr><td>({$phraseme})</td> <td>{$person}</td> <td>{$count}</td> <td><a href="basex-show-phraseme.pl?phraseme={$phraseme}">show</a></td></tr>
EOF

my $viewsession = Session->new("localhost", 1984, "admin", "admin");
$viewsession->execute("open collect");
my $xquery = $viewsession->query($query);
print $xquery->execute();
$xquery->close();
print "</table>";

# display last timestamp
my $timequery = <<'EOF';
let $i := //entry/@time
order by $i/@time ascending
return
<p>Last access: {data($i[last()])}</p>
EOF

my $xtimequery = $viewsession->query($timequery);
print $xtimequery->execute();
$xtimequery->close();
$viewsession->close();
Hi,
if you are just interested in the count of "yes" or "no", you could also try the function index:facets("db", "flat").
-- Andreas
On 01.10.2012 at 16:45, Mahlow Cerstin wrote:
-- Dr. phil. Cerstin Mahlow
Universität Basel Deutsches Seminar Nadelberg 4 4051 Basel Schweiz
Tel: +41 61 267 07 65 Fax: +41 61 267 34 40 Mail: cerstin.mahlow@unibas.ch Web: http://www.oldphras.net
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Hi Andreas,
On 01.10.2012 at 16:52, Andreas Weiler wrote:
> if you are just interested in the count of "yes" or "no", you could also try the function index:facets("db", "flat").
This returns a document-node structure. How do I access the information I am interested in?
Best regards
Cerstin
Hi Cerstin,
you can just use it as input for further XQuery expressions, like this:
index:facets("db", "flat")//element[@name = "selected"]/entry[text() = "yes"]/@count/data()
This gives you the total number of entries with "yes".
-- Andreas
On 01.10.2012 at 17:16, Mahlow Cerstin wrote:
Hi Andreas,
On 01.10.2012 at 17:25, Andreas Weiler wrote:
> you can just use it as input for further xquery queries, like this:
> index:facets("db", "flat")//element[@name = "selected"]/entry[text() = "yes"]/@count/data()
> so you get the total number of entries with "yes".
Ah, thanks! Now it works. But it has almost no effect on the overall speed.
Best regards
Cerstin
Hi Cerstin,
can you run each query from the script individually in the GUI and see how long each one takes?
Why are you creating a new session for each query? You should be able to use the same session for all queries.
-- Andreas
On 01.10.2012 at 17:40, Mahlow Cerstin wrote:
Hi Andreas,
Using only one session saves a few seconds.
When I run the queries in the GUI on the same server, I get this:
(total number of positive hits)
index:facets("collect", "flat")//element[@name = "selected"]/entry[text() = "yes"]/@count/data()
takes 2 to 4 ms
(number of phrasemes searched)
count(distinct-values(//entry/phraseme/text()))
takes 370 to 500 ms
(create info table)
for $phraseme in distinct-values(//entry/phraseme)
let $nodes := //phraseme[text() = $phraseme]
let $count := count($nodes[../selected[text() = "yes"]])
let $person := distinct-values($nodes/../person)
order by $phraseme
return
<tr><td>({$phraseme})</td> <td>{$person}</td> <td>{$count}</td><td><a href="basex-show-phraseme.pl?phraseme={$phraseme}">anzeigen/aussortieren</a></td></tr>
takes 1000 to 1400 ms
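The table query looks up //phraseme[text() = $phraseme] once per distinct phraseme, so unless a text index is applied, its cost may grow with both the number of entries and the number of phrasemes. A possible single-pass variant is sketched below. It assumes a BaseX version that supports the XQuery 3.0 "group by" clause, and it has not been timed against this database.

```xquery
(: Sketch only: groups all entries by phraseme in one pass.
   Assumes XQuery 3.0 "group by" is available. :)
for $entry in //entry
let $phraseme := string($entry/phraseme)
group by $phraseme
order by $phraseme
return
  <tr>
    <td>({$phraseme})</td>
    <td>{distinct-values($entry/person)}</td>
    <td>{count($entry[selected = 'yes'])}</td>
    <td><a href="basex-show-phraseme.pl?phraseme={$phraseme}">show</a></td>
  </tr>
```

After the group by clause, $entry holds all entries of the current group, so the person list and the "yes" count are computed per phraseme without another scan of the database.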
(last timestamp)
let $i := //entry/@time
order by $i/@time ascending
return <p>Letzte Bearbeitung: {data($i[last()])}</p>
takes 250 to 380 ms
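In the timestamp query, $i is bound by "let" to the whole sequence of attributes, so the "order by" clause has only one tuple to sort, and $i[last()] is simply the last time attribute in document order. If the latest timestamp among the "yes" entries is what is wanted, a sketch could look like this. It assumes that every time value is a valid xs:dateTime string, as in the sample data.

```xquery
(: Sketch only: latest timestamp among entries with selected = 'yes'.
   Assumes every @time value can be cast to xs:dateTime. :)
let $times :=
  for $t in //entry[selected = 'yes']/@time
  return xs:dateTime($t)
return <p>Last access: {max($times)}</p>
```

If no entry has "yes", max() returns the empty sequence and the paragraph is printed without a timestamp.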
However, I just switched to using count() for the number of phrasemes accessed. Before, I took the distinct values, split them into an array, and then used the number of indices, and this probably took a lot of time. Using count() and dropping the splitting results in the page showing up in 2 to 3 seconds. Perfect!
Thanks for helping! I will probably soon ask for help with another slow process :-)
Best regards
Cerstin
________________________________________
From: Andreas Weiler [andreas.weiler@uni-konstanz.de]
Sent: Tuesday, 2 October 2012 10:34
To: Cerstin Elisabeth Mahlow
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] slow processing