Hi,
I performed join operations between many files and a dictionary. The files contain tokenized texts, where one finds word forms plus fine-grained POS tags. See, for example, the following file:
https://raw.githubusercontent.com/gcelano/POStaggedAncientGreekXML/master/texts/tlg0001.tlg001.perseus-grc2.xml
The dictionary, which contains word forms plus fine-grained POS tags plus lemmas, can be found here:
https://github.com/gcelano/LemmatizedAncientGreekXML/tree/master/uniqueTokens/values
I created a database for the dictionary and wrote a query (here simplified) like the following:
for $t in $s/t (: the t elements are the tokens in the file containing the tokens :)
let $match := $lemm//d[./p = $t/@o and ./f = $t/text()] (: $lemm//d are the single entries in the dictionary :)
return $match
If I use this query, it is slow, as if the processor could not use the database indexes for ./p and ./f. The situation does not improve with ./p/text() and ./f/text(), which I would assume to be equivalent to the former because of atomization. By contrast, if the information contained in ./p and ./f is merged into a single attribute (see @v in the dictionary files) and compared against the values in the text (after concatenating them accordingly), the join operation is very fast, i.e., BaseX uses the index for the attribute values.
Does anyone know why? I have been able to get my results via the (slow) comparison above, but I would like to know the cause of the problem, if possible. Thanks.
Best, Giuseppe
Universität Leipzig Institute of Computer Science, Digital Humanities Augustusplatz 10 04109 Leipzig Deutschland E-mail: celano@informatik.uni-leipzig.de E-mail: giuseppegacelano@gmail.com Web site 1: http://www.dh.uni-leipzig.de/wo/team/ Web site 2: https://sites.google.com/site/giuseppegacelano/
Hi Giuseppe,
It would be interesting to see how you declare $t and $lemm in your query, as this may influence the way your query is rewritten. Could you possibly attach a complete query that can be successfully parsed?
Thanks in advance, Christian
PS: Glad to see that sending mails to the list does work now.
On Thu, Jul 27, 2017 at 1:48 PM, Giuseppe Celano celano@informatik.uni-leipzig.de wrote:
Hi Christian,
These are the queries:
(: This works :)
declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
declare variable $lemm := db:open("splitted-db"); (: see link sent earlier :)
for $t in $txts//t
let $match := $lemm//d[./@v = $t/@o || "#" || $t/text()]
return $match
(: This does not work :)
declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
declare variable $lemm := db:open("splitted-db");
for $t in $txts//t
let $match := $lemm//d[./p = $t/@o and ./f = $t/text()]
return $match
Hi Giuseppe,
Thanks for the new query.
If you have a look at the query info, you will see that your query is in fact rewritten to take advantage of the index structures:
for $t_2 in document-node {"tlg0001.tlg001.perseus-grc2.xml"}/*:text/*:s/*:t
return db:text("splitted-db", $t_2/@*:o)/parent::*:p/parent::*:d[(*:f = $t_2/text())]
As your input document contains 45,667 tokens, however, 45,667 index lookups need to be performed, and this can take a while if the index results have low selectivity.
However, there’s a chance to speed up your query. You have two competing index candidates:
let $match := $lemm//d[./p = $t/@o and ./f = $t/text()]
As it is not possible to statically assess which one will be faster, the first candidate is rewritten to an index request. In your specific case, you get much better performance by moving the second comparison to the first place:
let $match := $lemm//d[./f = $t/text() and ./p = $t/@o]
Here is a short version of your query that takes around 10 seconds on my machine (it does not really matter if you move the tests into separate predicates):
declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
declare variable $lemm := db:open("splitted-db");
for $t in $txts//t
return $lemm//d[f = $t][p = $t/@o]
One obvious alternative (which we already discussed offline) is to store repeatedly accessed values in a map. This way, you can get evaluation times of less than a second.
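Such a map-based lookup could be sketched as follows. This is a rough sketch, not the definitive solution: it assumes that each dictionary entry d carries the @v attribute in the "POS#form" format used in the fast query above, and that map:merge's default handling of duplicate keys (use-first) is acceptable here.

```xquery
declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
declare variable $lemm := db:open("splitted-db");

(: build the lookup map once; afterwards, each token lookup is a cheap hash access :)
declare variable $dict := map:merge(
  for $d in $lemm//d
  return map:entry(string($d/@v), $d)
);

for $t in $txts//t
return $dict($t/@o || "#" || $t)
```

The crucial difference is that the dictionary is traversed only once to build the map; the loop over the tokens then performs constant-time lookups instead of 45,667 separate index requests.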
Hope this helps, Christian
On Thu, Jul 27, 2017 at 2:10 PM, Giuseppe Celano celano@informatik.uni-leipzig.de wrote:
Hi Christian,
Let's recapitulate:
If I compare values using just one indexed string (the one in @v), this is the fastest way (about one second on my machine).
If I compare against two distinct indexed values, their order matters, in that, if I understand correctly, the database uses the index only(?) for the first value.
I see that [p = $t/@o and f = $t] is much slower than [f = $t and p = $t/@o]. I calculated that, on average, f contains about 8 characters, while p always contains 9. However, the (Ancient Greek) characters in f are heavier (2 or 3 bytes each) than the (Latin) ones in p (1 byte each). Can this be the reason why [f = $t and p = $t/@o] is evaluated faster?
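One way to check whether it is rather selectivity (as mentioned earlier in the thread) than byte width that makes the difference would be to count the distinct values of f and p. A rough sketch, reusing the database name from the queries above:

```xquery
declare variable $lemm := db:open("splitted-db");

(: if f has many more distinct values than p, an f lookup is far more
   selective, i.e., it returns far fewer candidate entries per key :)
(
  count(distinct-values($lemm//d/f)),
  count(distinct-values($lemm//d/p))
)
```

A POS tag in p is presumably shared by many word forms, so a p index request would return many candidates that still have to be filtered by f, whereas an f request narrows the result down immediately.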
Best, Giuseppe
On 27 Jul 2017, at 14:31, Christian Grün christian.gruen@gmail.com wrote:
basex-talk@mailman.uni-konstanz.de