Hi,
I performed join operations between many files and a dictionary. The files contain tokenized texts, where one finds word forms plus fine-grained POS tags. See, for example, the following file:
https://raw.githubusercontent.com/gcelano/POStaggedAncientGreekXML/master/texts/tlg0001.tlg001.perseus-grc2.xml
The dictionary, which contains word forms plus fine-grained POS tags plus lemmas, can be found here:
https://github.com/gcelano/LemmatizedAncientGreekXML/tree/master/uniqueTokens/values
I created a database for the dictionary and wrote a query (here simplified) like the following:
for $t in $s/t (: the t elements are the tokens in the file containing the tokens :)
let $match := $lemm//d[./p = $t/@o and ./f = $t/text()] (: $lemm//d are the single entries in the dictionary :)
return $match
If I use this query, it is slow, as if the processor could not use the database indexes for ./p and ./f. The situation does not improve with ./p/text() and ./f/text(), which I would assume to be equivalent to the former because of atomization. By contrast, if the information contained in ./p and ./f is merged into a single attribute (see @v in the dictionary files) and compared against the values in the text (after concatenating them accordingly), the join operation is very fast, i.e., BaseX uses the index for the attribute values.
Does anyone know why? I have been able to get my results via the (slow) comparison above, but I would like to know the cause of the problem, if possible. Thanks.
Best, Giuseppe
Universität Leipzig Institute of Computer Science, Digital Humanities Augustusplatz 10 04109 Leipzig Deutschland E-mail: celano@informatik.uni-leipzig.de E-mail: giuseppegacelano@gmail.com Web site 1: http://www.dh.uni-leipzig.de/wo/team/ Web site 2: https://sites.google.com/site/giuseppegacelano/
Hi Giuseppe,
It would be interesting to see how you declare $t and $lemm in your query, as this may influence the way your query is rewritten. Could you possibly attach a complete query that can be successfully parsed?
Thanks in advance, Christian
PS: Glad to see that sending mails to the list does work now.
On Thu, Jul 27, 2017 at 1:48 PM, Giuseppe Celano celano@informatik.uni-leipzig.de wrote:
Hi Christian,
These are the queries:
(: This works :)
declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
declare variable $lemm := db:open("splitted-db"); (: see link sent earlier :)
for $t in $txts//t
let $match := $lemm//d[./@v = $t/@o || "#" || $t/text()]
return $match
(: This does not work :)
declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
declare variable $lemm := db:open("splitted-db");
for $t in $txts//t
let $match := $lemm//d[./p = $t/@o and ./f = $t/text()]
return $match
Hi Giuseppe,
Thanks for the new query.
If you have a look at the query info, you will see that your query is in fact rewritten to take advantage of the index structures:
for $t_2 in document-node {"tlg0001.tlg001.perseus-grc2.xml"}/*:text/*:s/*:t
return db:text("splitted-db", $t_2/@*:o)/parent::*:p/parent::*:d[(*:f = $t_2/text())]
As your input document contains 45,667 tokens, however, 45,667 index lookups need to be performed, and this can take a while if the index results have low selectivity.
However, there’s a chance to speed up your query. You have two competing index candidates:
let $match := $lemm//d[./p = $t/@o and ./f = $t/text()]
As it is not possible to statically assess which one will be faster, the first candidate is rewritten to an index request. In your specific case, you get much better performance by moving the second comparison to the first place:
let $match := $lemm//d[./f = $t/text() and ./p = $t/@o]
Here is a short version of your query that takes around 10 seconds on my machine (it does not really matter if you move the tests into separate predicates):
declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
declare variable $lemm := db:open("splitted-db");
for $t in $txts//t
return $lemm//d[f = $t][p = $t/@o]
One obvious alternative (which we already discussed offline) is to store repeatedly accessed values in a map. This way, you can get evaluation times of less than a second.
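Such a map-based lookup could be sketched as follows. This is a rough sketch, not the definitive solution: it assumes that each dictionary entry d carries the @v attribute in the "POS#form" format used in the fast query above, and that map:merge's default handling of duplicate keys (use-first) is acceptable here.

```xquery
declare variable $txts := doc("tlg0001.tlg001.perseus-grc2.xml");
declare variable $lemm := db:open("splitted-db");

(: build the lookup map once; afterwards, each token lookup is a cheap hash access :)
declare variable $dict := map:merge(
  for $d in $lemm//d
  return map:entry(string($d/@v), $d)
);

for $t in $txts//t
return $dict($t/@o || "#" || $t)
```

The crucial difference is that the dictionary is traversed only once to build the map; the loop over the tokens then performs constant-time lookups instead of 45,667 separate index requests.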
Hope this helps, Christian
On Thu, Jul 27, 2017 at 2:10 PM, Giuseppe Celano celano@informatik.uni-leipzig.de wrote:
Hi Christian,
Let's recapitulate:
If I compare values using just one indexed string (the one in @v), this is the fastest way (about one second on my machine).
If I compare against two distinct indexed values, their order matters, in that, if I understand correctly, the database uses the index only(?) for the first value.
I see that [p = $t/@o and f = $t] is much slower than [f = $t and p = $t/@o]. I calculated that, on average, f contains about 8 characters, while p always contains 9. However, the (Ancient Greek) characters in f are heavier (2 or 3 bytes each) than the (Latin) ones in p (1 byte each). Can this be the reason why [f = $t and p = $t/@o] is evaluated faster?
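One way to check whether it is rather selectivity (as mentioned earlier in the thread) than byte width that makes the difference would be to count the distinct values of f and p. A rough sketch, reusing the database name from the queries above:

```xquery
declare variable $lemm := db:open("splitted-db");

(: if f has many more distinct values than p, an f lookup is far more
   selective, i.e., it returns far fewer candidate entries per key :)
(
  count(distinct-values($lemm//d/f)),
  count(distinct-values($lemm//d/p))
)
```

A POS tag in p is presumably shared by many word forms, so a p index request would return many candidates that still have to be filtered by f, whereas an f request narrows the result down immediately.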
Best, Giuseppe
On 27 Jul 2017, at 14:31, Christian Grün christian.gruen@gmail.com wrote:
basex-talk@mailman.uni-konstanz.de