tf/idf

Wiard Vasen

30 Mar 2011

4:02 p.m.

Dear BaseX team,

I am writing my Master's thesis on the letters of Vincent van Gogh at the University of Amsterdam, and I use BaseX to analyze the letters.

Is it possible to generate a tf/idf score automatically? In your FAQ I noticed that a special command such as 'SET SCORING 0' is needed to get a tf/idf score.

I found this information on the following page: http://docs.basex.org/wiki/Full-Text

Could you help me with this?

I would be very grateful.

Kind regards,

Andreas Weiler

31 Mar 2011

2:33 a.m.

Dear Wiard Vasen,

you just need to set the SCORING property once. If you work in the GUI, go to the top input bar, choose "Command" and type:

set scoring *

replacing * with the scoring algorithm you want. In the console, just type the same command:

set scoring *

After setting this, you can use scoring in your queries, as in the 8th query of our online demo (basex.org/products/live-demo):

let $names := ('Jack London', 'Jack', 'Jim Beam', 'Jack Daniels')
for $name at $pos score $score in $names[. contains text 'Jack']
order by $score descending
return <name pos='{ $pos }'>{ $name }</name>

Don't hesitate to ask if you have more questions,
Andreas

_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

Christian Grün

5:46 a.m.

Hi Wiard,

the tf/idf scoring is only available if you are working with full-text index structures. If you have built a full-text index for your database "DB", the following query will yield different scoring results, depending on the chosen scoring model:

ft:score(db:open("DB")//*[text() contains text 'xml'])

As Andreas indicated, however, you may either set the SCORING property with a db command or explicitly choose the type of scoring in the GUI's database creation dialog (Database -> New -> Fulltext -> TF/IDF Scoring).

Christian
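For readers unfamiliar with the measure, here is a minimal sketch of the textbook tf/idf computation in plain XQuery, run over three made-up in-memory documents. It is not BaseX's internal scoring formula, which may weight and normalize differently. The sample documents, the element names and the search term are assumptions for illustration only, and the sketch assumes a processor that supports the XPath 3.0 math functions.

```xquery
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Hypothetical sample documents standing in for the letters. :)
let $docs := (
  <doc id="a">kleur en licht en kleur</doc>,
  <doc id="b">licht en schaduw</doc>,
  <doc id="c">een brief aan Theo</doc>
)
let $term := 'kleur'
(: Document frequency: number of documents that contain the term.
   The sketch assumes the term occurs in at least one document. :)
let $df := count($docs[tokenize(lower-case(.), '\W+') = $term])
(: Inverse document frequency, using the natural logarithm. :)
let $idf := math:log(count($docs) div $df)
for $doc in $docs
let $tokens := tokenize(lower-case($doc), '\W+')[. ne '']
(: Term frequency: share of the document's tokens equal to the term. :)
let $tf := count($tokens[. = $term]) div count($tokens)
order by $tf * $idf descending
return <hit id="{ $doc/@id }" tfidf="{ $tf * $idf }"/>
```

Only document "a" contains the term, so it is the only hit with a non-zero value.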

Wiard Vasen

3 Apr 2011

6:42 a.m.

Dear Christian and Andreas,

Thanks for your great help! I used Christian's solution:

ft:score(db:open("DB")//*[text() contains text 'xml'])

and it works fine.

As a next step, I would like to get the documents associated with these scores.

Could you help me with this step?

The result I get now is a list of frequencies, without references to the particular documents. I think what I need is the tf/idf score.

Regards,

Wiard

Andreas Weiler

8 a.m.

Hi Wiard,

you could use XQuery's base-uri function, like this (there is probably a simpler way):

for $i at $pos in db:open("DB")//*
where $i[text() contains text 'xml']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'xml'])}</score></hit>

-- Andreas
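One possible simplification is the score clause from Andreas's first reply, which binds the score while iterating, so the full-text predicate is written only once. This is a sketch under the assumptions stated earlier in the thread (a BaseX database named "DB" with a full-text index). It has not been run against the poster's data, and whether the bound score matches the value returned by ft:score exactly is an assumption.

```xquery
(: Assumes a BaseX database "DB" with a full-text index, as in the thread. :)
for $hit score $score in db:open("DB")//*[text() contains text 'xml']
order by $score descending
return <hit>{ base-uri($hit) }<score>{ $score }</score></hit>
```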

Wiard Vasen

8:24 a.m.

Hi Andreas,

Wow! This is the complete answer to my question!

I hope you can help me with my next question. Because I am analyzing changes in the artistic life of Van Gogh, I am partitioning the relatively large repository of annotated XML files by place of residence.

For that reason I need a query like:

for $i at $pos in db:open("tfidfbrievenvangogh")//*
where $i[text() contains text 'kleur']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'kleur'])}</score></hit>

extended so that it covers an interval: all XML files between let567.xml and let689.xml. I know that Van Gogh was in Arles when he wrote the letters in this partition, and I want to know the tf/idf score of the Dutch word 'kleur'.

To summarize my question: how do I partition the repository into subsets so that I can produce information on each subset, and how do I do this in BaseX with XQuery?

Thanks a lot in advance!

Kind regards,

Wiard

Andreas Weiler

10:20 a.m.

Hi Wiard,

I hope I understand your plans; here is what I would do:

for $n in ("let567.xml", "let689.xml")
return
  for $i at $pos in db:open("tfidfbrievenvangogh")//*
  where ends-with(base-uri($i), $n) and $i[text() contains text 'kleur']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'kleur'])}</score></hit>

You can now extend the sequence bound to $n with all the file names you want to include.

I hope this helps,
Andreas
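A variant that avoids scanning the database once per file name is to compare the last path segment of each base URI against the whole list in a single general comparison. The sketch below shows only that selection step on made-up URI strings so that it runs without a database; the paths and the variable name $wanted are hypothetical.

```xquery
(: File names to keep; extend this sequence as needed. :)
let $wanted := ("let567.xml", "let689.xml")
(: Hypothetical base URIs, roughly as base-uri() might return them. :)
let $uris := (
  "tfidfbrievenvangogh/let566.xml",
  "tfidfbrievenvangogh/let567.xml",
  "tfidfbrievenvangogh/let689.xml"
)
for $uri in $uris
(: "=" is existential: true if the file name equals any item of $wanted. :)
where tokenize($uri, '/')[last()] = $wanted
return $uri
```

In the database query, the corresponding condition would be tokenize(base-uri($i), '/')[last()] = $wanted in place of the ends-with test.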

Wiard Vasen

12:14 p.m.

Hi Andreas,

Thanks a lot! It works fine.

I was wondering whether, instead of entering the following query in BaseX:

for $n in ("let680.xml", "let681.xml", "let682.xml", "let683.xml", "let684.xml", "let685.xml", "let686.xml", "let687.xml", "let688.xml", "let689.xml")
return
  for $i at $pos in db:open("tfidfbrievenvangogh")//*
  where ends-with(base-uri($i), $n) and $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

it is also possible to do something like:

for ("let680.xml") <= $n <= ("let689.xml")
return
  for $i at $pos in db:open("tfidfbrievenvangogh")//*
  where ends-with(base-uri($i), $n) and $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

That way I hope to specify the first and last documents of a subset and get all the documents in between, including the two boundary documents.

Do you think this is possible in a query like the one shown above?

Regards,

Wiard

Andreas Weiler

12:24 p.m.

Hi Wiard,

you can get the positions of the first and last documents you want with the following query (add a where clause to select the right documents, e.g. where ends-with(base-uri($n), "test.doc")):

for $n at $pos in db:open("tfidfbrievenvangogh")
return <hit><doc>{base-uri($n)}</doc><pos>{$pos}</pos></hit>

Then substitute these two integer positions for x and y in the query below; >= and <= keep the first and last documents in the result.

for $n at $pos in db:open("tfidfbrievenvangogh")
where $pos >= x and $pos <= y
return
  for $i in $n//*
  where $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

I hope this works.

-- Andreas

Wiard Vasen

12:44 p.m.

Hi Andreas,

Maybe I don't understand the query you suggested. I worked it out this way:

for $n at $pos in db:open("tfidfbrievenvangogh")
where $pos > ("let001.xml") and $pos < ("let201.xml")
return
  for $i in $n//*
  where $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

I do understand the error, though: xs:integer and xs:string can't be compared.

How can I change this query so that it works?

Regards,

Wiard
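The type error arises because $pos is an integer position, while "let001.xml" is a string. One possible way around it, sketched here on made-up file names so that it runs without a database, is to extract the number from each file name and compare that as an integer. The naming pattern letNNN.xml is taken from the thread; the variables $first and $last and the sample names are assumptions for illustration.

```xquery
let $first := 1    (: number of the first letter to include :)
let $last  := 201  (: number of the last letter to include :)
(: Hypothetical file names following the letNNN.xml pattern. :)
let $names := ("let001.xml", "let150.xml", "let201.xml", "let202.xml")
(: Keep only names of the form letNNN.xml, then compare NNN numerically. :)
for $name in $names[matches(., '^let\d+\.xml$')]
let $num := xs:integer(replace($name, '^let(\d+)\.xml$', '$1'))
where $num >= $first and $num <= $last
return $name
```

This returns let001.xml, let150.xml and let201.xml. In the database query, the same extraction could presumably be applied to the last path segment of base-uri($n) instead of $name.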

2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de

...

Hi Wiard,

you can get the position of your wished first and last document with: (add a where clause to get the right documents, like where ends-with(base-uri($n), "test.doc"))

for $n at $pos in db:open("tfidfbrievenvangogh") return <hit><doc>{base-uri($i)</doc><pos>{$pos}</pos></hit>

then set these position for x and y in the below query.

for $n at $pos in db:open("tfidfbrievenvangogh") where $pos > x and $pos < y return for $i in $n//* where $i[text() contains text 'above'] return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

I hope this works.

-- Andreas

Am 03.04.2011 um 18:14 schrieb Wiard Vasen:

Hi Andreas,

Thanks a lot! It works fine.

I was wondering if instead of putting in the next query in BaseX:

for $n in ("let680.xml", "let681.xml", "let682.xml", "let683.xml", "let684.xml", "let685.xml", "let686.xml", "let687.xml", "let688.xml", "let689.xml") return for $i at $pos in db:open("tfidfbrievenvangogh")//* where ends-with(base-uri($i), $n) and $i[text() contains text 'above'] return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

It is also possible to do something like:

for ("let680.xml" )<= $n <= ("let689") return for $i at $pos in db:open("tfidfbrievenvangogh")//* where ends-with(base-uri($i), $n) and $i[text() contains text 'above'] return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

That way I hope to define the outer documents of a subset and get all the documents in between, with the outer documents included.

Do you think this is possible in a query like shown above?

Regards,

Wiard
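
XQuery has no range syntax over strings like the one sketched above, but a range over integers can be turned into file names. A minimal sketch, assuming the letNNN.xml naming pattern and an XQuery 3.0 processor (for format-number); the variable names are illustrative:

```xquery
(: Build the file names for an inclusive range of letter numbers.
   $first and $last are illustrative names, not part of the thread. :)
let $first := 680
let $last  := 689
let $names :=
  for $k in $first to $last
  (: zero-pad to three digits so that 7 would become "let007.xml" :)
  return concat("let", format-number($k, "000"), ".xml")
return $names
```

The resulting sequence includes both outer documents and could take the place of the hand-written list bound to $n in the first query.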

2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de

Hi Wiard,

I hope I understand your plans. Here is what I would do:

for $n in ("let567.xml", "let689.xml") return for $i at $pos in db:open("tfidfbrievenvangogh")//* where ends-with(base-uri($i), $n) and $i[text() contains text 'kleur'] return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'kleur'])}</score></hit>

You can now extend the sequence bound to $n with all the file names you want to include.

I hope this helps, Andreas

On 03.04.2011 at 14:24, Wiard Vasen wrote:

Hi Andreas,

Wow! This is the complete answer to my question!

I hope you can help me with the next question. Because I am analyzing changes in the artistic life of Van Gogh, I am partitioning the relatively large repository of annotated XML files on the basis of his place of residence.

For that reason I need to run a query like:

for $i at $pos in db:open("tfidfbrievenvangogh")//* where $i[text() contains text 'kleur'] return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'kleur'])}</score></hit>

with one extension: restrict it to the interval of all XML files between let567.xml and let689.xml. I know that Van Gogh was in Arles when he wrote the letters in this partition, and I want to know the tf-idf score of the Dutch word 'kleur'.

To summarize my question: how do I partition the repository into subsets so that I can produce information about each subset, and how do I do this in BaseX with XQuery?

Thanks a lot beforehand!

Kind regards,

Wiard

2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de

Hi Wiard,

you could use XQuery's base-uri function, like this (there is probably an easier way):

for $i at $pos in db:open("DB")//* where $i[text() contains text 'xml'] return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'xml'])}</score></hit>

-- Andreas

On 03.04.2011 at 12:42, Wiard Vasen wrote:

Dear Christian and Andreas,

Thanks for your great help! I used Christian's solution, ft:score(db:open("DB")//*[text() contains text 'xml']), and it works fine.

The next step is that I want to get the documents associated with these scores.

Could you help me with this step?

The result I get now is a list of frequencies, without references to the particular documents. I think what is needed is the tf/idf score.

Regards,

Wiard

2011/3/31 Christian Grün christian.gruen@gmail.com

Hi Wiard,

The tf/idf scoring is only available if you are working with full-text index structures. If you have built a full-text index for your database "DB", the following query will yield different scoring results depending on the chosen scoring model:

ft:score(db:open("DB")//*[text() contains text 'xml'])

As Andreas indicated, however, you may either set the SCORING property with a db command or explicitly choose the type of scoring in the GUI's database creation dialog (Database -> New -> Fulltext -> TF/IDF Scoring).

Christian


Andreas Weiler

1:01 p.m.

What I meant was that you first get the positions of the first and last documents in your range (first query), note them, and then substitute them for x and y in the second query.

-- Andreas
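
The two manual steps can also be combined so that x and y are looked up rather than copied by hand. A minimal sketch on an inline list of names, which stands in for the documents of the database (the names and the variables $x and $y are illustrative, and each name is assumed to occur exactly once):

```xquery
(: Hypothetical stand-in for the documents returned by the database. :)
let $docs := ("let001.xml", "let002.xml", "let003.xml", "let004.xml", "let005.xml")
(: step 1: look up the positions of the two outer documents :)
let $x := index-of($docs, "let002.xml")
let $y := index-of($docs, "let004.xml")
(: step 2: keep the documents whose position lies strictly between them :)
for $d at $pos in $docs
where $pos > $x and $pos < $y
return $d
```

With > and < the outer documents themselves are excluded, as in the second query; >= and <= would include them.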


Wiard Vasen

1:22 p.m.

Thank you very much Andreas,

You, Christian and Leonard have helped me a lot!

Have a nice evening!

Regards,

Wiard


Wiard Vasen

5 Apr

8:43 a.m.

Hi Christian, Andreas and Leonard,

Last week you helped me with a query in BaseX. The specific query is below:

let $range := 1 to 201 for $doc in collection('tfidfbrievenvangogh') let $uri := base-uri($doc), $num := substring($uri, string-length($uri) - 6, 3) where $num castable as xs:integer and xs:integer($num) = $range return <document uri='{$uri}'>{ for $n score $s in $doc//*[text() contains text 'above'] return <hit score='{$s}'>{ $n }</hit> }</document>

My question is: do the scores returned represent tf/idf?

Kind regards,

Wiard
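
For readers unfamiliar with the selection logic in this query, the sketch below isolates it on a few made-up URIs: the three characters before ".xml" are extracted, checked for being a number, and compared against the range. The URIs and the check element are hypothetical. The guard is written with if/then/else because the XQuery specification leaves the evaluation order of the operands of "and" to the implementation, so this is the more defensive form:

```xquery
let $range := 1 to 201
(: made-up URIs; the last one has no number before ".xml" :)
for $uri in ("db/let001.xml", "db/let201.xml", "db/let202.xml", "db/notes.xml")
(: the three characters directly before the ".xml" extension :)
let $num := substring($uri, string-length($uri) - 6, 3)
(: "=" against a sequence is true if any member of the range matches :)
let $selected :=
  if ($num castable as xs:integer)
  then xs:integer($num) = $range
  else false()
return <check uri="{ $uri }" num="{ $num }" selected="{ $selected }"/>
```

Changing $range (for example to 567 to 689) selects a different partition without touching the rest of the query.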


Christian Grün

8:53 a.m.

Dear Wiard,

> My question is: do the scores returned represent tf/idf?

It depends on your full-text index: which scoring mode did you choose? Also, please have a look at the Query Info (Query -> Query Info) to get more insight into the internals.

Best, Christian
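
One way to sanity-check returned scores is to compute a textbook tf-idf value by hand on a tiny corpus and compare the ranking. The sketch below uses the common definition tf(t, d) * ln(N / df(t)); this is an assumption for illustration, not a statement of the formula BaseX uses internally, and values may differ by normalization. The corpus, the element names and the term are made up:

```xquery
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: made-up term and corpus, for illustration only :)
declare variable $term := 'kleur';
declare variable $docs := (
  <doc id="let001">de kleur van de lucht en de kleur van het koren</doc>,
  <doc id="let002">een brief over geld</doc>,
  <doc id="let003">nieuwe kleur op het doek</doc>
);

(: N: number of documents; df: number of documents containing the term.
   The term must occur in at least one document, otherwise $n div $df fails. :)
let $n  := count($docs)
let $df := count($docs[tokenize(lower-case(.), '\W+') = $term])
for $doc in $docs
let $tokens := tokenize(lower-case($doc), '\W+')[. ne '']
(: tf: relative frequency of the term within this document :)
let $tf  := count($tokens[. = $term]) div count($tokens)
(: idf: natural logarithm of N / df :)
let $idf := math:log($n div $df)
return <tfidf doc="{ $doc/@id }" tf="{ $tf }" idf="{ $idf }">{ $tf * $idf }</tfidf>
```

If the ordering of documents by this value matches the ordering by the scores of a contains text query on the same data, that is at least consistent with tf/idf-style scoring.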


Wiard Vasen

9:04 a.m.

Dear Christian,

When I initialized the database, I ticked the TF/IDF checkbox in the 'Full Text' properties. So I think that 'score' in the query returns this score.

Do you think I am right?

Thanks in advance for your answer.

Kind regards,

Wiard

2011/4/5 Christian Grün christian.gruen@gmail.com

...

Dear Wiard,

My question is: Do the scores that are returned represent tf/idf?

...
It depends on your full text index; which scoring mode have you chosen? Next, please have a look at the Query Info (Query -> Query Info) to get some more insight into the internals.

Best, Christian

...
Kind regards,

Wiard

2011/4/3 Wiard Vasen wiard.vasen@gmail.com

...
Thank you very much Andreas,

You, Christian and Leonard have helped me a lot!

Have a nice evening!

Regards,

Wiard

2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de

...
My idea was that you first get the positions of the first and last documents in your range (first query), note them, and then insert them into the second query as x and y.

-- Andreas

On 03.04.2011 at 18:44, Wiard Vasen wrote:

Hi Andreas,

Maybe I don't understand the query you suggested. I worked it out this way:

for $n at $pos in db:open("tfidfbrievenvangogh")
where $pos > ( "let001.xml") and $pos < ( "let201.xml")
return
  for $i in $n//*
  where $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

I do understand the error, though: xs:integer and xs:string can't be compared.

How do I improve this query, so that it works?

Regards,

Wiard

2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de

...
Hi Wiard,

you can get the positions of the first and last documents you want with the following query (add a where clause to select the right documents, such as where ends-with(base-uri($n), "test.doc")):

for $n at $pos in db:open("tfidfbrievenvangogh")
return <hit><doc>{base-uri($n)}</doc><pos>{$pos}</pos></hit>

Then use these positions for x and y in the query below:

for $n at $pos in db:open("tfidfbrievenvangogh")
where $pos > x and $pos < y
return
  for $i in $n//*
  where $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

I hope this works.

-- Andreas

On 03.04.2011 at 18:14, Wiard Vasen wrote:

Hi Andreas,

Thanks a lot! It works fine.

I was wondering whether, instead of entering the following query in BaseX:

for $n in ("let680.xml", "let681.xml", "let682.xml", "let683.xml", "let684.xml", "let685.xml", "let686.xml", "let687.xml", "let688.xml", "let689.xml")
return
  for $i at $pos in db:open("tfidfbrievenvangogh")//*
  where ends-with(base-uri($i), $n) and $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

it is also possible to do something like:

for ("let680.xml" )<= $n <= ("let689")
return
  for $i at $pos in db:open("tfidfbrievenvangogh")//*
  where ends-with(base-uri($i), $n) and $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

That way I hope to name only the first and last documents of a subset and get all the documents in between, with the outer documents included.

Do you think this is possible in a query like shown above?

Regards,

Wiard

2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de

...
Hi Wiard,

I hope I understand your plans; here is what I would do:

for $n in ("let567.xml", "let689.xml")
return
  for $i at $pos in db:open("tfidfbrievenvangogh")//*
  where ends-with(base-uri($i), $n) and $i[text() contains text 'kleur']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'kleur'])}</score></hit>

Now you can extend the list for $n with all the file names you want to include.

I hope this helps, Andreas

On 03.04.2011 at 14:24, Wiard Vasen wrote:

Hi Andreas,

Wow! This is the complete answer to my question!

I hope you can help me with the next question. Because I am analyzing changes in the artistic life of Van Gogh, I am partitioning the relatively large repository of annotated XML files on the basis of residence.

For that reason I need a query like:

for $i at $pos in db:open("tfidfbrievenvangogh")//*
where $i[text() contains text 'kleur']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'kleur'])}</score></hit>

with one extension: it should be restricted to a given interval, i.e. all XML files between let567.xml and let689.xml. I know that the letters in this partition were written while Van Gogh was in Arles, and I want to know the tf-idf score of the Dutch word 'kleur'.

To summarize my question: How do I partition the repository into subsets, so that I can produce information on these subsets? And how do I do this in BaseX with XQuery?

Thanks a lot beforehand!

Kind regards,

Wiard

2011/4/3 Andreas Weiler andreas.weiler@uni-konstanz.de

Hi Wiard,

you could use the base-uri function of XQuery, like this (it can probably be done more easily):

for $i at $pos in db:open("DB")//*
where $i[text() contains text 'xml']
return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'xml'])}</score></hit>

-- Andreas

On 03.04.2011 at 12:42, Wiard Vasen wrote:

Dear Christian and Andreas,

Thanks for your great help! I used Christian's solution: ft:score(db:open("DB")//*[text() contains text 'xml']) And it works fine.

The next step is that I want to get the documents associated with these scores.

Could you help me with this step?

The result I get now is a list of frequencies, without references to the particular documents. I think what is needed is the tf/idf score.

Regards,

Wiard

Christian Grün

9:20 a.m.

Dear Wiard,

what does the query info tell you? Just copy&paste the info to this list.

Thanks, Christian

Wiard Vasen

10:52 a.m.

Hi Christian,

This is the result of the query:

<document uri="file:/Users/wiardvasen/Desktop/brievenvangogh/let001.xml"/>
<document uri="file:/Users/wiardvasen/Desktop/brievenvangogh/let002.xml"/>
<document uri="file:/Users/wiardvasen/Desktop/brievenvangogh/let003.xml"/>
<document uri="file:/Users/wiardvasen/Desktop/brievenvangogh/let004.xml"/>
<document uri="file:/Users/wiardvasen/Desktop/brievenvangogh/let005.xml">
  <hit score="0.03590675482297878">
    <ab xmlns="http://www.tei-c.org/ns/1.0" rend="indent">How is your boarding-house? Is it still to your liking? That’s important. Above all, you must write more about the kind of things you see. Sunday a fortnight ago I was in Amsterdam to see an exhibition of the paintings going to Vienna from here.<anchor n="6" xml:id="note-t-6"/>It was very interesting, and I’m curious<pb f="1r" n="4" xml:id="pb-trans-1r-4" facs="#zone-pb-1r-4"/>as to the impression the Dutch will make in Vienna.</ab>
  </hit>
</document>

And this is the query:

let $range := 1 to 5
for $doc in collection('christian')
let $uri := base-uri($doc),
    $num := substring($uri, string-length($uri) - 6, 3)
where $num castable as xs:integer and xs:integer($num) = $range
return <document uri='{$uri}'>{
  for $n score $s in $doc//*[text() contains text 'above']
  return <hit score='{$s}'>{ $n }</hit>
}</document>
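
The file-number filter used in this query can be illustrated outside BaseX. What follows is a minimal Python sketch, assuming URIs that end in letNNN.xml. The function names letter_number and in_range and the sample URIs are hypothetical; they only mirror the substring / castable / range logic of the XQuery and are not part of BaseX.

```python
import re


def letter_number(uri):
    """Return the three-digit letter number of a URI ending in 'letNNN.xml', else None."""
    # XQuery's substring($uri, string-length($uri) - 6, 3) is 1-based; in Python's
    # 0-based slicing these are the three characters just before the '.xml' suffix.
    digits = uri[-7:-4]
    # Stricter than 'castable as xs:integer': only three plain ASCII digits are accepted.
    if re.fullmatch(r"[0-9]{3}", digits):
        return int(digits)
    return None


def in_range(uris, first, last):
    """Keep the URIs whose letter number lies between first and last (inclusive)."""
    selected = []
    for uri in uris:
        number = letter_number(uri)
        if number is not None and first <= number <= last:
            selected.append(uri)
    return selected


if __name__ == "__main__":
    # Hypothetical sample URIs, not the real collection.
    sample = ["file:/data/let001.xml", "file:/data/let005.xml",
              "file:/data/let006.xml", "file:/data/notes.xml"]
    print(in_range(sample, 1, 5))
```

Comparing numbers extracted from the file names, rather than comparing positions with file-name strings, is what avoids the xs:integer / xs:string error mentioned earlier in the thread.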

Kind regards,

Wiard

Christian Grün

11:20 a.m.

Hi Wiard,

looks like you sent me the query result. To tell you if the index was utilized, I need the output from the »Query Info« panel, or (if that won't help) the original data instances.

Christian

Wiard Vasen

11:23 a.m.

Hi Christian,

Sorry!

This is the information from the Query Info panel:

Compiling:
- binding static variable $range
- pre-evaluating collection("christian")
- optimizing descendant-or-self step(s)
- removing variable $range

Result:
for $doc in (document-node { "let001.xml" }, document-node { "let002.xml" }, ...)
let $uri := base-uri($doc)
let $num := substring($uri, string-length($uri) - 6, 3)
where $num castable as xs:integer and $num cast as xs:integer? = 1 to 5
return element { "document" } { attribute { "uri" } { $uri }, for $n score $s as xs:double in $doc/descendant::*[text() contains text "above"] return element { "hit" } { attribute { "score" } { $s }, $n } }

Timing:
- Parsing: 0.88 ms
- Compiling: 5.71 ms
- Evaluating: 93.72 ms
- Printing: 0.27 ms
- Total Time: 100.6 ms

Query plan:
<FLWR>
  <For var="$doc">
    <sequence size="927">
      <document-node() name="christian"/>
      <document-node() name="christian" pre="480"/>
      <document-node() name="christian" pre="913"/>
      <document-node() name="christian" pre="1897"/>
      <document-node() name="christian" pre="2928"/>
    </sequence>
  </For>
  <Let var="$uri">
    <FNNode name="base-uri([node])">
      <VarRef name="$doc"/>
    </FNNode>
  </Let>
  <Let var="$num">
    <FNStr name="substring(string,start[,len])">
      <VarRef name="$uri"/>
      <Arith op="-">
        <FNAcc name="string-length([item])">
          <VarRef name="$uri"/>
        </FNAcc>
        <Item value="6" type="xs:integer"/>
      </Arith>
      <Item value="3" type="xs:integer"/>
    </FNStr>
  </Let>
  <Where>
    <And>
      <Castable type="xs:integer">
        <VarRef name="$num"/>
      </Castable>
      <CmpG op="=">
        <Cast type="xs:integer?">
          <VarRef name="$num"/>
        </Cast>
        <Range>
          <Item value="1" type="xs:integer"/>
          <Item value="5" type="xs:integer"/>
        </Range>
      </CmpG>
    </And>
  </Where>
  <Return>
    <CElem>
      <Item value="document" type="xs:QName"/>
      <CAttr>
        <Item value="uri" type="xs:QName"/>
        <VarRef name="$uri"/>
      </CAttr>
      <FLWR>
        <For var="$n" score="$s as xs:double">
          <AxisPath>
            <VarRef name="$doc"/>
            <IterStep axis="descendant" test="*">
              <FTContains>
                <AxisPath>
                  <IterStep axis="child" test="text()"/>
                </AxisPath>
                <FTWords>
                  <Item value="above" type="xs:string"/>
                </FTWords>
              </FTContains>
            </IterStep>
          </AxisPath>
        </For>
        <Return>
          <CElem>
            <Item value="hit" type="xs:QName"/>
            <CAttr>
              <Item value="score" type="xs:QName"/>
              <VarRef name="$s as xs:double"/>
            </CAttr>
            <VarRef name="$n"/>
          </CElem>
        </Return>
      </FLWR>
    </CElem>
  </Return>
</FLWR>

Thanks!

Regards,

Wiard

Christian Grün

11:39 a.m.

...

This is the information from the Query Info panel: Compiling:

binding static variable $range

pre-evaluating collection("christian")

optimizing descendant-or-self step(s)

removing variable $range

Your query seems to be too complex to be evaluated via the full-text index (probably because of the nested FLWOR expression); otherwise, the query info would contain the following line at least once:

- applying full-text index

If you don't want to spend too much time rewriting your query, you might as well access the index directly, such as:

for $d in collection('coll')
for $x in ft:search($d, 'text')
where $x/ancestor::node()[. = $d]
return ft:score($x)
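
The check described in this message (look for the line "applying full-text index" in the Query Info output) can be sketched as follows. This is only an illustration in Python that works on pasted Query Info text; the helper name index_applied is hypothetical and is not part of BaseX.

```python
def index_applied(query_info):
    """Return True if any line of the pasted Query Info text reports index usage."""
    marker = "applying full-text index"
    return any(marker in line.lower() for line in query_info.splitlines())


if __name__ == "__main__":
    # Compile log as pasted in this thread: no index line is present.
    pasted = """Compiling:
- binding static variable $range
- pre-evaluating collection("christian")
- optimizing descendant-or-self step(s)
- removing variable $range"""
    print(index_applied(pasted))
```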

Hope this helps, Christian

Wiard Vasen

2:35 p.m.

Hi Christian,

This query gives good results on tf-scores:

ft:score(db:open("tfidfbrievenvangogh")//*[text() contains text 'man'])

But the problem is that I need the specific documents connected with the given scores.

For that reason I thought that the following query:

let $range := 1 to 640
for $doc in collection('tfidfbrievenvangogh')
let $uri := base-uri($doc),
    $num := substring($uri, string-length($uri) - 6, 3)
where $num castable as xs:integer and xs:integer($num) = $range
return <document uri='{$uri}'>{
  for $n score $s in $doc//*[text() contains text 'above']
  return <hit score='{$s}'>{ $n }</hit>
}</document>

would automatically give the tf/idf score, because 'score' is a reserved word here and TF/IDF was checked when the full-text repository was initialized.
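
For orientation, here is a minimal sketch of a textbook tf-idf computation in Python. It is an illustration under stated assumptions only: the exact formula BaseX uses for its TF/IDF scoring is not given in this thread and may differ, and the function name tf_idf and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(term, docs):
    """Textbook tf-idf of one term for every document.

    docs maps a document name to its list of tokens.
    tf  = occurrences of the term / number of tokens in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    containing = sum(1 for tokens in docs.values() if term in tokens)
    if containing == 0:
        # The term occurs nowhere, so every score is zero.
        return {name: 0.0 for name in docs}
    idf = math.log(len(docs) / containing)
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)[term] / len(tokens) if tokens else 0.0
        scores[name] = tf * idf
    return scores


if __name__ == "__main__":
    # Hypothetical toy corpus, not taken from the letters.
    toy = {
        "let001.xml": ["de", "kleur", "van", "de", "lucht"],
        "let002.xml": ["een", "brief", "aan", "theo"],
    }
    print(tf_idf("kleur", toy))
```

Because the score is attached to a document name here, it shows the pairing of score and document that the question is after.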

I am not sure whether this thought is right.

Maybe you know the answer.

Kind regards,

Wiard

Andreas Weiler

3:35 p.m.

Hi Wiard,

try the following:

let $range := 1 to 640
for $doc in collection('tfidfbrievenvangogh')
let $uri := base-uri($doc),
    $num := substring($uri, string-length($uri) - 6, 3)
where $num castable as xs:integer and xs:integer($num) = $range
return <document uri='{$uri}'>{
  for $n in $doc//*
  where $n contains text 'above'
  return <hit score='{ft:score($n[text() contains text 'above'])}'>{ $n }</hit>
}</document>

-- Andreas

Am 05.04.2011 um 20:35 schrieb Wiard Vasen:

...

Hi Christian,

This query gives good results on tf-scores:

ft:score(db:open("tfidfbrievenvangogh")//*[text() contains text 'man'])

But the problem is that I need the specific documents connected with the given scores.

For that reason I thought that the following query:

let $range := 1 to 640 for $doc in collection('tfidfbrievenvangogh') let $uri := base-uri($doc), $num := substring($uri, string-length($uri) - 6, 3) where $num castable as xs:integer and xs:integer($num) = $range return <document uri='{$uri}'>{ for $n score $s in $doc//*[text() contains text 'above'] return <hit score='{$s}'>{ $n }</hit> }</document>

would automatically give the tf/idf score because here 'score' is a reserved word and tf/idf where checked while initializing the full-text repository.

I am not sure whether this thought is right.

Maybe you know the answer.

Kind regards,

Wiard

2011/4/5 Christian Grün christian.gruen@gmail.com

...
This is the information from the Query Info panel: Compiling:

binding static variable $range

pre-evaluating collection("christian")

optimizing descendant-or-self step(s)

removing variable $range

Your query seems to be too complex to be evaluated via the full-text index (probably because the nested flwor expression); otherwise, the query info would contain the following line at least once:

applying full-text index

If you don't want to spend too much time into rewriting your query, you might as well access the index directly, such as:

for $d in collection('coll') for $x in ft:search($d, 'text') where $x/ancestor::node()[. = $d] return ft:score($x)

Hope this helps, Christian

...
Result: for $doc in (document-node { "let001.xml" }, document-node { "let002.xml" }, ...) let $uri := base-uri($doc) let $num := substring($uri, string-length($uri) - 6, 3) where $num castable as xs:integer and $num cast as xs:integer? = 1 to 5 return element { "document" } { attribute { "uri" } { $uri }, for $n score $s as xs:double in $doc/descendant::*[text() contains text "above"] return element { "hit" } { attribute { "score" } { $s }, $n } } Timing:

Parsing: 0.88 ms

Compiling: 5.71 ms

Evaluating: 93.72 ms

Printing: 0.27 ms

Total Time: 100.6 ms

Query plan:

<FLWR> <For var="$doc"> <sequence size="927"> <document-node() name="christian"/> <document-node() name="christian" pre="480"/> <document-node() name="christian" pre="913"/> <document-node() name="christian" pre="1897"/> <document-node() name="christian" pre="2928"/> </sequence> </For> <Let var="$uri"> <FNNode name="base-uri([node])"> <VarRef name="$doc"/> </FNNode> </Let> <Let var="$num"> <FNStr name="substring(string,start[,len])"> <VarRef name="$uri"/> <Arith op="-"> <FNAcc name="string-length([item])"> <VarRef name="$uri"/> </FNAcc> <Item value="6" type="xs:integer"/> </Arith> <Item value="3" type="xs:integer"/> </FNStr> </Let> <Where> <And> <Castable type="xs:integer"> <VarRef name="$num"/> </Castable> <CmpG op="="> <Cast type="xs:integer?"> <VarRef name="$num"/> </Cast> <Range> <Item value="1" type="xs:integer"/> <Item value="5" type="xs:integer"/> </Range> </CmpG> </And> </Where> <Return> <CElem> <Item value="document" type="xs:QName"/> <CAttr> <Item value="uri" type="xs:QName"/> <VarRef name="$uri"/> </CAttr> <FLWR> <For var="$n" score="$s as xs:double"> <AxisPath> <VarRef name="$doc"/> <IterStep axis="descendant" test="*"> <FTContains> <AxisPath> <IterStep axis="child" test="text()"/> </AxisPath> <FTWords> <Item value="above" type="xs:string"/> </FTWords> </FTContains> </IterStep> </AxisPath> </For> <Return> <CElem> <Item value="hit" type="xs:QName"/> <CAttr> <Item value="score" type="xs:QName"/> <VarRef name="$s as xs:double"/> </CAttr> <VarRef name="$n"/> </CElem> </Return> </FLWR> </CElem> </Return> </FLWR>

Thanks!

Regards,

Wiard

2011/4/5 Christian Grün christian.gruen@gmail.com

...
Hi Wiard,

looks like you sent me the query result. To tell you if the index was utilized, I need the output from the »Query Info« panel, or (if that won't help) the original data instances.

Christian

On Tue, Apr 5, 2011 at 4:52 PM, Wiard Vasen wiard.vasen@gmail.com wrote:

...
Hi Christian,

This is the result of the query:

<document uri="file:/Users/wiardvasen/Desktop/brievenvangogh/let001.xml"/>
<document uri="file:/Users/wiardvasen/Desktop/brievenvangogh/let002.xml"/>
<document uri="file:/Users/wiardvasen/Desktop/brievenvangogh/let003.xml"/>
<document uri="file:/Users/wiardvasen/Desktop/brievenvangogh/let004.xml"/>
<document uri="file:/Users/wiardvasen/Desktop/brievenvangogh/let005.xml">
  <hit score="0.03590675482297878"> <ab xmlns="http://www.tei-c.org/ns/1.0" rend="indent">How is your boarding-house? Is it still to your liking? That’s important. Above all, you must write more about the kind of things you see. Sunday a fortnight ago I was in Amsterdam to see an exhibition of the paintings going to Vienna from here.<anchor n="6" xml:id="note-t-6"/>It was very interesting, and I’m curious<pb f="1r" n="4" xml:id="pb-trans-1r-4" facs="#zone-pb-1r-4"/>as to the impression the Dutch will make in Vienna.</ab> </hit>
</document>

And this is the query:

let $range := 1 to 5
for $doc in collection('christian')
let $uri := base-uri($doc),
    $num := substring($uri, string-length($uri) - 6, 3)
where $num castable as xs:integer and xs:integer($num) = $range
return <document uri='{$uri}'>{
  for $n score $s in $doc//*[text() contains text 'above']
  return <hit score='{$s}'>{ $n }</hit>
}</document>

Kind regards,

Wiard

2011/4/5 Christian Grün <christian.gruen@gmail.com>

> Dear Wiard,
> what does the query info tell you? Just copy&paste the info to this list.
> Thanks
> Christian
>
> On Tue, Apr 5, 2011 at 3:04 PM, Wiard Vasen <wiard.vasen@gmail.com> wrote:
>> Dear Christian,
>> When I initialized the database I marked in 'Full Text' properties the TF/IDF checkbox.
>> So, I think that 'score' in the query gives this score back.
>> Do you think I am right?
>> Thanks in advance for your answer.
>> Kind regards,
>> Wiard

Wiard Vasen

6 Apr

2:15 a.m.

Hi Andreas,

Thanks a lot for your solution! I am going to look at the result to see if it gives the right scores and will get back to you.

Kind regards,

Wiard

2011/4/5 Andreas Weiler andreas.weiler@uni-konstanz.de

...

Hi Wiard,

try the following:

let $range := 1 to 640
for $doc in collection('tfidfbrievenvangogh')
let $uri := base-uri($doc),
    $num := substring($uri, string-length($uri) - 6, 3)
where $num castable as xs:integer and xs:integer($num) = $range
return <document uri='{$uri}'>{
  for $n in $doc//*
  where $n contains text 'above'
  return <hit score='{ft:score($n[text() contains text 'above'])}'>{ $n }</hit>
}</document>

-- Andreas

On 05.04.2011 at 20:35, Wiard Vasen wrote:

Hi Christian,

This query gives good results on tf-scores:

ft:score(db:open("tfidfbrievenvangogh")//*[text() contains text 'man'])


Wiard Vasen

11:30 a.m.

Hi Andreas,

I think it all works. I thank you very much!

Regards,

Wiard


Leonard Wörteler

3 Apr

1:03 p.m.

Dear Wiard,

On 03.04.2011 at 18:44, Wiard Vasen wrote:

...

for $n at $pos in db:open("tfidfbrievenvangogh")
where $pos > ( "let001.xml") and $pos < ( "let201.xml")
return
  for $i in $n//*
  where $i[text() contains text 'above']
  return <hit>{base-uri($i)}<score>{ft:score($i[text() contains text 'above'])}</score></hit>

Does this (rather drastic) rewrite do approximately what you want?

let $range := 1 to 201
for $doc in collection('tfidfbrievenvangogh')
let $uri := base-uri($doc),
    $num := substring($uri, string-length($uri) - 6, 3)
where $num castable as xs:integer and xs:integer($num) = $range
return <document uri='{$uri}'>{
  for $n score $s in $doc//*[text() contains text 'above']
  return <hit score='{$s}'>{ $n }</hit>
}</document>

I added some structure to the output (which you can easily remove), used only standard functions and optimized the query for performance where I saw fit.

Hope that helps, cheers Leo

Wiard Vasen

1:19 p.m.

Thank you Leo,

It works perfectly! You have helped me a lot.

Have a nice day!

Regards,

Wiard


participants (4)

Andreas Weiler
Christian Grün
Leonard Wörteler
Wiard Vasen