ft:mark and differing result sets on 10.2


Chris Yocum

2 Oct 2022, 5:41 a.m.

Hi,

I had a chance to come back to BaseX after a while, when I needed to run some queries on my data set. I updated to 10.2 and loaded the files. This was all fine.

I then ran this query:

for $x in db:get($db)/sample/entry return ft:mark($x[descendant::text() contains text {'fas'} using wildcards])

which ran in 2528.78 ms and returned 170 results. This seemed rather slow, so I started to work on it. I also ran this:

for $x in db:get($db)/sample/entry return $x[descendant::text() contains text {'fas'} using wildcards]

which ran in 57.77 ms and returned 35 results. I was very surprised by this, as I had expected it to run in the same time and return the same result set. I then started to look deeper, and it seems like ft:mark does more than add an XML tag around the matched token?

My question is this: does ft:mark do more than what I had expected it to do? Am I misunderstanding or miswriting this query?

Thank you and all the best, Chris


Christopher Yocum

2 Oct 2022, 8:26 a.m.

Hi Everyone,

I have a follow-up on the issue that I had. I think I figured out at least one thing: I had diacritics set to true, so it was only returning those results without diacritics. Knowing that, I was able to get the two queries to return the same results and run faster. However, I ran into a new, interesting problem. When I have this:

ft:mark(db:get($db)/sample/entry[descendant::text() contains text 'fas'])

I see "- apply full-text index for "fas" using language """ which takes 164.24 ms but when I have this:

ft:mark(db:get('edil-new')/sample/entry[descendant::text() contains text 'fas' using wildcards])

I do not see the full-text index being applied and it takes 2283.14 ms (much like I had seen before).

My expectation was that adding a full-text option would not have triggered such a large change in performance. The slowdown also occurs when I haven't actually used any wildcards in the query but have just turned the option on. Would this be expected behaviour?

Thanks and all the best, Chris
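
The diacritics setting mentioned above has a query-level counterpart in the XQuery Full Text match options. The following is a minimal sketch only: it uses an invented in-memory entry (the word "fás" is a hypothetical illustration, not taken from the thread's data) and says nothing about how the index options of a particular database were configured.

```xquery
(: Hypothetical sample entry; not taken from the thread's data set. :)
let $entry := <entry><form>fás</form></entry>
return (
  (: accents are ignored when comparing tokens :)
  $entry//text() contains text 'fas' using diacritics insensitive,
  (: accents are respected when comparing tokens :)
  $entry//text() contains text 'fas' using diacritics sensitive
)
```

Comparing the two results for a handful of known entries may help explain why two otherwise similar queries return result sets of different sizes.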


Christian Grün

5 Oct 2022, 9:25 a.m.

Hi Chris,

...

> for $x in db:get($db)/sample/entry return ft:mark($x[descendant::text() contains text {'fas'} using wildcards])
>
> which ran in 2528.78 ms and returned 170 results. This seemed rather slow, so I started to work on it. I also ran this:
>
> for $x in db:get($db)/sample/entry return $x[descendant::text() contains text {'fas'} using wildcards]
>
> which ran in 57.77 ms and returned 35 results. I was very surprised by this, as I had expected it to run in the same time and return the same result set.

It’s tricky for the optimizer to rewrite the first expression in a way which both finds results in the index and marks the results. One common (albeit not obvious) solution is to define a search function and call it twice:

let $test := function($node) {
  $node/descendant::text() contains text { $fas } using wildcards
}
for $entry in db:get($db)/sample/entry[$test(.)]
return ft:mark($entry[$test(.)])

The query optimizer will inline the code and can then rewrite it for index access.
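
The query above leaves $db and $fas unbound. A self-contained sketch of the same pattern might look as follows; the default values 'edil-new' and 'fas' are assumptions taken from elsewhere in this thread, and whether the optimizer actually rewrites the query for index access should be confirmed in the query info output.

```xquery
(: Hypothetical bindings: the thread does not show how $db and $fas are set.
   'edil-new' and 'fas' are the values used elsewhere in the thread. :)
declare variable $db  as xs:string external := 'edil-new';
declare variable $fas as xs:string external := 'fas';

(: One search function, used once for filtering and once inside ft:mark :)
let $test := function($node) {
  $node/descendant::text() contains text { $fas } using wildcards
}
for $entry in db:get($db)/sample/entry[$test(.)]
return ft:mark($entry[$test(.)])
```

Because both variables are declared external with defaults, they can be overridden when the query is run without editing the query text.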

...

> ft:mark(db:get('edil-new')/sample/entry[descendant::text() contains text 'fas' using wildcards])
>
> I do not see the full-text index being applied and it takes 2283.14 ms (much like I had seen before).

My assumption would be that edil-new contains no full-text index (?).
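
One way to check this assumption is to look at the database information. This is a rough sketch: db:info is part of the BaseX Database Module, but the filter on element names containing "index" is an assumption about the layout of the report, so the unfiltered output of db:info('edil-new') is the safer thing to read.

```xquery
(: Sketch: 'edil-new' is the database name used in the thread.
   The name filter is an assumption about the report layout;
   drop the path step to see the complete report. :)
db:info('edil-new')//*[contains(local-name(), 'index')]
```

If the report shows no full-text index, the Database Module documentation describes creating one with db:optimize('edil-new', true(), map { 'ftindex': true() }), run as a separate updating query.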

Best, Christian

Chris Yocum

12:55 p.m.

Hi Christian,

Thank you very much for your comments on this. I ended up doing this:

ft:mark(ft:search("edil-new", "fas", map {"wildcards":"true"})/ancestor::entry)

which gives me everything that I was expecting and is performant as expected. Basically, I took this:

for $x in db:get('edil-new')/sample/entry return $x[descendant::text() contains text {'fas'}]

then looked at the Optimised Query in the GUI and replicated that and translated the "using wildcards" into the map from the documentation.
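
For comparison, here is a sketch of the same call with an actual wildcard in the search term. It assumes that 'edil-new' has a full-text index, uses the boolean true() for the option value, and uses the illustrative pattern 'fas.*' (in BaseX wildcard syntax, '.*' stands for zero or more characters); none of these details come from the original queries.

```xquery
(: Sketch only: assumes 'edil-new' exists and has a full-text index.
   'fas.*' is an illustrative pattern, not one used in the thread. :)
ft:mark(
  (: ft:search returns text nodes, so step up to the enclosing entries :)
  ft:search('edil-new', 'fas.*', map { 'wildcards': true() })/ancestor::entry
)
```

The call is kept inline inside ft:mark, mirroring the query above, rather than bound to a variable first.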

Writing query optimizers is very hard. I have never tried my hand at it but writing them is an art and a science from what I have read.

Thanks and all the best, Chris



basex-talk@mailman.uni-konstanz.de
