Full-text speed

Thomas Goossens

10 Feb 2010

6:24 p.m.

Hello,

I am trying XQuery Full-text on BaseX and I am a bit surprised by the full-text query speed: I have loaded the Shakespeare plays into a BaseX database, and created a full-text index. So far so good.

Then I tried a query like: //LINE[ . contains text "romeo juliet" all words] (4 hits)

It takes about 1200 ms. I expected less than 100 ms. For example I tried Qizx and it takes less than 20 ms. Even eXist (old version, with a different syntax) was taking around 200 ms.

I tried dropping the full-text index: that makes no difference! So clearly the FT index is not used. What should I do?

Thanks

Christian Grün

10 Feb 2010

6:42 p.m.

Hi Thomas,

your query will be evaluated much faster if you rewrite it to..

//LINE[ text() contains text "romeo juliet"]

This query should take ~3-5 ms on the 7.5 MB Shakespeare instance.

You can have a look into our XQuery documentation (http://basex.org/xquery, Section »Query Evaluation«) to get more insight on query compilation and how to utilize the index structures.

Hope this helps,
Christian
___________________________

Christian Gruen
Universitaet Konstanz
Department of Computer & Information Science
D-78457 Konstanz, Germany
Tel: +49 (0)7531/88-4449, Fax: +49 (0)7531/88-3577
http://www.inf.uni-konstanz.de/~gruen

Thomas Goossens

7:14 p.m.

---------- Forwarded message ----------
From: Christian Grün christian.gruen@gmail.com
Date: Thu, Feb 11, 2010 at 12:42 AM
Subject: Re: [basex-talk] Full-text speed
To: Thomas Goossens thomgooss@gmail.com
Cc: basex-talk@mailman.uni-konstanz.de

Christian Grün

8 p.m.

> that's not my query: I was using 'all words', and my query still runs without index.

Correct; first of all, the following sub-expressions are equivalent:

– "romeo juliet" all words
– "romeo" ftand "juliet"

..but in fact "text()" is not equivalent to the context item ".". The following expression shows another alternative..

a) //SPEECH[ .//text() contains text "romeo" ftand "juliet"]

..but to really get equivalent results, you should go along with:

b) //SPEECH[ .//text() contains text "romeo"][ .//text() contains text "juliet"]
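The difference between a) and b) can be modelled outside of XQuery. The following Python sketch is only an illustration under stated assumptions: it treats a SPEECH as a plain list of text-node strings, uses a naive lowercase word tokenizer, and assumes that `contains text` succeeds as soon as a single item of the search context satisfies the whole selection. The function names and the sample speech are made up for this example; none of this is BaseX code.

```python
import re

def tokens(text):
    """Naive tokenizer: lowercase alphanumeric runs (an assumption, not the BaseX tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def query_a(text_nodes, words):
    """Model of a): one single text node has to contain all words (ftand)."""
    return any(all(w in tokens(node) for w in words) for node in text_nodes)

def query_b(text_nodes, words):
    """Model of b): chained predicates, each word may match in a different text node."""
    return all(any(w in tokens(node) for node in text_nodes) for w in words)

# Hypothetical speech: the speaker name and the lines are separate text nodes.
speech = ["ROMEO", "Is the day so young?", "Where is Juliet?"]

result_a = query_a(speech, ["romeo", "juliet"])  # no single node holds both words
result_b = query_b(speech, ["romeo", "juliet"])  # each word occurs in some node
print(result_a, result_b)
```

In this model a) misses the speech because the two words sit in different text nodes, while b) accepts it.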

> I think that 'text()' is not equivalent to '.'.
>
> For example I could not use text() here:
> //SPEECH[ . contains text "romeo juliet" all words]
> By the way, this query just doesn't work (while it should return 42 hits).

This might be due to the phenomenon of node atomization, which is handled differently by all implementations. The following query..

<xml>A<x>B</x></xml> contains text 'A'

..returns "false" in BaseX, whereas other implementations might return "true", if "A" and "B" are handled as two independent tokens. If you apply query b) above, you might get the results you are expecting.
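As a rough illustration of the two outcomes, the Python sketch below computes the string value of the element (all descendant text concatenated) and tokenizes it once, then tokenizes each text node separately. The tokenizer is a naive stand-in chosen for this example, not the one used by BaseX or any other implementation.

```python
import re
import xml.etree.ElementTree as ET

def tokenize(text):
    """Naive tokenizer: lowercase alphanumeric runs (assumption for this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())

root = ET.fromstring("<xml>A<x>B</x></xml>")

# Atomization: the string value of <xml> is the concatenation "AB".
string_value = "".join(root.itertext())
joined_tokens = tokenize(string_value)

# Alternative: every text node is tokenized on its own.
per_node_tokens = [t for text in root.itertext() for t in tokenize(text)]

print(joined_tokens)    # tokens of the concatenated string value
print(per_node_tokens)  # tokens when element boundaries separate words
```

With the concatenated string value there is no token "a" to match, which corresponds to the "false" result; with per-node tokenization the token exists.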

Best, Christian

Thomas Goossens

11 Feb 2010

9:49 a.m.

On Thu, Feb 11, 2010 at 2:00 AM, Christian Grün christian.gruen@gmail.com wrote:

> ..but to really get equivalent results, you should go along with:
>
> b) //SPEECH[ .//text() contains text "romeo"][ .//text() contains text "juliet"]

OK, thanks, Christian, but:

1) I suspect that programmers in XQuery FT will not like to rewrite their query until it works.
2) There are some cases when you cannot rewrite. For example, imagine you have a paragraph like this:

This is a really strong statement!

and that the query is:

//p[ . contains text "really strong statement"]

I don't see a way to rewrite using text() so that it works in the general case. And from what I understand of XQuery FT, that query should work in any implementation.

I have the feeling that currently, BaseX cannot match a FT query across several text() nodes, am I wrong?

> This might be due to the phenomenon of node atomization, which is handled differently by all implementations. The following query..
>
> <xml>A<x>B</x></xml> contains text 'A'
>
> ..returns "false" in BaseX, whereas other implementations might return "true", if "A" and "B" are handled as two independent tokens. If you apply query b) above, you might get the results you are expecting.

Sorry, I am confused. Why do you speak of 'atomization'?

I really think that all implementations should recognize "romeo" and "juliet" as independent words in Shakespeare's plays...

Best regards

Christian Grün

3:05 p.m.

> I suspect that programmers in XQuery FT will not like to rewrite their query until it works. ... I don't see a way to rewrite using text() so that it works in the general case.

Note that all XQuery Full Text queries "work" in BaseX, but not all of them take advantage of the optional full-text index. The reason is that we initially put most of our effort into 100% compliance with the XQFT specification (to the best of our knowledge, we are still the only implementation that complies 100% with the specs, although other implementations are coming closer), and we are gradually increasing the number of XQuery expressions that are recognized by the query optimizer.

> I have the feeling that currently, BaseX cannot match a FT query across several text() nodes, am I wrong?

...they won't utilize the index.

> Sorry, I am confused. Why do you speak of 'atomization'? I really think that all implementations should recognize "romeo" and "juliet" as independent words in Shakespeare's plays...

By default, whitespace nodes are chopped by the BaseX XML parser; that's why snippets like...

<SPEAKER>ROMEO</SPEAKER><LINE>Is the day so young?</LINE>

..are tokenized to "romeois", "the", "day", etc. This may look pretty weird, but it makes sense if you look at examples like..

"This is funny" contains text "This is funny"

..which will return "false" in some other implementations. Both approaches are correct, as the specification says that "Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization" (http://www.w3.org/TR/2010/CR-xpath-full-text-10-20100128/#tq-ftsearch-xml).
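The effect of whitespace chopping on token boundaries can be sketched in Python. This is a simplified model under assumptions: the wrapping SPEECH element and the line breaks between the elements are invented for the example, "chopping" is modelled as dropping whitespace-only text nodes, and the tokenizer is a naive stand-in rather than the BaseX one.

```python
import re
import xml.etree.ElementTree as ET

def tokenize(text):
    """Naive tokenizer: lowercase alphanumeric runs (assumption for this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def string_value(xml, chop_whitespace):
    """Concatenate all text nodes, optionally dropping whitespace-only ones."""
    parts = list(ET.fromstring(xml).itertext())
    if chop_whitespace:
        parts = [p for p in parts if p.strip()]
    return "".join(parts)

# Hypothetical input: the elements are separated by line breaks only.
doc = """<SPEECH>
<SPEAKER>ROMEO</SPEAKER>
<LINE>Is the day so young?</LINE>
</SPEECH>"""

chopped = tokenize(string_value(doc, chop_whitespace=True))
preserved = tokenize(string_value(doc, chop_whitespace=False))
print(chopped)    # first token is the merged "romeois"
print(preserved)  # "romeo" and "is" stay separate tokens
```

Once the whitespace-only nodes are gone, nothing separates the speaker name from the first word of the line, so the two merge into one token.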

Feel free to ask for more, Christian

Thomas Goossens

12 Feb 2010

9:45 a.m.

> Note that all XQuery Full Text queries "work" in BaseX,

No, sorry. I mentioned a query that doesn't, with or without index:

//SPEECH[ . contains text "romeo juliet" all words]

It *should* return 42 items on the whole collection of 37 plays, and it returns *nothing*. And this has no relation to the way BaseX chops words.

> but not all of them take advantage of the optional full-text index. The reason is that we initially put most effort on a 100% compliance with the XQFT specification – and, to the best of our knowledge, we are still the only implementation that complies 100% with the specs

With the specs, fine, but with the tests? ;-)

> implementations are coming closer, though) – and we are gradually increasing the number of XQuery expressions that are recognized by the query optimizer.

Perfect.

> By default, whitespace nodes are chopped by the BaseX XML parser; that's why snippets like...
>
> <SPEAKER>ROMEO</SPEAKER><LINE>Is the day so young?</LINE>
>
> ..are tokenized to "romeois", "the", "day", etc. This may look pretty weird, but it makes sense if you look at examples like..
>
> "This is funny" contains text "This is funny"

Well, this is funny indeed. If I am not mistaken, that means that BaseX would find "This" in the 2nd example but not "Romeo" in the first example. I guess that words crossing an element tag are something very rare. So in other words, BaseX works well in a very uncommon situation, but fails in much more likely cases... Well, it is your business.

Sorry if I am a bit ironical. BaseX is still an impressive product in quite a few aspects. But I think I will go on with Qizx for a while.

Regards

Christian Grün

10:09 a.m.

> //SPEECH[ . contains text "romeo juliet" all words] It *should* return 42 items on the whole collection of 37 plays, and it returns *nothing*. And no relation with the way BaseX chops words.

That's somewhat unusual, as I can't reproduce this bug here. Feel free to pass your input files on to me.

> Sorry if I am a bit ironical.

No reason to be sorry; that's your deliberate choice.

Christian

Philippe Poulard

16 Feb 2010

3:22 a.m.

Hi,

Thomas Goossens wrote:

>> By default, whitespace nodes are chopped by the BaseX XML parser; that's why snippets like...
>>
>> <SPEAKER>ROMEO</SPEAKER><LINE>Is the day so young?</LINE>
>>
>> ..are tokenized to "romeois", "the", "day", etc. This may look pretty weird, but it makes sense if you look at examples like..
>>
>> "This is funny" contains text "This is funny"
>
> Well this is funny indeed. If I am not mistaken, that means that BaseX would find "This" in the 2nd example but not "Romeo" in the first example. I guess that words crossing an element tag is something very rare. So in other terms BaseX works well in a very uncommon situation, but fails in much more likely cases... Well, it is your business.

Perhaps it would be better if an option would let the user decide which behaviour the BaseX XML parser should apply.

To go further, an adaptive behaviour would be useful for widely used XML languages, such as XHTML or DocBook:

– <div> and other block-level elements: tokenize with those boundaries
– inline-level elements: tokenize without those boundaries
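A minimal Python sketch of such an adaptive rule, under stated assumptions: the set of block-level element names (`div`, `p`) and the inline `b` element in the sample are hypothetical choices for this illustration, the tokenizer is a naive stand-in, and nothing here reflects an existing BaseX option.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical choice of elements whose tags act as token boundaries.
BLOCK_ELEMENTS = {"div", "p"}

def text_with_boundaries(elem):
    """Concatenate descendant text, padding block-level elements with spaces."""
    parts = [elem.text or ""]
    for child in elem:
        sep = " " if child.tag in BLOCK_ELEMENTS else ""
        parts.append(sep + text_with_boundaries(child) + sep)
        parts.append(child.tail or "")
    return "".join(parts)

def tokenize(elem):
    """Naive tokenizer over the boundary-aware text (assumption for this sketch)."""
    return re.findall(r"[a-z0-9]+", text_with_boundaries(elem).lower())

# Two adjacent blocks; the second word is split by an inline element.
doc = ET.fromstring("<body><div>one</div><div>tw<b>o</b></div></body>")
adaptive_tokens = tokenize(doc)
print(adaptive_tokens)
```

The adjacent blocks yield separate tokens, while the inline element inside the second block does not break its word apart.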

--
Regards,

Philippe Poulard
http://reflex.gforge.inria.fr/
Have the RefleX!

Christian Grün

8:02 a.m.

Dear Philippe,

> Perhaps it would be better if an option would let the user decide which behaviour the BaseX XML parser should apply.
>
> To go further, an adaptive behaviour would be useful for widely used XML languages, such as XHTML or DocBook:
>
> – <div> and other block-level elements: tokenize with those boundaries
> – inline-level elements: tokenize without those boundaries

...a good hint. Similar ideas were discussed, regarding a complete or partial indexing of the database, depending on certain elements. The main reason why we have objected to these ideas in the past was that we wanted to keep our user and programmer interface as simple as possible. However, this is all subject to change.

Regards, Christian
