Hi,
in my project, users look for text nodes containing certain words. However, some of these queries return a very large number of hits, many of them irrelevant. Is it possible to write queries that exclude certain words?
For example, in this document
<p>A B C D E F</p> <p>A B D E F C</p> <p>A B D E F</p> <p>A B D F</p>
when looking for
[text() contains text "A" ftand "B" ftand "E" ordered distance at most 3 words]
I would like to state explicitly that "C" must not occur at all, so that only the third node is found and not the first or second one; i.e., "C" should occur neither outside nor inside the sequence.
Is this possible at all? Can I also specify a query where "C" would be allowed to occur outside the sequence, like in the second node, but not inside like in the first node? Or the other way around?
Best regards
Cerstin
Hi Cerstin,
yes, that’s indeed possible, and even supported by the index, as long as the resulting expressions don’t get too nested. The following query should do what you are looking for:
//*[text() contains text "A" ftand ftnot 'C']
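Combined with the distance query from the original question, this could look like the sketch below. It is only an illustration: the four <p> elements are wrapped in a hypothetical <doc> root and built inline instead of being read from a database, so it says nothing about index usage.

```xquery
(: Sample paragraphs from the question, wrapped in a hypothetical <doc> root. :)
let $doc :=
  <doc>
    <p>A B C D E F</p>
    <p>A B D E F C</p>
    <p>A B D E F</p>
    <p>A B D F</p>
  </doc>
(: The parenthesized part is the original ordered/distance search;
   ftand ftnot "C" additionally rejects any text node containing "C". :)
return $doc/p[text() contains text
  ("A" ftand "B" ftand "E" ordered distance at most 3 words)
  ftand ftnot "C"
]
```

The intent is that only the third paragraph is selected: the first two contain "C" somewhere, and the fourth lacks "E".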
Is this possible at all? Can I also specify a query where "C" would be allowed to occur outside the sequence, like in the second node, but not inside like in the first node? Or the other way around?
This sounds tricky; while I wouldn’t say that it’s impossible, there is no XQuery Full Text syntax for such a particular case.
Christian
Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
//*[text() contains text "A" ftand ftnot 'C']
Thanks, this seems to work. However, I encountered strange behavior, which is probably related to mixed content.
Given this document:
<doc>
<p>1 Ich fresse Dich mit Haut und Haar <pb/> und allem drum und dran.</p>
<p>2 Ich fresse Dich mit Haut und <pb/> Haar und allem drum und dran.</p>
<p>3 Ich fresse Dich mit Haut und Fell und allem drum und dran.</p>
<p>4 Ich fresse Dich mit Haut und Pelz und allem drum und dran.</p>
<p>5 Ich werde Dich mit Haut und Haar <pb/> und allem drum und dran fressen.</p>
<p>6 Du kannst mich mit Haut und Haar und allem drum und dran fressen.</p>
</doc>
from which I created a collection with whitespace chopping OFF and stemming for German ON. Then I ran these queries:
(1) //*[text() contains text ("Haut" ftand "fressen") using stemming using language "de"]
(2) //*[text() contains text ("Haut" ftand "fressen" ftand ftnot "Haar") using stemming using language "de"]
(1) should return all <p> nodes, but does not return 5.
(2) should return 1, 3, and 4, but returns 2, 3, and 4.
Is it correct that, when looking into a node, only the text _before_ any child node is handled, i.e., for the first <p> node only up to "Haar", for the second one only up to "und", and for the fifth one only up to "Haar"?
So everything after a child node inside a particular node is ignored? As there are a lot of nodes like page breaks or line breaks in TEI documents (which carry no relevant text, only rendering information), this is rather irritating. There is no way to search the whole text of a paragraph or line node.
Best regards
Cerstin
<p>5 Ich werde Dich mit Haut und Haar <pb/> und allem drum und dran fressen.</p>
(1) //*[text() contains text ("Haut" ftand "fressen") using stemming using language "de"]
(2) //*[text() contains text ("Haut" ftand "fressen" ftand ftnot "Haar") using stemming using language "de"]
(1) should return all <p>-nodes, but does not return 5
This is actually correct, because the <p> element has two text nodes (one before and one after <pb/>), and the full-text expression is evaluated against each of them separately.
You'll get the expected results by replacing the text() step with a dot:
//*[ . contains text ("Haut" ftand "fressen") using stemming using language "de"]
It's important to add, however, that this query cannot be evaluated by the full-text index.
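The difference can be seen on paragraph 5 alone. The following is a minimal standalone sketch: the element is built inline, no database or index is involved, and stemming is left out because the literal token "fressen" occurs in the text. The variable name $p is only chosen for the example.

```xquery
(: Paragraph 5 from the sample document, built inline for illustration. :)
let $p := <p>5 Ich werde Dich mit Haut und Haar <pb/> und allem drum und dran fressen.</p>
return (
  (: number of text nodes: one before and one after <pb/> :)
  count($p/text()),
  (: each text node is tested on its own; neither contains both words :)
  $p/text() contains text ("Haut" ftand "fressen"),
  (: the string value of <p> spans both text nodes :)
  $p contains text ("Haut" ftand "fressen")
)
```

This should yield 2, false and true: "Haut" and "fressen" sit in different text nodes, so only the test against the whole element matches.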
Christian
Quoting Christian Grün christian.gruen@gmail.com:
It's important to add, however, that this query cannot be evaluated by the full-text index.
Since I have to use the full-text index to avoid time-outs in the web interface, does this mean that searching text in paragraphs in fact yields incomplete results? I'd better not tell my users ...
Oh, and the GUI of 7.3.1 constantly crashes when executing queries that take a little longer than a minute. Very annoying!
Best regards
Cerstin
basex-talk@mailman.uni-konstanz.de