Hi,
I have three questions concerning working with the full-text index:
The first question is about "distance" information. Given this query:
(1) 'contains text "Kopf Sand Stecken" all words using stemming using language "de"'
There is no difference to:
(2) 'contains text ("Nase" ftand "Sand" ftand "stecken") using stemming using language "de"'
both queries deliver 4 nodes.
If I would like to find the query terms within a certain distance, adding
'distance at most 10 words'
for (1) I get 2 nodes (a subset of the 4 from the first run), but for (2) I still get all 4 nodes. The information concerning distance doesn't seem to be considered. For my application this is no problem, since I have to go for the "ftand"-variant to get proper marking, but in general this looks strange.
The second question is about "ftand" and "ftor". If I try these queries:
(3) 'contains text ("Nase" ftand "Sand" ftand "stecken") using stemming using language "de" distance at most 10 words'
(4) 'contains text ("Kopf" ftand "Sand" ftand "stecken") using stemming using language "de" distance at most 10 words'
I get 2 hits for (3) and 11 for (4). So I assumed I would get 13 hits (the ones from (3) plus the ones from (4)) when changing the query to:
(5) 'contains text (("Nase" ftor "Kopf") ftand "Sand" ftand "stecken") using stemming using language "de" distance at most 10 words'
However, I get 6 hits -- none of them containing "Nase" (there is no difference, if the query starts with '"Nase" ftor "Kopf"' or with '"Kopf" ftor "Nase"').
Did I mess something up?
The third question is about the full-text index itself. When applying fuzzy search or using wildcards, the full-text index is not applied -- resulting in a time out on my website, I need 341859.09 ms in the GUI for applying
'ft:mark (//*[text() contains text ("Korb" ftand "geben") using fuzzy][self::*:p or self::*:l])'
to my 3 GB collection. The information at the "Full-Text" tab says:
- Structure: Trie
- Stemming: ON
- Case Sensitivity: ON
- Diacritics: ON
- Language: German
- Size: 1 GB
- Entries: 1743744
I created the full-text index with the option "Support Wildcards", too, but this information is not shown in the Database properties. When creating the index, "SET WILDCARDS true" is shown. I used stemming, case sensitivity, diacritics, and wildcards -- is this an unrecommended combination?
Thank you very much in advance
Cerstin
Cerstin, sorry for the delayed answer; rest assured that I'll give you detailed feedback as soon as I've resolved some other open issues.
On Tue, Jan 17, 2012 at 3:10 PM, Cerstin Mahlow cerstin.mahlow@unibas.ch wrote:
-- Dr. phil. Cerstin Mahlow
Universität Basel Deutsches Seminar Nadelberg 4 4051 Basel Schweiz
Tel: +41 61 267 07 65 Fax: +41 61 267 34 40 Mail: cerstin.mahlow@unibas.ch Web: http://www.oldphras.net
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
(1) 'contains text "Kopf Sand Stecken" all words using stemming using language "de"'
There is no difference to:
(2) 'contains text ("Nase" ftand "Sand" ftand "stecken") using stemming using language "de"'
both queries deliver 4 nodes.
If I would like to find the query terms within a certain distance, adding
'distance at most 10 words'
for (1) I get 2 nodes (a subset of the 4 from the first run), but for (2) I still get all 4 nodes. The information concerning distance doesn't seem to be considered. For my application this is no problem, since I have to go for the "ftand"-variant to get proper marking, but in general this looks strange.
This may well be correct, as (unfortunately) the internal data model of XQuery Full Text is pretty complex and leads to frequent misunderstandings. To give more information, I'll have to look at the actual data; do you think you can provide me with a little document that exemplifies your observation?
The second question is about "ftand" and "ftor". If I try these queries: ...
Once more, it might be helpful to have the actual data at hand.
The third question is about the full-text index itself. When applying fuzzy search or using wildcards, the full-text index is not applied -- resulting in a time out on my website, I need 341859.09 ms in the GUI for applying
Currently, the choice has to be made between efficient fuzzy or wildcard matching (the latter being based on a Trie index structure). Some information on that can be found in our Wiki [1] (btw, feel free to edit the Wiki if you feel it's incomplete!). We are working on a new index structure that will unify both index structures, improve performance, and support incremental updates. We may even eliminate the explicit choice of some other full-text options, such that those options can be dynamically chosen without the need to reindex the database.
More feature requests regarding the full-text index are welcome. Christian
Hi Christian,
I come back to some previously discussed questions:
Quoting Christian Grün christian.gruen@gmail.com:
[...]
To give more information, I'll have to look at the actual data; do you think you can provide me with a little document that exemplifies your observation?
As I am not sure whether the behavior has something to do with my actual data, I didn't create an example but put a sample of my collection, consisting of 4 smaller documents, online: http://oldphras.unibas.ch/test.tgz
//*[text() contains text ('Kopf' ftand 'Sand' ftand 'stecken') using stemming using language "de"][self::*:p or self::*:l]
gives 3 hits (in Wille, Suttner, and Cervantes)
//*[text() contains text ('Kopf' ftand 'Sand' ftand 'stecken') using stemming using language "de" distance at most 10 words][self::*:p or self::*:l]
gives 2 hits (in Wille and Suttner)
//*[text() contains text "Kopf Sand stecken" all words using stemming using language "de" distance at most 10 words][self::*:p or self::*:l]
gives 3 hits (in Wille, Suttner, and Cervantes), the "distance" option seems to be ignored.
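To see which node accounts for the difference, the two result sets can be compared directly. This is only a sketch: it assumes the sample documents are the currently opened database (so the leading `//` has a context), and the variable names are made up for illustration.

```xquery
(: Sketch only: assumes the sample documents from test.tgz are the opened
   database, so that the leading // has a context. :)
let $with-ftand :=
  //*[text() contains text ('Kopf' ftand 'Sand' ftand 'stecken')
      using stemming using language "de" distance at most 10 words]
     [self::*:p or self::*:l]
let $with-all-words :=
  //*[text() contains text "Kopf Sand stecken" all words
      using stemming using language "de" distance at most 10 words]
     [self::*:p or self::*:l]
(: Nodes that only the "all words" variant returns :)
return $with-all-words except $with-ftand
```

If the counts reported above hold, the remaining node should be the hit in Cervantes, i.e. the one where the distance restriction is apparently not applied.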
The second question is about "ftand" and "ftor".
//*[text() contains text ('Kopf' ftand 'Sand' ftand 'stecken') using stemming using language "de" distance at most 10 words][self::*:p or self::*:l]
gives 2 hits (in Wille and Suttner)
//*[text() contains text ('Nase' ftand 'Sand' ftand 'stecken') using stemming using language "de" distance at most 10 words][self::*:p or self::*:l]
gives 1 hit (in Müllenhoff)
Therefore, for
//*[text() contains text ( ('Nase' ftor 'Kopf') ftand 'Sand' ftand 'stecken') using stemming using language "de" distance at most 10 words][self::*:p or self::*:l]
I would expect to get all 3 hits, but actually get only 1 (the one in Wille). It makes no difference whether I put ('Nase' ftor 'Kopf') or ('Kopf' ftor 'Nase'). Additionally, the highlighting is strange.
In the end, I would like to search for something like this to speed up annotating the data:
( Nase | Kopf | Hals ) & ( Sand | Schlinge ) & ( ziehen | stecken )
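Written in XQuery Full Text, such a search might look like the sketch below. It reuses the options and the element filter from the queries above and again assumes that the collection is the opened database; it has not been run against the sample data, and given the ftor behavior just described, the result may be incomplete.

```xquery
(: Sketch only: each group of alternatives is combined with ftor,
   the groups themselves with ftand. :)
//*[text() contains text (
      ('Nase' ftor 'Kopf' ftor 'Hals') ftand
      ('Sand' ftor 'Schlinge') ftand
      ('ziehen' ftor 'stecken')
    ) using stemming using language "de" distance at most 10 words]
   [self::*:p or self::*:l]
```

The match options and the distance filter follow the outer parentheses, so they are meant to apply to the whole selection rather than to a single word.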
The third question is about the full-text index itself. When applying fuzzy search or using wildcards, the full-text index is not applied -- resulting in a time out on my website, I need 341859.09 ms in the GUI for applying
Currently, the choice has to be made between efficient fuzzy or wildcard matching (the latter being based on a Trie index structure).
So I can have fuzzy OR stemming and wildcard. For searching it's OK: I copied the collection and created the other index for the copy. But as I want to update the collection after searching, I would have to update both collections and re-index them after updating one. Is this correct?
Best regards
Cerstin
Dear Cerstin,
some quick feedback on your last e-mail, to make sure it doesn't get lost.
//*[text() contains text "Kopf Sand stecken" all words using stemming using language "de" distance at most 10 words][self::*:p or self::*:l]
gives 3 hits (in Wille, Suttner, and Cervantes), the "distance" option seems to be ignored.
True, this is a bug, which I've documented in a new GitHub issue [1]. Currently, this query returns different results depending on whether the full-text index is activated, while it should always return the same results. Thanks for the analysis.
So I can have fuzzy OR stemming and wildcard. For searching it's OK: I copied the collection and created the other index for the copy. But as I want to update the collection after searching, I would have to update both collections and re-index them after updating one. Is this correct?
Yep, that's correct. Another tracker entry [2] addresses this: in the future, we will only provide one single full-text index that will support both fuzzy and wildcard searches. Ideally, the new index will already support incremental updates, such that there won't be any need to re-index your data.
Christian
[1] https://github.com/BaseXdb/basex/issues/359
[2] https://github.com/BaseXdb/basex/issues/346
Hi,
I just downloaded the 7.1 stable version.
On 30.01.2012 at 01:47, Christian Grün wrote:
//*[text() contains text "Kopf Sand stecken" all words using stemming using language "de" distance at most 10 words][self::*:p or self::*:l]
gives 3 hits (in Wille, Suttner, and Cervantes), the "distance" option seems to be ignored.
True, this is a bug, which I've documented in a new GitHub issue [1].
It seems that the "ordered" option is ignored as well.
Best regards
Cerstin
Hi Cerstin,
thanks; this issue seems in fact to be related to the distance issue [1], which I've just extended with a small comment.
Keeping you updated, Christian
[1] https://github.com/BaseXdb/basex/issues/359
On Thu, Feb 9, 2012 at 4:04 PM, Cerstin Mahlow cerstin.mahlow@unibas.ch wrote: