index features - BaseX-Talk - mailman.uni-konstanz.de

17 Jan 2012


Hi,
I have three questions concerning working with the full-text index:
The first question is about "distance" information. Given this query:
(1) 'contains text "Kopf Sand Stecken" all words using stemming using  
language "de"'
This returns the same result as:
(2) 'contains text ("Nase" ftand "Sand" ftand "stecken") using  
stemming using language "de"'
Both queries return 4 nodes.
If I want to find the query terms within a certain distance and add
'distance at most 10 words'
I get 2 nodes for (1) (a subset of the 4 from the first run), but I
still get all 4 nodes for (2). The distance information does not seem
to be taken into account. For my application this is not a problem,
since I have to use the "ftand" variant anyway to get proper marking,
but in general this looks strange.
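A minimal, self-contained way to compare the two variants might look like the sketch below. The element names, ids, and sentences are invented for illustration and are not taken from my collection. In the "far" paragraph, 13 words separate "Kopf" and "Sand", so by my reading of the distance filter only "near" should qualify in both variants.

```xquery
(: Hypothetical test data; element names, ids and sentences are invented. :)
let $doc :=
  <doc>
    <p id="near">Man soll den Kopf nicht in den Sand stecken.</p>
    <p id="far">Der Kopf tut weh, und erst viele, viele, viele, viele, viele,
      viele Wörter später kommen Sand und stecken vor.</p>
  </doc>
return (
  (: variant (1): a single string with "all words" :)
  $doc/p[text() contains text "Kopf Sand stecken" all words
    using stemming using language "de"
    distance at most 10 words]/@id/string(),
  "---",
  (: variant (2): the same terms combined with ftand :)
  $doc/p[text() contains text ("Kopf" ftand "Sand" ftand "stecken")
    using stemming using language "de"
    distance at most 10 words]/@id/string()
)
```

If the ids listed before and after the "---" separator differ, the behaviour can be reproduced without the 3 GB collection.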
The second question is about "ftand" and "ftor". If I try these queries:
(3) 'contains text ("Nase" ftand "Sand" ftand "stecken") using  
stemming using language "de" distance at most 10 words'
(4) 'contains text ("Kopf" ftand "Sand" ftand "stecken") using  
stemming using language "de" distance at most 10 words'
I get 2 hits for (3) and 11 for (4). So I assumed I would get 13 hits
(the ones from (3) plus the ones from (4)) when changing the query to:
(5) 'contains text (("Nase" ftor "Kopf") ftand "Sand" ftand "stecken")  
using stemming using language "de" distance at most 10 words'
However, I get only 6 hits -- none of them containing "Nase" (it makes
no difference whether the query starts with '"Nase" ftor "Kopf"' or
with '"Kopf" ftor "Nase"').
Did I mess something up?
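The expectation can be written down as a small check, sketched here on invented data (the paragraphs and variable names are hypothetical). Note that 13 is only the expected count if no node matches both (3) and (4); the node union below also covers the case of an overlap.

```xquery
(: Hypothetical test data; the paragraphs are invented for illustration. :)
let $doc :=
  <doc>
    <p>Man soll den Kopf nicht in den Sand stecken.</p>
    <p>Er wollte die Nase in den Sand stecken.</p>
    <p>Der Sand ist warm.</p>
  </doc>
(: query (3): "Nase" variant :)
let $nase := $doc/p[text() contains text
  ("Nase" ftand "Sand" ftand "stecken")
  using stemming using language "de" distance at most 10 words]
(: query (4): "Kopf" variant :)
let $kopf := $doc/p[text() contains text
  ("Kopf" ftand "Sand" ftand "stecken")
  using stemming using language "de" distance at most 10 words]
(: query (5): both variants combined with ftor :)
let $both := $doc/p[text() contains text
  (("Nase" ftor "Kopf") ftand "Sand" ftand "stecken")
  using stemming using language "de" distance at most 10 words]
return
  <counts nase="{ count($nase) }" kopf="{ count($kopf) }"
          union="{ count($nase | $kopf) }" combined="{ count($both) }"/>
```

As I understand "ftor", the values of "union" and "combined" should be equal; on my collection they apparently are not.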
The third question is about the full-text index itself. When applying
fuzzy search or using wildcards, the full-text index is not applied,
which results in a timeout on my website. In the GUI, I need
341859.09 ms for applying
'ft:mark (//*[text() contains text ("Korb" ftand "geben") using  
fuzzy][self::*:p or self::*:l])'
to my 3 GB collection. The information at the "Full-Text" tab says:
- Structure: Trie
- Stemming: ON
- Case Sensitivity: ON
- Diacritics: ON
- Language: German
- Size: 1 GB
- Entries: 1743744
I also created the full-text index with the option "Support Wildcards",
but this is not shown in the database properties.
When creating the index, "SET WILDCARDS true" is shown. I used
stemming, case sensitivity, diacritics, and wildcards -- is this a
combination that is not recommended?
Thank you very much in advance
Cerstin
-- 
Dr. phil. Cerstin Mahlow

Universität Basel
Deutsches Seminar
Nadelberg 4
4051 Basel
Schweiz

Tel:  +41 61 267 07 65
Fax: +41 61 267 34 40
Mail: cerstin.mahlow@unibas.ch
Web: http://www.oldphras.net
