Christian, I will second your description of this logic as “nonintuitive”. On the W3C's part, it seems to be driven more by efficiency concerns than by usability. Would it be possible to create a custom index structure in BaseX that would get around this limitation? If so, as you seem to suggest below, can this be done dynamically? I had difficulty following the example in [2].

Thanks,
Ron

On February 2, 2016 at 2:34:35 PM, Christian Grün (christian.gruen@gmail.com) wrote:

> Any idea why?

Yes – see one of my previous replies ;) In a nutshell: in the first
query, the stop words are dropped before matching. In the second one,
they are only ignored and keep their positions (“Tokens matched by stop
words retain their position numbers […]” [1]):

"A B C" contains text "A C" using stop words ("B")
→ false
"A B C" contains text "A B C" using stop words ("B")
→ true
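
For comparison, here is a minimal sketch of the “drop” variant applied
to the same strings. The map $stop and the helper $drop are made up for
illustration; the approach is the string-join/ft:tokenize workaround
from the first query:

```xquery
(: Sketch only: $stop is a hypothetical in-memory stop word list.
   ft:tokenize lower-cases tokens by default, hence the key 'b'. :)
let $stop := map { 'b': true() }
(: Remove stop words before matching, so no position is reserved for them :)
let $drop := function($s) {
  string-join(ft:tokenize($s)[not($stop(.))], ' ')
}
return $drop('A B C') contains text { $drop('A C') }
```

With the stop word removed on both sides, both strings reduce to the
same two tokens, so I would expect this to return true, unlike the
first query above.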

It may not have been the most intuitive decision by the designers of
the spec, but… Les jeux sont faits.

In some projects, we’ve decided to work with custom index structures
[2]. It takes a bit more work, but it gives you complete freedom over
which tokens you store.
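
As a rough sketch of what such a structure could look like (this is not
the code from our projects): the query below stores, for each text node,
a key made of its tokens with the stop words removed. The database names
'db' and 'index', the element names index/entry/id and the file
'stopwords.txt' are assumptions for illustration. db:open and db:open-id
are the function names I would use in current versions; later releases
may name them differently.

```xquery
(: Assumptions: a database 'db' exists; 'stopwords.txt' holds one
   stop word per line; the target database name 'index' is arbitrary. :)
let $sw := map:merge(
  for $w in file:read-text-lines('stopwords.txt')
  return map { $w : true() }
)
let $index :=
  <index>{
    for $node in db:open('db')//text()
    (: Normalize: tokenize, drop stop words, re-join with single spaces :)
    let $key := string-join(ft:tokenize($node)[not($sw(.))], ' ')
    where $key
    group by $key
    return <entry key='{ $key }'>{
      (: After grouping, $node holds all text nodes sharing this key :)
      for $n in $node
      return <id>{ db:node-id($n) }</id>
    }</entry>
  }</index>
return db:create('index', $index, 'index.xml')
```

A lookup would then normalize the search string in the same way and
resolve the stored node ids:

```xquery
let $sw := map:merge(
  for $w in file:read-text-lines('stopwords.txt')
  return map { $w : true() }
)
let $query := 'Frontier Science and Technology Research Foundation, Inc.'
let $key := string-join(ft:tokenize($query)[not($sw(.))], ' ')
for $id in db:open('index')//entry[@key = $key]/id
return db:open-id('db', $id)
```

Since the index is just another database, it can be rebuilt or updated
by a query whenever the source data or the stop word list changes.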

Hope this helps,
Christian

[1] https://www.w3.org/TR/xpath-full-text-10/#ftstopwordoption
[2] http://docs.basex.org/wiki/Indexes#Custom_Index_Structures


On Tue, Feb 2, 2016 at 6:56 PM, Ron Katriel <rkatriel@mdsol.com> wrote:
> Thanks, Christian. You are right about the tokenization of ampersands.
> However, I still see unexpected behavior with the built-in stop words.
>
> 1. This works (using your clever stop word workaround, slightly modified
> with string-join):
>
> let $sw := map:merge(
> for $sw in file:read-text-lines('stopwords.txt')
> return map { $sw : true() }
> )
>
> let $t1 := 'Frontier Science &amp; Technology Research Foundation, Inc.'
> let $t2 := 'Frontier Science and Technology Research Foundation, Inc.'
> let $q1 := string-join(ft:tokenize($t1)[not($sw(.))], ' ')
> let $q2 := string-join(ft:tokenize($t2)[not($sw(.))], ' ')
> where $q1 contains text { $q2 }
> return <r> { <q1> { $q1 } </q1>, <q2> { $q2 } </q2> } </r>
>
> 2. This fails:
>
> let $t1 := 'Frontier Science &amp; Technology Research Foundation, Inc.'
> let $t2 := 'Frontier Science and Technology Research Foundation, Inc.'
> where $t1 contains text { $t2 } using stop words at 'stopwords.txt' or
> $t2 contains text { $t1 } using stop words at 'stopwords.txt'
> return <r> { <q1> { $t1 } </q1>, <q2> { $t2 } </q2> } </r>
>
> Any idea why?
>
> Thanks,
> Ron
>
> On February 2, 2016 at 12:13:14 PM, Christian Grün
> (christian.gruen@gmail.com) wrote:
>
> Hi Ron,
>
> I’m pretty sure that the default tokenizer discards the ampersand and
> doesn’t pass it on as token at all.
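>
> A quick way to check this (a sketch; the sample string is taken from
> your query):
>
> ```xquery
> (: List the tokens the default tokenizer produces for the sample string :)
> ft:tokenize('Frontier Science &amp; Technology Research Foundation, Inc.')
> ```
>
> If the ampersand does not show up in the result, a thesaurus or stop
> word entry for it has nothing to match against.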
>
> Hope this helps (…at least for understanding the query result),
> Christian
>
>
>
> On Tue, Feb 2, 2016 at 6:10 PM, Ron Katriel <rkatriel@mdsol.com> wrote:
>> Hi,
>>
>> Given this thesaurus entry
>>
>> <thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
>> <entry>
>> <term>&amp;</term>
>> <synonym>
>> <term>and</term>
>> <relationship>USE</relationship>
>> </synonym>
>> </entry>
>> </thesaurus>
>>
>> I was expecting the following query to return true (file path omitted for
>> clarity)
>>
>> 'Frontier Science and Technology Research Foundation, Inc.' contains text
>> 'Frontier Science &amp; Technology Research Foundation, Inc.' using
>> thesaurus at "thesaurus.xml"
>>
>> but it returns false. Switching the order of the term and synonym makes no
>> difference.
>>
>> I tried getting around this using a stop word file (which includes ‘and’,
>> ‘&’, and ‘&amp;’, just in case) but it does not work either.
>>
>> Am I missing something?
>>
>> Thanks,
>> Ron
>>