Dear List,
I have a query (included below) that uses full-text search and is fairly slow in BaseX 7.9 (5-8 seconds). Is there a query cache implemented in BaseX? I ask because the GUI seems to keep one: the same query drops from 5000 ms to 204 ms when I run it a second time.
The query looks like this:
declare variable $term as xs:string external;
declare variable $col as xs:string external;
<results>{
  subsequence(
    ft:mark(
      for $x in collection($col)//entry
      where $x//text()[. contains text {$term} using wildcards]
      order by fn:lower-case(fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+', ''))
        collation "?lang=ga"
      return $x
    ),
    1, 5000
  )
}</results>
Thanks for any help you can provide.
Chris Yocum
Hi Chris,
there are various caches involved when evaluating queries, but I can't see for the given query where a cache may be utilized. However, your query may be evaluated faster if you simplify the nested where clause:
<results>{
  subsequence(
    ft:mark(
      for $x in collection($col)//entry
      where $x//text() contains text { $term } using wildcards
      order by fn:lower-case(
        fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+', '')
      ) collation "?lang=ga"
      return $x
    ),
    1, 5000
  )
}</results>
You could also use a predicate with position(); it may be evaluated faster than subsequence (I'm not sure, though, because most of the time will probably be spent on ordering all results):
<results>{
  ft:mark(
    for $x in collection($col)//entry
    where $x//text() contains text { $term } using wildcards
    order by fn:lower-case(
      fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+', '')
    ) collation "?lang=ga"
    return $x
  )[position() = 1 to 5000]
}</results>
Could you please open the InfoView in the GUI, execute the query again and check if the full-text index is applied?
Christian
On Wed, Aug 13, 2014 at 12:02 PM, Christopher Yocum cyocum@gmail.com wrote:
declare variable $term as xs:string external; declare variable $col as xs:string external; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x//text()[. contains text {$term} using wildcards] order by fn:lower-case(fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+','')) collation "?lang=ga" return $x), 1, 5000)}</results>
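As a small illustration of the two truncation styles Christian mentions, the following sketch applies both to a plain integer sequence instead of the thread's database. The sequence 1 to 10 is made up for the example; both expressions keep the first three items, so the result is 1, 2, 3 twice.

```xquery
(: Two equivalent ways to keep the first three items of a sequence. :)
let $items := 1 to 10
return (
  (: function call: start position 1, length 3 :)
  subsequence($items, 1, 3),
  (: positional predicate: keep items whose position is in 1 to 3 :)
  $items[position() = 1 to 3]
)
```

Which form is faster in a given BaseX version is not something this sketch can show; as Christian notes, the ordering step probably dominates either way.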
Dear Christian,
Thank you for the info. I will make the changes you suggest and I will let you know the information from the InfoView when I get the chance.
All the best, Chris
Dear Christian,
I have run the query on the server and I obtained this query plan:
Query plan:
<QueryPlan>
  <CElem>
    <QNm value="results" type="xs:QName"/>
    <FNSeq name="subsequence(items,start[,len])">
      <FNFt name="mark(nodes[,tag])">
        <GFLWOR>
          <For>
            <Var name="$x" id="0"/>
            <CachedPath>
              <FTIndexAccess data="edil">
                <FTWords>
                  <Str value="athgab.*" type="xs:string"/>
                </FTWords>
              </FTIndexAccess>
              <IterStep axis="ancestor" test="*:entry"/>
            </CachedPath>
          </For>
          <OrderBy>
            <Key dir="ascending" empty="least">
              <FNStr name="lower-case(string)">
                <FNPat name="replace(string,pattern,replace[,mod])">
                  <IterPosFilter>
                    <CachedPath>
                      <VarRef>
                        <Var name="$x" id="0"/>
                      </VarRef>
                      <IterStep axis="descendant-or-self" test="node()"/>
                      <IterPosStep axis="child" test="orth">
                        <Pos min="1" max="1"/>
                      </IterPosStep>
                      <IterStep axis="child" test="text()"/>
                    </CachedPath>
                    <Pos min="1" max="1"/>
                  </IterPosFilter>
                  <Str value="\p{P}|\d+" type="xs:string"/>
                  <Str value="" type="xs:string"/>
                </FNPat>
              </FNStr>
            </Key>
          </OrderBy>
          <VarRef>
            <Var name="$x" id="0"/>
          </VarRef>
        </GFLWOR>
      </FNFt>
      <Int value="1" type="xs:integer"/>
      <Int value="5000" type="xs:integer"/>
    </FNSeq>
  </CElem>
</QueryPlan>
I ran the same query on my laptop in the GUI and got this in the Query Info. I couldn't find the "InfoView" so I hope this helps.
Compiling:
- pre-evaluating fn:collection("edil")
- simplifying descendant-or-self step(s)
- converting descendant::*:entry to child steps
- simplifying descendant-or-self step(s)
- removing context expression (.)
- rewriting where clause(s)
Query:
declare variable $term as xs:string external := 'athgab.*'; declare variable $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x//text()[. contains text {$term} using wildcards] order by fn:lower-case(fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+','')) collation "?lang=ga" return $x), 1, 5000)}</results>
Optimized Query:
element results { (fn:subsequence(ft:mark(for $x_0 in (db:open-pre("edil",0), db:open-pre("edil",395952), ...)/*:sample/*:entry[descendant::text()[. contains text "athgab.*" using wildcards using language 'English']] order by fn:lower-case(fn:replace($x_0/descendant-or-self::node()/orth[1]/text()[1], "\p{P}|\d+", "")) empty least collation "http://basex.org/collation?lang=ga" return $x_0), 1, 5000)) }
Result:
- Hit(s): 1 Item
- Updated: 0 Items
- Printed: 2048 KB
- Read Locking: global
- Write Locking: none
Timing:
- Parsing: 1.95 ms
- Compiling: 21.41 ms
- Evaluating: 4637.3 ms
- Printing: 76.31 ms
- Total Time: 4736.97 ms
Query plan:
<QueryPlan>
  <CElem>
    <QNm value="results" type="xs:QName"/>
    <FNSeq name="subsequence(items,start[,len])">
      <FNFt name="mark(nodes[,tag])">
        <GFLWOR>
          <For>
            <Var name="$x" id="0"/>
            <IterPath>
              <DBNodeSeq size="19">
                <DBNode name="edil" pre="0"/>
                <DBNode name="edil" pre="395952"/>
                <DBNode name="edil" pre="690511"/>
                <DBNode name="edil" pre="898347"/>
                <DBNode name="edil" pre="1054095"/>
              </DBNodeSeq>
              <IterStep axis="child" test="*:sample"/>
              <IterStep axis="child" test="*:entry">
                <IterPath>
                  <IterStep axis="descendant" test="text()">
                    <FTContainsExpr>
                      <Context/>
                      <FTWords>
                        <Str value="athgab.*" type="xs:string"/>
                      </FTWords>
                    </FTContainsExpr>
                  </IterStep>
                </IterPath>
              </IterStep>
            </IterPath>
          </For>
          <OrderBy>
            <Key dir="ascending" empty="least">
              <FNStr name="lower-case(string)">
                <FNPat name="replace(string,pattern,replace[,mod])">
                  <IterPosFilter>
                    <CachedPath>
                      <VarRef>
                        <Var name="$x" id="0"/>
                      </VarRef>
                      <IterStep axis="descendant-or-self" test="node()"/>
                      <IterPosStep axis="child" test="orth">
                        <Pos min="1" max="1"/>
                      </IterPosStep>
                      <IterStep axis="child" test="text()"/>
                    </CachedPath>
                    <Pos min="1" max="1"/>
                  </IterPosFilter>
                  <Str value="\p{P}|\d+" type="xs:string"/>
                  <Str value="" type="xs:string"/>
                </FNPat>
              </FNStr>
            </Key>
          </OrderBy>
          <VarRef>
            <Var name="$x" id="0"/>
          </VarRef>
        </GFLWOR>
      </FNFt>
      <Int value="1" type="xs:integer"/>
      <Int value="5000" type="xs:integer"/>
    </FNSeq>
  </CElem>
</QueryPlan>
I hope this is enough information for you to help me. If I run the query twice in the GUI, the execution time usually halves.
Hi Chris,
<FTIndexAccess data="edil">
This indicates that the full-text index is indeed used here.
I ran the same query on my laptop in the GUI and got this in the Query Info. I couldn't find the "InfoView" so I hope this helps.
Sorry, I was referring to the Query Info, which you found anyway.
Compiling:
- pre-evaluating fn:collection("edil")
...
I'm missing the info "applying full-text index..." here, probably because you need to rewrite your code from...
collection($col)//entry where $x//text()[. contains text {$term} using wildcards]
...to...
collection($col)//entry where $x//text() contains text {$term} using wildcards
Could you try this and report back on the performance?
Christian
Hi Christian,
Thank you very much for your quick reply. Yes, this seems to be much faster and I see "applying full text index".
Compiling:
- pre-evaluating fn:collection("edil")
- simplifying descendant-or-self step(s)
- converting descendant::*:entry to child steps
- simplifying descendant-or-self step(s)
- removing context expression (.)
- applying full-text index
- rewriting where clause(s)
Query:
declare variable $term as xs:string external := 'athgab.*'; declare variable $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x//text() contains text {$term} using wildcards order by fn:lower-case(fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+','')) collation "?lang=ga" return $x), 1, 5000)}</results>
Optimized Query:
element results { (fn:subsequence(ft:mark(for $x_0 in ft:search("edil", "athgab.*")/ancestor::*:entry order by fn:lower-case(fn:replace($x_0/descendant-or-self::node()/orth[1]/text()[1], "\p{P}|\d+", "")) empty least collation "http://basex.org/collation?lang=ga" return $x_0), 1, 5000)) }
Result:
- Hit(s): 1 Item
- Updated: 0 Items
- Printed: 2048 KB
- Read Locking: global
- Write Locking: none
Timing:
- Parsing: 3.32 ms
- Compiling: 32.89 ms
- Evaluating: 1365.49 ms
- Printing: 71.55 ms
- Total Time: 1473.25 ms
Query plan:
<QueryPlan>
  <CElem>
    <QNm value="results" type="xs:QName"/>
    <FNSeq name="subsequence(items,start[,len])">
      <FNFt name="mark(nodes[,tag])">
        <GFLWOR>
          <For>
            <Var name="$x" id="0"/>
            <CachedPath>
              <FTIndexAccess data="edil">
                <FTWords>
                  <Str value="athgab.*" type="xs:string"/>
                </FTWords>
              </FTIndexAccess>
              <IterStep axis="ancestor" test="*:entry"/>
            </CachedPath>
          </For>
          <OrderBy>
            <Key dir="ascending" empty="least">
              <FNStr name="lower-case(string)">
                <FNPat name="replace(string,pattern,replace[,mod])">
                  <IterPosFilter>
                    <CachedPath>
                      <VarRef>
                        <Var name="$x" id="0"/>
                      </VarRef>
                      <IterStep axis="descendant-or-self" test="node()"/>
                      <IterPosStep axis="child" test="orth">
                        <Pos min="1" max="1"/>
                      </IterPosStep>
                      <IterStep axis="child" test="text()"/>
                    </CachedPath>
                    <Pos min="1" max="1"/>
                  </IterPosFilter>
                  <Str value="\p{P}|\d+" type="xs:string"/>
                  <Str value="" type="xs:string"/>
                </FNPat>
              </FNStr>
            </Key>
          </OrderBy>
          <VarRef>
            <Var name="$x" id="0"/>
          </VarRef>
        </GFLWOR>
      </FNFt>
      <Int value="1" type="xs:integer"/>
      <Int value="5000" type="xs:integer"/>
    </FNSeq>
  </CElem>
</QueryPlan>
Hi Christian,
Apologies for bringing this back up, but if I use "using diacritics insensitive" in the full-text search, it seems to turn the full-text index off. I have DIACRITICS set to true on the database. I am just surprised to see the diacritics option causing the full-text index to be bypassed.
All the best, Chris
Hi Chris,
as you already noted, the full-text index will only be utilized with the options that you chose when creating the index. If you want to do more fine-grained searches, it's usually advisable to choose the most general options when creating the index (case insensitive, diacritics insensitive, etc.) and then refine the results in a second step. This can, for example, look as follows:
declare function local:search($db, $terms) {
  for $result in db:open($db)//*[text() contains text { $terms }]
  return $result[text() contains text { $terms } using case sensitive]
};
local:search('factbook', ('German', 'English'))
Hope this helps, Christian
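Applied to the diacritics case discussed in this thread, Christian's two-step idea might look like the sketch below. It assumes a database whose full-text index was built with diacritics disabled (that is, insensitive); `local:search-exact` is a hypothetical helper name, and 'edil' and the search term are taken from the thread purely for illustration.

```xquery
(: Sketch only: assumes the full-text index of $db was created with
   the diacritics option switched off (diacritics insensitive). :)
declare function local:search-exact($db as xs:string, $terms as xs:string*) {
  (: Step 1: broad match with default options, which can use the index. :)
  for $hit in db:open($db)//*[text() contains text { $terms }]
  (: Step 2: refine the candidates with the stricter option. :)
  return $hit[text() contains text { $terms } using diacritics sensitive]
};
local:search-exact('edil', 'athgabáil')
```

The first predicate keeps the candidate set small, so the stricter second predicate only has to be evaluated on nodes that already matched.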
Hi Christian,
Thank you again for all your help. Unfortunately, my documents are multi-language and multi-diacritics, so my users expect athgabáil, athgabail, and athgabāil to match as the same word. They also want wildcard searching to work in the same way. This seems to mean that BaseX's full-text index would have to be extended or restructured in some way to support "using diacritics insensitive" together with "using wildcards". At the moment I cannot think of how to break the two into separate searches as you suggest. Maybe it will come to me later today.
At the moment the query looks like this and it does not use the full text index:
declare variable $term as xs:string external := 'athgab.*';
declare variable $col as xs:string external := 'edil';
<results>{
  subsequence(
    ft:mark(
      for $x in collection($col)//entry
      where $x//text() contains text {$term} using wildcards using diacritics insensitive
      order by fn:lower-case(fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+', ''))
        collation "?lang=ga"
      return $x
    ),
    1, 5000
  )
}</results>
If anyone has any suggestions, I would be grateful.
All the best, Chris
Hi Chris,
sorry for keeping you waiting, I've been offline over the weekend.
Thank you again for all your help. Unfortunately, my documents are multi-language and multi-diacritics so my users expect it to match athgabáil, athgabail, and athgabāil as the same word. They also want wildcard searching to work in the same way.
This should be no problem, even with the full-text default settings. An example: the following query...
/descendant::*[text() contains text 'athgabāi.*' using diacritics insensitive using wildcards]
...will give you three results for the following document...
<xml>
  <term>athgabáil</term>
  <term>athgabail</term>
  <term>athgabāil</term>
</xml>
...and the results will be retrieved by the full-text index, using the default settings:
- applying full-text index for "athgabāi.*" using wildcards using language 'English'
The solution that I mentioned in my last mail is required if you want to do both diacritics sensitive and insensitive search.
Does this help? Christian
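To try Christian's example without creating a database, the sample document can be built inline, as in the sketch below. Because the document lives only in main memory here, the query is evaluated without a full-text index; it illustrates only how the two match options combine. According to Christian's message, all three term elements should be returned.

```xquery
(: Christian's sample document, constructed inline for experimentation. :)
let $doc := document {
  <xml>
    <term>athgabáil</term>
    <term>athgabail</term>
    <term>athgabāil</term>
  </xml>
}
(: Diacritics are ignored and '.*' acts as a wildcard suffix. :)
return $doc/descendant::*[
  text() contains text 'athgabāi.*'
    using diacritics insensitive
    using wildcards
]
```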
Hi Christian,
I hope you had a good weekend!
Otherwise, no, this doesn't help as it doesn't choose to use the full text index on my content :(. This is what I am getting at the moment:
Compiling:
- pre-evaluating fn:collection("edil")
- simplifying descendant-or-self step(s)
- converting descendant::*:entry to child steps
- simplifying descendant-or-self step(s)
- removing context expression (.)
- rewriting where clause(s)
- simplifying flwor expression
Query: declare variable $term as xs:string external := 'athgabāi.*'; declare variable $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x//text() contains text {$term} using diacritics insensitive using wildcards return $x), 1, 5000)}</results>
Optimized Query: element results { (fn:subsequence(ft:mark((db:open-pre("edil",0), db:open-pre("edil",155748), ...)/*:sample/*:entry[descendant::text() contains text "athgabāi.*" using wildcards using language 'English']), 1, 5000)) }
I tried this as well with the same results:
Compiling:
- pre-evaluating fn:collection("edil")
- simplifying descendant-or-self step(s)
- converting descendant::*:entry to child steps
- removing context expression (.)
- rewriting where clause(s)
- simplifying flwor expression
Query: declare variable $term as xs:string external := 'athgabāi.*'; declare variable $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x/descendant::*[text() contains text 'athgabāi.*' using diacritics insensitive using wildcards] return $x), 1, 5000)}</results>
Optimized Query: element results { (fn:subsequence(ft:mark((db:open-pre("edil",0), db:open-pre("edil",155748), ...)/*:sample/*:entry[descendant::*[text() contains text "athgabāi.*" using wildcards using language 'English']]), 1, 5000)) }
These are the options set on the database:
Database Properties
  Name: edil
  Size: 194 MB
  Nodes: 7951662
  Documents: 19
  Binaries: 0
  Timestamp: 2014-08-15-17-00-29
Resource Properties
  Input Path: /home/cyocum/temp/edil_src/xml_src
  Input Size: 87 MB
  Timestamp: 2014-08-15-16-46-31
  Encoding: UTF-8
  CHOP: true
Indexes
  Up-to-date: true
  TEXTINDEX: true
  ATTRINDEX: true
  FTINDEX: true
  LANGUAGE:
  STEMMING: false
  CASESENS: false
  DIACRITICS: true
  STOPWORDS:
  UPDINDEX: false
  MAXCATS: 100
  MAXLEN: 96
I hope this helps.
All the best, Chris
Hi Chris,
DIACRITICS: true
It seems as if you set the diacritics option to true (which is equivalent to "diacritics sensitive", as it is supposed to say "consider diacritics: yes, please!"). Could you try to rebuild the index with the diacritics option disabled?
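A minimal sketch of such a rebuild from XQuery, assuming a BaseX version in which db:optimize accepts an options map as its third argument and in which map constructors are written with ':' (both are assumptions about the installed version, not something confirmed in this thread). The database name 'edil' is taken from the database properties quoted below.

```xquery
(: Rebuild all index structures of the database and create the
   full-text index with diacritics ignored (DIACRITICS = false).
   Assumption: db:optimize supports the third $options argument. :)
db:optimize(
  'edil',
  true(),
  map { 'ftindex': true(), 'diacritics': false() }
)
```

After such a rebuild, a query with "using diacritics insensitive" asks for the same behaviour the index was built with, so the optimizer is free to use the index.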
Christian
On Tue, Aug 19, 2014 at 2:19 PM, Christopher Yocum cyocum@gmail.com wrote:
Hi Christian,
I hope you had a good weekend!
Otherwise, no, this doesn't help as it doesn't choose to use the full text index on my content :(. This is what I am getting at the moment:
Compiling:
- pre-evaluating fn:collection("edil")
- simplifying descendant-or-self step(s)
- converting descendant::*:entry to child steps
- simplifying descendant-or-self step(s)
- removing context expression (.)
- rewriting where clause(s)
- simplifying flwor expression
Query: declare variable $term as xs:string external := 'athgabāi.*'; declare variable $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x//text() contains text {$term} using diacritics insensitive using wildcards return $x), 1, 5000)}</results>
Optimized Query: element results { (fn:subsequence(ft:mark((db:open-pre("edil",0), db:open-pre("edil",155748), ...)/*:sample/*:entry[descendant::text() contains text "athgabāi.*" using wildcards using language 'English']), 1, 5000)) }
I tried this as well with the same results:
Compiling:
- pre-evaluating fn:collection("edil")
- simplifying descendant-or-self step(s)
- converting descendant::*:entry to child steps
- removing context expression (.)
- rewriting where clause(s)
- simplifying flwor expression
Query: declare variable $term as xs:string external := 'athgabāi.*'; declare variable $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x/descendant::*[text() contains text 'athgabāi.*' using diacritics insensitive using wildcards] return $x), 1, 5000)}</results>
Optimized Query: element results { (fn:subsequence(ft:mark((db:open-pre("edil",0), db:open-pre("edil",155748), ...)/*:sample/*:entry[descendant::*[text() contains text "athgabāi.*" using wildcards using language 'English']]), 1, 5000)) }
These are the options set on the database:
Database Properties
  Name: edil
  Size: 194 MB
  Nodes: 7951662
  Documents: 19
  Binaries: 0
  Timestamp: 2014-08-15-17-00-29
Resource Properties
  Input Path: /home/cyocum/temp/edil_src/xml_src
  Input Size: 87 MB
  Timestamp: 2014-08-15-16-46-31
  Encoding: UTF-8
  CHOP: true
Indexes
  Up-to-date: true
  TEXTINDEX: true
  ATTRINDEX: true
  FTINDEX: true
  LANGUAGE:
  STEMMING: false
  CASESENS: false
  DIACRITICS: true
  STOPWORDS:
  UPDINDEX: false
  MAXCATS: 100
  MAXLEN: 96
I hope this helps.
All the best, Chris
Hi Christian,
Yes, that seems to make it work correctly. Maybe the wiki needs to be updated to be more clear about what "diacritics true" does? Apologies for the misunderstanding on my part.
All The Best, Chris
Hi Chris,
Yes, that seems to make it work correctly. Maybe the wiki needs to be updated to be more clear about what "diacritics true" does?
I have slightly updated the text entries in our Wiki [1]. You are invited to register for the Wiki and update the text if you believe it could be further improved.
Besides that, I am glad to report that I have made our query optimizer a bit smarter. With the latest snapshot [2], your original query with the additional predicate will now be automatically rewritten to the second version, and will also be rewritten to take advantage of the full-text index.
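On versions without this rewrite, the index can also be addressed explicitly. The following is a sketch only, assuming that ft:search with an options map is available in the installed version and that map constructors are written with ':'. ft:search always evaluates against the full-text index of the named database, so the match behaviour follows the options the index was built with. The variable defaults are copied from the queries quoted in this thread, and ft:mark is left out for brevity.

```xquery
declare variable $term as xs:string external := 'athgabāi.*';
declare variable $col  as xs:string external := 'edil';

<results>{
  subsequence(
    (: ft:search returns the matching text nodes straight from the
       full-text index; the path step climbs to the enclosing entries :)
    ft:search($col, $term, map { 'wildcards': true() })/ancestor::*:entry,
    1, 5000
  )
}</results>
```

Because the path expression returns each entry only once and in document order, an entry with several matching text nodes is not duplicated in the result.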
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Options#Full-Text
[2] http://files.basex.org/releases/latest/
On Thu, Aug 21, 2014 at 11:00:52PM +0200, Christian Grün wrote:
Hi Chris,
Yes, that seems to make it work correctly. Maybe the wiki needs to be updated to be more clear about what "diacritics true" does?
I have slightly updated the text entries in our Wiki [1]. You are invited to register for the Wiki and update the text if you believe it could be further improved.
Thanks!
Beside that, I am glad to report that I have made our query optimizer a bit smarter. With the latest snapshot [2], your original query with the additional predicate will now be automatically rewritten to the second version, and will also be rewritten to take advantage of the full-text index.
Fantastic. Thank you very much for all of your efforts.
Chris