Guidance on Indexing

Mansi Sheth

3 Jan 2016

1:17 p.m.

Hello,

A very happy new year to all of you !!!

I have some very basic questions with indexing.

1. Most of my XQueries are of the following form:

'/Archives/descendant::apiCalls[contains(@name,"com.sun")]/@name', where apiCalls can be 3-4 levels below 'Archives'. The queries are run via REST.

Based on this, I have been using attribute indexing after each update to the DB. Is that correct? Should I have been using full-text indexing instead? Why?

2. I have thousands of documents, spread over 100 XML databases, with a total size of around 400 GB currently. Each query takes roughly 30 minutes to run. Although that performance is to be expected, I know I can do better with indexing. Currently, when I looked at one of the DBs, I saw:

open bi_output_3
Database 'bi_output_3' was opened in 38.22 ms.

info db
Database Properties
  Name: bi_output_3
  Size: 3938 MB
  Nodes: 16193129
  Documents: 35
  Binaries: 0
  Timestamp: 2016-01-03T13:40:40.000Z

Resource Properties
  Timestamp: 2016-01-03T13:40:40.776Z
  Encoding: UTF-8
  CHOP: true

Indexes
  Up-to-date: false
  TEXTINDEX: false
  ATTRINDEX: false
  FTINDEX: false
  LANGUAGE: English
  STEMMING: false
  CASESENS: false
  DIACRITICS: false
  STOPWORDS:
  UPDINDEX: false
  AUTOOPTIMIZE: false
  MAXCATS: 100
  MAXLEN: 96

When I looked at its footprint on disk:

ubuntu@<abc>/BaseXDB/bi_output_3$ ls -l
total 4032992
-rw-rw-r-- 1 ubuntu ubuntu 2209449064 Jan 1 17:00 atv.basex
-rw-rw-r-- 1 ubuntu ubuntu 4 Jan 1 16:35 atvl.basex
-rw-rw-r-- 1 ubuntu ubuntu 0 Jan 1 16:35 atvr.basex
-rw-rw-r-- 1 ubuntu ubuntu 6414 Jan 3 13:40 doc.basex
-rw-rw-r-- 1 ubuntu ubuntu 6 Jan 1 17:00 ftxx.basex
-rw-rw-r-- 1 ubuntu ubuntu 0 Jan 1 17:00 ftxy.basex
-rw-rw-r-- 1 ubuntu ubuntu 0 Jan 1 17:00 ftxz.basex
-rw-rw-r-- 1 ubuntu ubuntu 829 Jan 3 13:40 inf.basex
-rw-rw-r-- 1 ubuntu ubuntu 28 Jan 1 17:00 swl.basex
-rw-rw-r-- 1 ubuntu ubuntu 1916444672 Jan 3 13:40 tbl.basex
-rw-rw-r-- 1 ubuntu ubuntu 3796037 Jan 3 13:40 tbli.basex
-rw-rw-r-- 1 ubuntu ubuntu 45462 Jan 1 17:00 txt.basex
-rw-rw-r-- 1 ubuntu ubuntu 4 Jan 1 16:35 txtl.basex
-rw-rw-r-- 1 ubuntu ubuntu 0 Jan 1 16:35 txtr.basex
ubuntu@<abc>/BaseXDB/bi_output_3$ pwd
/veracode/msheth/BaseXDB/bi_output_3

My concern is that I use attribute indexing at each DB update, but the info command at the BaseX prompt tells me otherwise. Am I misreading something? Is there a way to fix this once a DB is created? It takes me 48 hours to create the DBs from scratch... :)

Reading about UPDINDEX and AUTOOPTIMIZE tells me to open each DB and run these commands. Is that my only option? Is there an XQuery script somewhere that I can use to do this?

Thanks, - Mansi

Christian Grün

3 Jan 2016

4:52 p.m.

Hi Mansi,

> Most of my XQueries are of the following form:
> '/Archives/descendant::apiCalls[contains(@name,"com.sun")]/@name', where apiCalls can be 3-4 levels below 'Archives'. The queries are run via REST.

The existing index structures won’t allow you to look for arbitrary substrings; see [1] for more information.

You are right, the full-text index may be a possible way out. Prefix searches can be realized via the "using wildcards" option [2]:

//*[text() contains text "abc.*" using wildcards]

Please note that the query string will always be "tokenized": if you are looking for "com.sun", you will also get results like "COM SUN!".
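To look at the tokenization described above in isolation, the following sketch may help. It assumes BaseX's ft:tokenize function and the default full-text options; the strings are only illustrations.

```xquery
(: Tokens derived from the search string; punctuation is not part of any token. :)
ft:tokenize("com.sun"),

(: The input is tokenized in the same way, so this comparison is expected to succeed. :)
"COM SUN!" contains text "com.sun"
```

If an exact match on "com.sun" is required, the full-text result would presumably need a second, stricter filter.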

> I have thousands of documents, spread over 100 XML databases, with a total size of around 400 GB currently. Each query takes roughly 30 minutes to run.
>
> My concern is that I use attribute indexing at each DB update, but the info command at the BaseX prompt tells me otherwise. Am I misreading something? Is there a way to fix this once a DB is created? It takes me 48 hours to create the DBs from scratch... :)

If UPDINDEX and AUTOOPTIMIZE are false, you will need to call "OPTIMIZE" after your updates.

If you create a new database, you can set UPDINDEX and AUTOOPTIMIZE to true. However, AUTOOPTIMIZE will get incredibly slow if you are working with gigabytes of XML data.

> Reading about UPDINDEX and AUTOOPTIMIZE tells me to open each DB and run these commands. Is that my only option? Is there an XQuery script somewhere that I can use to do this?

If your databases are called "db1" ... "db100", the following XQuery script will optimize all those databases:

for $i in 1 to 100 return db:optimize('db' || $i)

You can also create a command script [3] with XQuery:

<commands>{
  for $i in 1 to 100
  return (
    <open>{ 'db' || $i }</open>,
    <optimize/>
  )
}</commands>

You can store the result as a .bxs file and run it afterwards.
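If the databases do not follow a "db1" ... "db100" naming scheme, a variant of the first script could iterate over all databases known to the server instead. This is only a sketch: it assumes that every database returned by db:list() should be optimized, which may not be what you want on a shared instance.

```xquery
(: Optimize every database on this BaseX instance, one after another. :)
for $name in db:list()
return db:optimize($name)
```

With databases of several gigabytes each, this can run for a long time, so trying it on a small test instance first seems prudent.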

Before you create all index structures, you should probably run your queries on some smaller database instances and check out the "Query Info" panel in the GUI. It will tell you if an index is used or not.

Best, Christian

[1] http://docs.basex.org/wiki/Indexes#Value_Indexes
[2] http://docs.basex.org/wiki/Full-Text#Match_Options
[3] http://docs.basex.org/wiki/Commands#Command_Scripts

Mansi Sheth

5:54 p.m.

Thanks, Christian. As always, that was a quick and detailed response.

1. I am not 100% clear whether you are steering me towards or against full-text indexing :)

2. Yes, I am dealing with GBs of XML files. I create new databases with the Java API, using the CreateDB class. Should I be using MainOptions to set the AUTOOPTIMIZE and UPDINDEX options before each new DB is created? I didn't find any auto-optimize option in the MainOptions class; am I missing something? Since I am setting options through this method anyway, should I also set the FTINDEX or ATTRINDEX option (depending on your answer to 1) before creating each DB? I would hate to run an optimization script after each DB update (updates happen daily).

Please advise, - Mansi

Ron Katriel

7:23 p.m.

New subject: Full-Text Search with Stopwords: corner case behavior

Hi,

I noticed unexpected behavior with full-text matching using stop words. The actual code is somewhat complex (it matches CT.gov trials with sponsor studies), but I was able to distill it to a simple expression:

"Superior Laboratories" contains text { "Medical Affairs" } using stop words ( "medical", "affairs" )

“Superior Laboratories” is the name of a (made up) sponsor and “Medical Affairs” is the value of an XML element (clinical_study/overall_official/affiliation) in an actual CT.gov trial (http://clinicaltrials.gov/search?term=NCT00775398&resultsxml=true).

This expression evaluates to true because “Superior Laboratories” vacuously contains the empty string (i.e., what is left after the stop words are removed from the official affiliation).

In actuality the stop words are loaded from a file containing over 400 words. The idea is to remove frequently occurring words from sponsor names (e.g., laboratories, limited, medical, pharmaceutical, etc.) to increase the chances of matching.

Is the above behavior intentional or an artifact of the way the matching is implemented? If the former, is there a way - without removing the stop words from the file - to override this behavior in XQuery so the above match will fail?

Thanks, Ron

Christian Grün

7:41 p.m.

New subject: Full-Text Search with Stopwords: corner case behavior

Hi Ron,

> "Superior Laboratories" contains text { "Medical Affairs" } using stop words ( "medical", "affairs" )

I’m pretty sure that "true" is the right answer here. I must admit that, due to the variety of options provided by the XQFT spec, it’s often not too obvious what’s going on.

> is there a way - without removing the stop words from the file - to override this behavior in XQuery so the above match will fail?

Maybe an additional check could be used after the first 'contains text' expression. In what particular cases would you like to get 'false' as result?

Christian

Ron Katriel

8:07 p.m.

New subject: Full-Text Search with Stopwords: corner case behavior

Hi Christian,

The behavior I am looking for is getting back false whenever the text following 'contains text' is reduced to an empty string. Is there a simple way of checking that?

Thanks, Ron

Christian Grün

8:14 p.m.

New subject: Full-Text Search with Stopwords: corner case behavior

> The behavior I am looking for is getting back false whenever the text following 'contains text' is reduced to an empty string. Is there a simple way of checking that?

Hm, sounds easy, but I don’t have an easy answer to that. We should probably extend our ft:tokenize function to also take a stopword option.

What you can always do is write some additional code:

declare function local:sw($terms, $sw) {
  let $sw := file:read-text-lines($sw)
  return $terms contains text { $sw } all words
};
if(local:sw('query terms', 'sw.txt')) then ...
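One possible shape for such an additional check is sketched below. It is not the code from this message: it assumes ft:tokenize with default options (which yields lower-case tokens), a hypothetical helper local:only-stop-words, and an inline stop word list in place of the file. The list appears twice because "using stop words" only accepts string literals or a URI, not a variable.

```xquery
(: True if every token of $terms is one of the (lower-case) stop words. :)
declare function local:only-stop-words(
  $terms      as xs:string,
  $stop-words as xs:string*
) as xs:boolean {
  every $token in ft:tokenize($terms) satisfies $token = $stop-words
};

let $query := "Medical Affairs"
let $stop-words := ("medical", "affairs")
return
  (: Reject the match up front if nothing would be left to search for. :)
  not(local:only-stop-words($query, $stop-words)) and
  ("Superior Laboratories" contains text { $query }
     using stop words ("medical", "affairs"))
```

Because both tokens of the query are stop words, the first operand is false, so the whole expression is false regardless of how the full-text comparison behaves.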

Ron Katriel

8:34 p.m.

New subject: Full-Text Search with Stopwords: corner case behavior

Thanks, Christian. I will look into the solution you suggested. Will need to cache the stop words to avoid repeatedly opening the file for reading.

Ron

Ron Katriel

5 Jan 2016

10:26 a.m.

New subject: Full-Text Search with Stopwords: corner case behavior

Hi Christian,

One follow-up question. I thought stop words worked in concert with the thesaurus, but I came across a case where they do not seem to. The following query returns false

"Samsung" contains text "Samsung Bioepis Co., Ltd." using fuzzy using stop words ( "co", "ltd") using thesaurus at "thesaurus.xml"

even though the thesaurus contains the following

<entry>
  <term>Samsung Bioepis</term>
  <synonym>
    <term>Samsung</term>
    <relationship>BT</relationship>
  </synonym>
</entry>

When I add the following synonym to the entry

<synonym>
  <term>Samsung Bioepis Co., Ltd.</term>
  <relationship>USE</relationship>
</synonym>

the query matches. Am I missing something?

Thanks, Ron

Christian Grün

10:29 a.m.

New subject: Full-Text Search with Stopwords: corner case behavior

Phew… My guess is that no one has seriously looked at the interplay between stop words and the thesaurus so far ;) Maybe (lower/upper) case plays a role, too?

Ron Katriel

10:47 a.m.

New subject: Full-Text Search with Stopwords: corner case behavior

Good catch. Case appears to also play a role. The following does not match

"samsung" contains text "samsung bioepis co., ltd." using fuzzy using stop words ( "co", "ltd") using thesaurus at "thesaurus.xml"

even when the thesaurus contains the synonym "Samsung Bioepis Co., Ltd."

I tried the other way around (thesaurus in lower case, query in mixed case) and it also fails to match.

Ron

Ron Katriel

2:32 p.m.

New subject: Full-Text Search with Stopwords: corner case behavior

Christian,

Here is another strange behavior (not involving the thesaurus):

"Bayer Pharma AG" contains text "community medical associates" using stop words ("community", "medical", "associates")

returns ‘true’ while

"Bayer" contains text "community medical associates" using stop words ("community", "medical", "associates")

returns ‘false’.

Any idea why the behavior is different?

Thanks, Ron

Christian Grün

3:33 p.m.

New subject: Full-Text Search with Stopwords: corner case behavior

Hi Ron,

> Here is another strange behavior (not involving the thesaurus):

This time it’s completely due to the spec. For some reason, the word counter is incremented for both the input and the query string [1]. As a consequence, the following query returns true: "few" in the input string is skipped due to the presence of "of" in the query string:

"propagating few errors" contains text "propagating of errors" using stop words ("of")

As a result, the input string must not be shorter than the query string.

To better control the stop word behavior, it can be helpful to ignore the XQFT stop word support and do it yourself (e.g., drop all stop words from your query string before using "contains text"). In the following example, I used a map to speed up the stop word lookup:

let $input := db:open(...)
let $sw := map:merge(
  for $sw in file:read-text-lines('sw.txt')
  return map { $sw : true() }
)
let $qt := ft:tokenize("query terms")[not($sw(.))]
return $input contains text { $qt }
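A self-contained variant of the same idea, applied to the "Bayer" example from this thread, might look as follows. It is a sketch under assumptions: the stop words are given inline instead of being read from sw.txt, the input is a plain string rather than a database, and the exists() guard is an addition that makes an all-stop-word query fail instead of matching vacuously.

```xquery
(: Stop words as a map, so that membership can be tested with a lookup. :)
let $stop-words := map:merge(
  for $word in ("community", "medical", "associates")
  return map { $word : true() }
)
(: Tokenize the query and drop every token that is a stop word. :)
let $query-tokens :=
  ft:tokenize("community medical associates")[not($stop-words(.))]
return (
  count($query-tokens),
  exists($query-tokens) and ("Bayer Pharma AG" contains text { $query-tokens })
)
```

All three query tokens are stop words here, so no tokens remain and the guarded comparison is false for "Bayer Pharma AG" and "Bayer" alike.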

Christian

[1] http://www.w3.org/TR/xpath-full-text-10/#ftstopwordoption

Christian Grün

3 Jan 2016

7:31 p.m.

Hi Mansi,

> I am not 100% clear whether you are steering me towards or against full-text indexing :)

This is something you’ll have to answer yourself; it depends on the kind of queries and on your ability to store attribute values as texts.

> Yes, I am dealing with GBs of XML files. I create new databases with the Java API, using the CreateDB class. Should I be using MainOptions to set the AUTOOPTIMIZE and UPDINDEX options before each new DB is created? I didn't find any auto-optimize option in the MainOptions class; am I missing something? Since I am setting options through this method anyway, should I also set the FTINDEX or ATTRINDEX option (depending on your answer to 1) before creating each DB?

As indicated, AUTOOPTIMIZE is not a viable choice for data instances of that size. UPDINDEX may be suitable, but before creating any index structures, I advise you to first do some testing with smaller instances. Only after that will you know which index structures you need for speeding up your queries. I hope our Wiki articles on index structures and the full-text feature are helpful in that regard.

Christian

...

On Sun, Jan 3, 2016 at 4:52 PM, Christian Grün christian.gruen@gmail.com wrote:

...
Hi Mansi,

...

Most of my xqueries are of below nature

'/Archives/descendant::apiCalls[contains(@name,"com.sun")]/@name', where apiCalls could be 3-4 level under 'Archives'. Xqueries are accessed via REST

The existing index structures won’t allow you to look for arbitrary sub strings; see [1] for more information.

You are right, the full-text index may be a possibly way out. Prefix searches can be realized via the "using wildcards" option [2]:

//*[text() contains text "abc.*" using wildcards

Please note that the query string will always be "tokenized": if you are looking for "com.sun", you will also get results like "COM SUN!".

...

I have 1000s of documents, spanning over 100 XML DBs, with total space around 400 GB currently. Each query is taking roughly 30 mins to run.

My concern is that I am using attribute indexing at each DB update, but the info command at the BaseX prompt tells me otherwise. Am I misreading something? Is there a way to fix this once the DB is created? It takes me 48 hours to create the DBs from scratch... :)

If UPDINDEX and AUTOOPTIMIZE are false, you will need to call "OPTIMIZE" after your updates.

If you create a new database, you can set UPDINDEX and AUTOOPTIMIZE to true. However, AUTOOPTIMIZE will get incredibly slow if you are working with gigabytes of XML data.
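
To see which databases currently lack index structures, the output of db:info can be inspected for all databases at once. The following is a rough sketch; the element names (indexes, uptodate, attrindex, ftindex) are an assumption based on the lowercased property names shown by INFO DB, so it is worth looking at the raw db:info output of one database first.

```xquery
(: Report the index status of every database.
   Assumption: db:info() returns an <indexes> element whose children
   are named after the lowercased properties shown by INFO DB. :)
for $name in db:list()
let $indexes := db:info($name)/indexes
return
  <db name="{ $name }"
      uptodate="{ $indexes/uptodate }"
      attrindex="{ $indexes/attrindex }"
      ftindex="{ $indexes/ftindex }"/>
```

If an assumed element does not exist, the corresponding attribute is simply empty.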

...
Reading through the UPDINDEX and AUTOOPTIMIZE ALL commands tells me to open each DB and run these commands. Is that my only option? Do we have an XQuery script somewhere which I can use to do this?

If your databases are called "db1" ... "db100", the following XQuery script will optimize all those databases:

for $i in 1 to 100 return db:optimize('db' || $i)

You can also create a command script [3] with XQuery:

<commands>{ for $i in 1 to 100 return ( <open>{ 'db' || $i }</open>, <optimize/> ) }</commands>

You can store the result as a .bxs file and run it afterwards.
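
If the databases do not follow the "db1" ... "db100" naming pattern, the db:optimize loop shown earlier can be driven by db:list() instead. This is a sketch only: it assumes a BaseX version in which db:optimize accepts the $all flag and an options map, and the choice to request the attribute index here is an illustration, not a recommendation.

```xquery
(: Optimize every database, whatever its name.
   true() requests a full optimization (comparable to OPTIMIZE ALL);
   the options map asks for the attribute index to be built. :)
for $name in db:list()
return db:optimize($name, true(), map { 'attrindex': true() })
```

With databases of several gigabytes each, it may be more practical to run this for a single database name first and observe how long it takes.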

Before you create all index structures, you should probably run your queries on some smaller database instances and check out the "Query Info" panel in the GUI. It will tell you if an index is used or not.

Best, Christian

[1] http://docs.basex.org/wiki/Indexes#Value_Indexes
[2] http://docs.basex.org/wiki/Full-Text#Match_Options
[3] http://docs.basex.org/wiki/Commands#Command_Scripts

--

Mansi

Mansi

8:03 p.m.

Ok. I will do some research and experimenting and report back my experience.

Thanks, - Mansi

