Good morning--I hope everyone in Europe is safe and well.
This week I've been working on improving the speed of full-text searches on our website, and I wanted to share what I've found to be helpful and not so helpful.
Our project: We run a website of finding aids from 47 archival repositories in the western US. Archivists submit their finding aids as EADs through a CMS that indexes each document for full-text searching in BaseX, with other custom indexes for facets like subject, material type, etc. https://archiveswest.orbiscascade.org/
Our issue: We need to perform full-text searches that match documents with all of the terms in the query. The BaseX full-text index and ft:search() function work at the text node level. So we need to use the mode "any word," then group by a unique identifier in the root and run an ft:contains() with the mode "all words," like this:
for $db_id in tokenize($d, '|')
for $result in ft:search($db_id, $terms, map{'mode':'any word','fuzzy':$f})
let $ead := $result/ancestor::ead
let $ark := normalize-space($ead/eadheader/eadid/@identifier)
(: Other node values fetched here :)
group by $ark
where ft:contains(string-join($result, ' '), $terms, map{'mode':'all words','fuzzy':$f})
(: Custom ranking calculations based on other node values performed here :)
return <ark>{$ark}</ark>
Most users of the site are interested only in their own repositories, and they use only a few words to find known collections, so most of the time results are delivered in less than 5 seconds. If users wanted to search for multiple terms across all 47 repositories, though, the search was taking 30-40 seconds.
What I DIDN'T find helpful: Modifying the XQuery
1. Performing a separate search for each term and mapping to fetch with db:open-id() later, instead of searching for all terms at once and grouping. This was slightly slower:
for $db_id in tokenize($d, '|')
let $results := fold-left( (: Get the ft:search() results as a map of node IDs and results :) )
for $node_id in map:keys($results)
let $ead := db:open-id($db_id, $node_id)
(: Ranking etc. :)
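One possible shape of the elided fold-left, shown purely as an illustrative sketch (it assumes $terms, $db_id, and $f as above, and is not the production code):

let $results := fold-left($terms, map{},
  function($acc, $term) {
    fold-left(
      ft:search($db_id, $term, map{'fuzzy': $f}),
      $acc,
      function($m, $hit) {
        let $id := db:node-id($hit)
        (: append this hit to any hits already stored under the same node ID :)
        return map:put($m, $id, ($m($id), $hit))
      }
    )
  })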
2. For multi-term queries, creating a temporary database to search instead of the complete full-text indexes of all databases (a sketch of step c follows below):
a. Create a database like temp123456 with a full-text index.
b. For the first term, loop through all databases and add the matching documents to temp123456 with db:add($temp_id, db:path($ead)), then optimize.
c. For subsequent terms, map the results of a full-text search of temp123456 instead of looping through all 47 databases, then loop through all EADs in the temp database and remove the ones that don't match with...
if (not(exists($temp_results(db:node-id($ead))))) then db:delete($temp_id, db:path($ead))
d. For the final term, return the usual ranked <ark/> values, then drop the database.
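For illustration, step c could be expressed as a single updating query like the following sketch, assuming $temp_id names the temporary database and $temp_results is the map of node IDs produced by the full-text search for the current term:

for $ead in db:open($temp_id)//ead
where not(map:contains($temp_results, db:node-id($ead)))
return db:delete($temp_id, db:path($ead))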
This didn't work because optimizing the temporary databases took a long time--25-30 seconds for a term with 1K document results, according to the BaseX logs. It would only work in theory if entries from the existing full-text indexes could be carried over for subsequent terms, instead of the temporary index being rebuilt from scratch each time.
What I DID find helpful: Changing the server
We use Amazon Web Services, so I experimented with putting the project on different instance types: General, Memory-Optimized, and Compute-Optimized.
The contenders:
- General t2.medium: 2 vCPU, 4 GB memory
- General t3.large: 2 vCPU, 8 GB memory
- Compute-optimized c6g.xlarge: 4 vCPU, 8 GB memory
- Memory-optimized r6g.large: 2 vCPU, 16 GB memory
At the time of testing we were on t2.medium, which is the most affordable and the smallest I thought would be sufficient.
For one-term queries, the difference in speed across instance types was small. For longer queries, the number of vCPUs and the amount of memory made a big difference.
Query: native american tribes in oregon
- t2.medium: 55 seconds
- t3.large: 30 seconds
- r6g.large: 22 seconds
- c6g.xlarge: 21 seconds
Memory:
- Increasing from 4 GB on the t2.medium to 8 GB on the t3.large and c6g.xlarge nearly doubled the speed of my queries, even for two-word queries like "women's march" (20 seconds on t2.medium, 8-10 seconds on the others).
- Increasing from 8 GB to 16 GB on a memory-optimized instance didn't affect the speed. The Java process used only 10% of the available memory, versus 50-60% on the other instance types, but queries weren't any faster, even after increasing the maximum heap size in the basexhttp startup script. This could be because heap size isn't the right property to adjust.
CPU:
- Increasing from 2 vCPU to 4 vCPU on a compute-optimized instance improved speeds only for extreme queries of 4 words or more.
papers
- t3.large: 6 seconds
- c6g.xlarge: 7 seconds

family papers
- t3.large: 18 seconds
- c6g.xlarge: 18 seconds

adams family papers
- t3.large: 15 seconds
- c6g.xlarge: 14 seconds

adams family papers utah
- t3.large: 25 seconds
- c6g.xlarge: 21 seconds
For most queries our users perform, which are short and restricted to one repository, the difference is negligible and didn't justify the added cost of a compute-optimized instance ($40 per month vs. $60 per month).
-Tamara
Tamara,
Thanks for posting these results—that’s really useful information. Hardware optimization is usually the easiest solution if you can afford it…
Cheers,
E.
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
https://www.servicenow.com
Hi Tamara,
Thanks a lot for sharing your interesting experiences with BaseX.
You mentioned that you are working with various custom indexes. Have you also considered adding an auxiliary index element to your main databases?
for $ead in db:open($db)//ead
return insert node element index { ft:tokenize($ead) } into $ead,
db:optimize($db)
You could then simplify your query to something like the following:
for $db_id in tokenize($d, '|')
for $text in ft:search($db_id, $terms, map{'mode':'all words','fuzzy':$f})
let $ead := $text/ancestor::ead update { delete node index }
return <ark>{ $ead }</ark>
In addition:
• The size of the full-text index can be reduced further by setting FTINCLUDE to this index element.
• If you are not interested in word order, you can remove duplicates via distinct-values(ft:tokenize($ead)), as sketched below.
• Alternatively, the index strings could be stored in a custom index database, or at least under a distinct path; that way, there would be no need to remove the 'index' element before returning the result.
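For instance, the deduplication in the second point could be folded into the insertion query above (same assumptions about the database layout as in that sketch):

for $ead in db:open($db)//ead
return insert node element index { distinct-values(ft:tokenize($ead)) } into $ead,
db:optimize($db)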
Some time ago, we proposed to a user modifying FTINCLUDE to index elements instead of text nodes [1]. There was no further discussion of that approach, but I think it would be helpful in many use cases, including yours. Do you have an opinion on the suggestion we made?
Best, Christian
[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg12081.htm...
Thanks for the suggestions, Christian! I previously tried to make indexes of string-join($ead//text(), ' '), and it didn't seem faster, so I gave up early. I tried again and persisted until the searches did get faster.
Previously, when I wrote up my results, if a search for "native" took 2 seconds, "american" took 10 seconds, and "pottery" took 5 seconds, the full-text search in "any" mode for "native american pottery" took 17 seconds. Searching a dedicated index of tokens instead of the original documents, search time is pretty much constant whether the query is "cats" or "cats dogs apples bananas oranges washington state photographs." Total speed is now affected mostly by how many records are returned: "photographs" on its own takes 12 seconds because it returns 17K records, while the ridiculous cats...photographs query takes 3 seconds because it returns only 19 records.
The reason my initial tests were slow is that I constructed the text index with only the whole-document strings and attributes for file paths, then used the path in doc($file) to open the original and fetch other fields for ranking and sorting. That turns a 2-second query into a 30-second query. Now I put all the fields I need into the same index, with FTINCLUDE on the "tokens" node only, so I can grab them all from the results of ft:search('text-index', $terms, map{'mode':'all'})/ancestor::ead very quickly.
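As an illustration, such an index database could be created with db:create() and FTINCLUDE restricted to the "tokens" element; the sample entry here is abbreviated from the example later in this message, and the database and document names are only placeholders:

db:create(
  'text-index',
  document {
    <eads>
      <ead ark="80444/xv60886">
        <title>Friends of the Library Records</title>
        <tokens>friends library records oregon state university</tokens>
      </ead>
    </eads>
  },
  'text-index.xml',
  map{ 'ftindex': true(), 'ftinclude': 'tokens' }
)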
Another reason my experiments were slow is that I tried to use ft:count() to get the number of hits in the text and use it in our ranking calculations. That also slows the query down considerably. I switched to using the score included in ft:search() and doctoring it to boost certain fields.
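As a sketch of that kind of score doctoring (the field name and weight here are hypothetical, not our actual ranking formula):

for $ead score $s in ft:search('text-index', $terms, map{'mode':'all'})/ancestor::ead
(: double the score of entries whose title also matches the terms :)
let $boosted := if (ft:contains($ead/title, $terms)) then 2 * $s else $s
order by $boosted descending
return $ead/@ark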
Finally, I found that the stopwords option was not taking effect, so our full-text index was more bloated than necessary. When I set FTINDEX and FTINCLUDE before calling CREATE DB, db:optimize('text-index') in queries is enough. But when I set the STOPWORDS path before creation, or as a global constant in .basex, and then try db:optimize() in queries, INFO INDEX shows the top terms as "the," "a," etc. The stopwords work only if I specify the option in the query itself, like db:optimize('text-index', true(), map{'stopwords': $path}).
An outstanding issue: our users want to search for exact phrases by surrounding terms in quotes. I accomplished this before stopwords were working by splitting the terms and concatenating them with bars before sending them to XQuery.
User query: oregon university "friends of the library" records
External variable $q: oregon|university|friends of the library|records

let $terms := tokenize($q, '|')
for $result score $basex_score in ft:search('text-index', $terms, map{'mode':'all','fuzzy':$f})/ancestor::ead
[etc.]
Now the term "friends of the library" has no matches. Cutting out the stopwords beforehand and sending just "friends library" also results in no matches.
How does ft:search() handle phrases that contain stopwords? Do I need to somehow strip stopwords out of my tokenized strings before inserting them in the index?
Example index entry:
<ead ark="80444/xv60886"> <title>Friends of the Library Records</title> <date>20150421</date> <tokens> orerg002 xml guide to the friends of the library records 1934 1996 oregon state university friends of the library records funding for encoding this finding aid was provided through a grant awarded by the national endowment for the humanities [etc.] </tokens> </ead>
-Tamara
Instead of using the stopwords option, I added functions to remove the stopwords from the tokenized text of each document, and I also remove them from users' queries, so a search for "friends of the library" will be sent as "friends library," and the tokenized text will likewise contain "friends library."
In our module:
declare function aw:remove_stopwords($tokens as xs:string*, $stopwords as xs:string+) {
  let $new_tokens :=
    for $token in $tokens
    return
      if (not(exists(index-of($stopwords, $token))))
      then $token
      else ()
  return string-join($new_tokens, ' ')
};
Then what gets indexed is:

aw:remove_stopwords(ft:tokenize(string-join($ead//text(), ' ')), $stopwords)
This might not be the most efficient method, but I'm less concerned with the speed of indexing than I am with the speed of searching.
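For symmetry, a sketch of the query side under the same assumptions, reusing the bar-delimited $q convention from my earlier message:

let $terms :=
  for $term in tokenize($q, '|')
  return aw:remove_stopwords(ft:tokenize($term), $stopwords)
for $result score $basex_score in
  ft:search('text-index', $terms, map{'mode':'all','fuzzy':$f})/ancestor::ead
return $result/@ark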
-Tamara
Hi Tamara,
I assume that many of my thoughts are already known to you, so simply skip them in that case:
While ft:search is pretty fast, it's often the subsequent traversal of ancestor steps that consumes most of the time. In some cases, it can already make a difference if you use "parent::specific-name" or ".." instead. That is something we'd attempt to tackle with the proposal I indicated in my last reply, and which we'll definitely pursue further.
If you have numerous databases, it might also be the initial opening of the ~50 databases that takes additional time. That could be tackled by storing the index entries for all databases in one separate index database, referencing the names and ids of the original databases, and addressing them with db:open-id (I assume that’s how your other custom indexes are working).
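A minimal sketch of that layout, with hypothetical element and attribute names; db:open-id resolves a stored node ID back to a node in the original database:

(: hypothetical entry in a combined index database 'index-all':
   <entry db="archive01" id="12345"><tokens>friends library records</tokens></entry> :)
for $text in ft:search('index-all', $terms, map{'mode':'all words'})
let $entry := $text/ancestor::entry
return db:open-id($entry/@db, xs:integer($entry/@id))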
What’s the total size of your 47 databases?
Finally I found that the stopwords option was not taking effect, so our full-text index was more bloated than necessary. When I set FTINDEX and FTINCLUDE before calling CREATE DB, db:optimize('text-index') in queries is enough. But when I set the STOPWORDS path before creation or as a global constant in .basex, then try db:optimize() in queries, INFO INDEX shows the top terms as "the", "a" etc. The stopwords work if I specify the option in queries, like db:optimize('text-index', true(), map{'stopwords': $path}).
True; I think that should be better documented, or we could think about storing the original stopword files along with the database (as they might get lost otherwise).
How does ft:search() handle phrases that contain stopwords?
To be honest, I never quite understood the rationale behind the XQFT semantics, and I have to remind myself every time I use the feature: stopwords are not supposed to be considered when comparing texts, but the original token positions are preserved. This means that, for your little example and without a full-text index, the following query …
//title[. contains text 'Friends of the Library' using stop words at 'stopwords.txt']
… will yield results if "the" and "of" are contained in your stopword list. The next query will also yield results …
//title[. contains text 'Friends the of Library' using stop words at 'stopwords.txt']
…but the following one won’t:
//title[. contains text 'Friends of Library' using stop words at 'stopwords.txt']
All this accounts for the fact that we’ve never fully embedded support for stopwords in our own functions (ft:search, ft:contains), and it’s a good approach to remove stopwords by yourself before indexing and querying data. If we decide to improve the support in a future version, we will ignore some rules of the official specification and make things more intuitive.
This might not be the most efficient method, but I'm less concerned with the speed of indexing than I am with the speed of searching.
…and if all other performance issues have been settled, you could optimize this as follows:
declare function local:remove_stopwords($tokens as xs:string*, $stopwords as xs:string+) {
  string-join($tokens[not(. = $stopwords)], ' ')
};
Hope this helps, Christian
Thanks for the detailed explanations and the more efficient function, Christian!
Yes, for our other indexes I have only one database representing all repositories. For the text I was concerned one index would be too big, because some of our finding aids run to 2,000 pages when printed out, so I made 47 text indexes for the 47 document databases. But this morning I tested creating a single index, and the size isn't a problem. It's slightly faster, too. There's a minor tradeoff between faster times querying all databases and faster times querying one database (which is what most of our users want to do), but the differences are less than a second.
47 databases "text" || $id Time to query Washington apple farm across all databases: ~2.67 seconds Time to query Washington apple farm in the largest database: ~0.77 seconds Time to query Washington apple farm in an average-sized database: ~0.25 seconds Average documents: 919 (largest repository 7,616; smallest 1) Average size of index database: 11,306,034 Average full-text index size: 6,367 kB Average entries: 34,631 XQuery: for $db_id in tokenize($d, '|') for $result score $basex_score in ft:search('text' || $db_id, $terms, map{'mode':'all','fuzzy':$f})/ancestor::ead let $ark := string($result/@ark) [etc.]
Single database "index-text" Time to query Washington apple farm across all databases: ~1.75 seconds Time to query Washington apple farm in the largest database: ~0.94 seconds Time to query Washington apple farm in an average-sized database: ~0.10 seconds Represents 44,091 documents Size of index database: 595,687,360 Full-text index: 344 MB Entries: 599,405 XQuery: let $dbs := tokenize($d, '|') for $result score $basex_score in ft:search('index-text', $terms, map{'mode':'all','fuzzy':$f})/ancestor::ead where $result/@db=$dbs let $ark := string($result/@ark) [etc.]
I also compared moving up the index entries with /parent::tokens/parent::ead instead of /ancestor::ead and didn't see a difference. I'm sure it would make a big difference in our original documents with tons of nested nodes, but not so much in the condensed text index.
-Tamara
Thanks for sharing your performance tests! I have just created a new GitHub issue for my thoughts on indexing specific full-text elements [1].
[1] https://github.com/BaseXdb/basex/issues/2079