Instead of using the stopwords option, I added functions to remove the stopwords from the tokenized text of each document, and I also remove them from users' queries, so a search for "friends of the library" is sent as "friends library" and the tokenized text is also "friends library."
In our module:
declare function aw:remove_stopwords($tokens as xs:string*, $stopwords as xs:string+) {
  let $new_tokens :=
    for $token in $tokens
    return if (not(exists(index-of($stopwords, $token)))) then $token else ()
  return string-join($new_tokens, ' ')
};
Then what gets indexed is: aw:remove_stopwords(ft:tokenize(string-join($ead//text(), ' ')), $stopwords)
This might not be the most efficient method, but I'm less concerned with the speed of indexing than I am with the speed of searching.
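For reference, a rough sketch of how one entry of the dedicated index database can be built with the stopword-free tokens. The module URI, database name, stopword file, and ark attribute below are placeholders, not our actual configuration; the element names follow the example index entry quoted further down in this thread.

import module namespace aw = "urn:example:aw" at "aw.xqm";  (: placeholder module URI and location :)

(: one stopword per line in a plain-text file (placeholder path) :)
declare variable $stopwords := tokenize(unparsed-text('stopwords.txt'), '\r?\n')[. != ''];

for $ead in db:open('finding-aids')//ead   (: placeholder name for the primary database :)
(: the ark attribute mirrors the example entry below and is purely illustrative :)
return
  <ead ark="{ $ead/@ark }">
    <tokens>{
      aw:remove_stopwords(ft:tokenize(string-join($ead//text(), ' ')), $stopwords)
    }</tokens>
  </ead>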
-Tamara
On Thu, Mar 3, 2022 at 10:32 AM Tamara Marnell tmarnell@orbiscascade.org wrote:
Thanks for the suggestions, Christian! I previously tried to make indexes of string-join($ead//text(), ' '), and it didn't seem faster, so I gave up early. I tried again and persisted until the searches did get faster.
Previously, when I wrote up my results, if a search for "native" took 2 seconds, "american" took 10 seconds, and "pottery" took 5 seconds, the full-text search in "any" mode for "native american pottery" took 17 seconds. Searching a dedicated index of tokens instead of the original documents, the search time is pretty much constant whether the query is "cats" or "cats dogs apples bananas oranges washington state photographs." Total speed is now affected mostly by how many records are returned: "photographs" on its own takes 12 seconds because it returns 17K records, while the ridiculous cats...photographs query takes 3 seconds because it returns only 19 records.
The reason my initial tests were slow is that I constructed the text index with only the whole-document strings and attributes for file paths, then used the path in doc($file) to open the original and get other fields for ranking and sorting. This turns a 2-second query into a 30-second query. Now I put all the fields I need into the same index, with FTINCLUDE on the "tokens" node only, so I can grab them all from the results of ft:search('text-index', $terms, map{'mode':'all'})/ancestor::ead very quickly.
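To illustrate that last point, the search loop now looks roughly like this; the <hit> output element is simplified, but title, date, and ark come straight from the index entry shown at the end of this message, with no doc() calls:

declare variable $q external;  (: bar-delimited query terms, as described below :)

let $terms := tokenize($q, '\|')
for $result score $basex_score in
  ft:search('text-index', $terms, map{'mode':'all'})/ancestor::ead
let $title := $result/title/string()   (: fields live in the same index entry :)
let $date  := $result/date/string()
order by $basex_score descending
return <hit ark="{ $result/@ark }" date="{ $date }">{ $title }</hit>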
Another reason my experiments were slow is that I tried to use ft:count() to get the number of hits in the text and use it in our ranking calculations. This also slows down the query considerably. I switched to using the score included in ft:search() and doctoring it to boost certain fields.
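As a rough illustration of what I mean by doctoring the score (the boost condition and factor here are made up, not our real ranking):

for $result score $basex_score in
  ft:search('text-index', $terms, map{'mode':'all'})/ancestor::ead
(: boost records whose title also matches, instead of calling ft:count() :)
let $boost := if (ft:contains($result/title, $terms)) then 2 else 1
order by $basex_score * $boost descending
return $result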
Finally, I found that the stopwords option was not taking effect, so our full-text index was more bloated than necessary. When I set FTINDEX and FTINCLUDE before calling CREATE DB, db:optimize('text-index') in queries is enough. But when I set the STOPWORDS path before creation, or as a global constant in .basex, and then try db:optimize() in queries, INFO INDEX shows the top terms as "the", "a", etc. The stopwords only work if I specify the option in queries, like db:optimize('text-index', true(), map{'stopwords': $path}).
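For completeness, the working variant written out, with a placeholder path and with the full-text options spelled out in the options map (in my setup FTINDEX and FTINCLUDE are already set at creation time, so strictly only the stopwords entry should be needed here):

db:optimize(
  'text-index',
  true(),
  map {
    'ftindex'  : true(),
    'ftinclude': 'tokens',
    'stopwords': '/path/to/stopwords.txt'  (: placeholder path :)
  }
)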
An outstanding issue: our users want to search for exact phrases by surrounding terms in quotes. I accomplished this before stopwords were working by splitting the terms and concatenating them with bars before sending them to XQuery.
User query: oregon university "friends of the library" records
External variable $q: oregon|university|friends of the library|records

let $terms := tokenize($q, '\|')
for $result score $basex_score in
  ft:search('text-index', $terms, map{'mode':'all','fuzzy':$f})/ancestor::ead
etc.
Now the term "friends of the library" has no matches. Cutting out the stopwords beforehand and sending just "friends library" also results in no matches.
How does ft:search() handle phrases that contain stopwords? Do I need to somehow strip stopwords out of my tokenized strings before inserting them in the index?
Example index entry:
<ead ark="80444/xv60886">
  <title>Friends of the Library Records</title>
  <date>20150421</date>
  <tokens>
    orerg002 xml guide to the friends of the library records 1934 1996 oregon
    state university friends of the library records funding for encoding this
    finding aid was provided through a grant awarded by the national endowment
    for the humanities [etc.]
  </tokens>
</ead>
-Tamara
On Mon, Feb 28, 2022 at 7:41 AM Christian Grün christian.gruen@gmail.com wrote:
Hi Tamara,
Thanks a lot for sharing your interesting experiences with BaseX.
You mentioned that you are working with various custom indexes. Have you also considered adding an auxiliary index element to your main databases?
for $ead in db:open($db)//ead
return insert node <index>{ ft:tokenize($ead) }</index> into $ead,
db:optimize($db)
You could then simplify your query to something like the following:
for $db_id in tokenize($d, '\|')
for $text in ft:search($db_id, $terms, map{'mode':'all words','fuzzy':$f})
let $ead := $text/ancestor::ead update { delete node index }
return <arg>{ $ead }</arg>
In addition:
• The size of the full-text index can be reduced by setting FTINCLUDE to this index element.
• If you are not interested in word order, you could remove duplicates via distinct-values(ft:tokenize($ead)).
• As an alternative, the index strings could also be stored in a custom index database, or at least at a distinct path; this way, there would be no need to remove the 'index' element before returning the result.
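A quick sketch of that last variant (the database and path names are made up): the token strings go into a separate database, one entry per source document, so the main documents stay untouched.

for $ead in db:open('finding-aids')//ead   (: placeholder name for the main database :)
return db:add(
  'token-index',                           (: separate, dedicated index database :)
  <index>{ ft:tokenize($ead) }</index>,
  'tokens/' || db:path($ead)               (: one entry per source document :)
)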
Some time ago, we proposed to a user to modify FTINCLUDE and index elements instead of text nodes [1]. There was no further discussion on that approach, but I think it would be helpful in many use cases, including yours. Do you have an opinion about the suggestion we made?
Best, Christian
[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg12081.htm...
--
Tamara Marnell
Program Manager, Systems
Orbis Cascade Alliance (https://www.orbiscascade.org/)
Pronouns: she/her/hers