Instead of using the stopwords option, I added functions to remove the stopwords from the tokenized text of each document, and I also remove them from users' queries, so a search for "friends of the library" is sent as "friends library" and the tokenized text is also "friends library."
In our module:
declare function aw:remove_stopwords($tokens as xs:string*, $stopwords as xs:string+) {
  let $new_tokens :=
    for $token in $tokens
    return if (not(exists(index-of($stopwords, $token)))) then $token else ()
  return string-join($new_tokens, ' ')
};
Then what gets indexed is: aw:remove_stopwords(ft:tokenize(string-join($ead//text(), ' ')), $stopwords)
This might not be the most efficient method, but I'm less concerned with the speed of indexing than I am with the speed of searching.
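For reference, a rough sketch of how one entry of the dedicated index database can be built with the stopword-free tokens. The module URI, database name, stopword file, and ark attribute below are placeholders, not our actual configuration; the element names follow the example index entry quoted further down in this thread.

import module namespace aw = "urn:example:aw" at "aw.xqm";  (: placeholder module URI and location :)

(: one stopword per line in a plain-text file (placeholder path) :)
declare variable $stopwords := tokenize(unparsed-text('stopwords.txt'), '\r?\n')[. != ''];

for $ead in db:open('finding-aids')//ead   (: placeholder name for the primary database :)
(: the ark attribute mirrors the example entry below and is purely illustrative :)
return
  <ead ark="{ $ead/@ark }">
    <tokens>{
      aw:remove_stopwords(ft:tokenize(string-join($ead//text(), ' ')), $stopwords)
    }</tokens>
  </ead>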
-Tamara
On Thu, Mar 3, 2022 at 10:32 AM Tamara Marnell tmarnell@orbiscascade.org wrote:
Thanks for the suggestions, Christian! I previously tried to make indexes of string-join($ead//text(), ' '), and it didn't seem faster, so I gave up early. I tried again and persisted until the searches did get faster.
Previously, when I wrote up my results, if a search for "native" took 2 seconds, "american" took 10 seconds, and "pottery" took 5 seconds, the full-text search in "any" mode for "native american pottery" took 17 seconds. Searching a dedicated index of tokens instead of the original documents, the search time is pretty much constant whether the query is "cats" or "cats dogs apples bananas oranges washington state photographs." Total speed is now affected mostly by how many records are returned: "photographs" on its own takes 12 seconds because it returns 17K records, while the ridiculous cats...photographs query takes 3 seconds because it returns only 19 records.
The reason my initial tests were slow is that I constructed the text index with only the whole-document strings and attributes for file paths, then used the path in doc($file) to open the original and get other fields for ranking and sorting. This turns a 2-second query into a 30-second query. Now I put all the fields I need into the same index, with FTINCLUDE on the "tokens" node only, so I can grab them all from the results of ft:search('text-index', $terms, map{'mode':'all'})/ancestor::ead very quickly.
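To illustrate that last point, the search loop now looks roughly like this; the <hit> output element is simplified, but title, date, and ark come straight from the index entry shown at the end of this message, with no doc() calls:

declare variable $q external;  (: bar-delimited query terms, as described below :)

let $terms := tokenize($q, '\|')
for $result score $basex_score in
  ft:search('text-index', $terms, map{'mode':'all'})/ancestor::ead
let $title := $result/title/string()   (: fields live in the same index entry :)
let $date  := $result/date/string()
order by $basex_score descending
return <hit ark="{ $result/@ark }" date="{ $date }">{ $title }</hit>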
Another reason my experiments were slow is that I tried to use ft:count() to get the number of hits in the text and use it in our ranking calculations. This also slows down the query considerably. I switched to using the score included in ft:search() and doctoring it to boost certain fields.
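As a rough illustration of what I mean by doctoring the score (the boost condition and factor here are made up, not our real ranking):

for $result score $basex_score in
  ft:search('text-index', $terms, map{'mode':'all'})/ancestor::ead
(: boost records whose title also matches, instead of calling ft:count() :)
let $boost := if (ft:contains($result/title, $terms)) then 2 else 1
order by $basex_score * $boost descending
return $result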
Finally, I found that the stopwords option was not taking effect, so our full-text index was more bloated than necessary. When I set FTINDEX and FTINCLUDE before calling CREATE DB, db:optimize('text-index') in queries is enough. But when I set the STOPWORDS path before creation, or as a global constant in .basex, and then try db:optimize() in queries, INFO INDEX shows the top terms as "the", "a", etc. The stopwords only work if I specify the option in queries, like db:optimize('text-index', true(), map{'stopwords': $path}).
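For completeness, the working variant written out, with a placeholder path and with the full-text options spelled out in the options map (in my setup FTINDEX and FTINCLUDE are already set at creation time, so strictly only the stopwords entry should be needed here):

db:optimize(
  'text-index',
  true(),
  map {
    'ftindex'  : true(),
    'ftinclude': 'tokens',
    'stopwords': '/path/to/stopwords.txt'  (: placeholder path :)
  }
)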
An outstanding issue: our users want to search for exact phrases by surrounding terms in quotes. I accomplished this before stopwords were working by splitting the terms and concatenating them with bars before sending them to XQuery.
User query: oregon university "friends of the library" records
External variable $q: oregon|university|friends of the library|records

let $terms := tokenize($q, '\|')
for $result score $basex_score in
  ft:search('text-index', $terms, map{'mode':'all','fuzzy':$f})/ancestor::ead
etc.
Now the term "friends of the library" has no matches. Cutting out the stopwords beforehand and sending just "friends library" also results in no matches.
How does ft:search() handle phrases that contain stopwords? Do I need to somehow strip stopwords out of my tokenized strings before inserting them in the index?
Example index entry:
<ead ark="80444/xv60886">
  <title>Friends of the Library Records</title>
  <date>20150421</date>
  <tokens>
    orerg002 xml guide to the friends of the library records 1934 1996 oregon
    state university friends of the library records funding for encoding this
    finding aid was provided through a grant awarded by the national endowment
    for the humanities [etc.]
  </tokens>
</ead>
-Tamara
On Mon, Feb 28, 2022 at 7:41 AM Christian Grün christian.gruen@gmail.com wrote:
Hi Tamara,
Thanks a lot for sharing your interesting experiences with BaseX.
You mentioned that you are working with various custom indexes. Have you also considered adding an auxiliary index element to your main databases?
for $ead in db:open($db)//ead
return insert node <index>{ ft:tokenize($ead) }</index> into $ead,
db:optimize($db)
You could then simplify your query to something like the following:
for $db_id in tokenize($d, '\|')
for $text in ft:search($db_id, $terms, map{'mode':'all words','fuzzy':$f})
let $ead := $text/ancestor::ead update { delete node index }
return <arg>{ $ead }</arg>
In addition:
• The size of the full-text index can be reduced by setting FTINCLUDE to this index element.
• If you are not interested in word order, you could remove duplicates via distinct-values(ft:tokenize($ead)).
• As an alternative, the index strings could also be stored in a custom index database, or at least at a distinct path; this way, there would be no need to remove the 'index' element before returning the result.
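A quick sketch of that last variant (the database and path names are made up): the token strings go into a separate database, one entry per source document, so the main documents stay untouched.

for $ead in db:open('finding-aids')//ead   (: placeholder name for the main database :)
return db:add(
  'token-index',                           (: separate, dedicated index database :)
  <index>{ ft:tokenize($ead) }</index>,
  'tokens/' || db:path($ead)               (: one entry per source document :)
)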
Some time ago, we proposed to a user to modify FTINCLUDE and index elements instead of text nodes [1]. There was no further discussion on that approach, but I think it would be helpful in many use cases, including yours. Do you have an opinion about the suggestion we made?
Best, Christian
[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg12081.htm...
--
Tamara Marnell
Program Manager, Systems
Orbis Cascade Alliance (https://www.orbiscascade.org/)
Pronouns: she/her/hers