Thanks for the detailed explanations and the more efficient function, Christian!

Yes, for our other indexes I have only one database representing all repositories. For the text I was concerned one index would be too big, because some of our finding aids are 2,000 pages when printed out, so I made 47 text indexes for the 47 document databases. But this morning I tested creating a single index, and the size isn't a problem. It is slightly faster, too. There's a minor tradeoff between faster times querying all databases vs faster times querying one database, which is what most of our users want to do, but they're differences of less than a second.

47 databases "text" || $id
Time to query Washington apple farm across all databases: ~2.67 seconds
Time to query Washington apple farm in the largest database: ~0.77 seconds
Time to query Washington apple farm in an average-sized database: ~0.25 seconds
Average documents: 919 (largest repository 7,616; smallest 1)
Average size of index database: 11,306,034
Average full-text index size: 6,367 kB
Average entries: 34,631
XQuery:
for $db_id in tokenize($d, '\|')
    for $result score $basex_score in ft:search('text' || $db_id, $terms, map{'mode':'all','fuzzy':$f})/ancestor::ead
      let $ark := string($result/@ark) [etc.]

Single database "index-text"
Time to query Washington apple farm across all databases: ~1.75 seconds
Time to query Washington apple farm in the largest database: ~0.94 seconds
Time to query Washington apple farm in an average-sized database: ~0.10 seconds
Represents 44,091 documents
Size of index database: 595,687,360
Full-text index: 344 MB
Entries: 599,405
XQuery:
let $dbs := tokenize($d, '\|')
for $result score $basex_score in ft:search('index-text', $terms, map{'mode':'all','fuzzy':$f})/ancestor::ead
    where $result/@db=$dbs
      let $ark := string($result/@ark) [etc.]

I also compared moving up the index entries with /parent::tokens/parent::ead instead of /ancestor::ead and didn't see a difference. I'm sure it would make a big difference in our original documents with tons of nested nodes, but not so much in the condensed text index.

-Tamara


On Fri, Mar 4, 2022 at 2:47 AM Christian Grün <christian.gruen@gmail.com> wrote:
Hi Tamara,

I assume that many of my thoughts are already known to you, so simply
skip them in them just in case:

While ft:search is pretty fast, it’s often the subsequent traversal of
ancestor steps that consumes most of the time. In some cases, it can
already make a difference if you use "parent::specific-name" or "..").
That would be something we’d attempt to tackle with the proposal I
indicated in my last reply, and which we’ll definitely pursue further.

If you have numerous databases, it might also be the initial opening
of the ~50 databases that takes additional time. That could be tackled
by storing the index entries for all databases in one separate index
database, referencing the names and ids of the original databases, and
addressing them with db:open-id (I assume that’s how your other custom
indexes are working).

What’s the total size of your 47 databases?

 > Finally I found that the stopwords option was not taking effect, so
our fulltext index was more bloated than necessary. When I set FTINDEX
and FTINCLUDE before calling CREATE DB, in queries
db:optimize('text-index') is enough. But when I set the STOPWORDS path
before creation or as a global constant in .basex, then try
db:optimize() in queries, INFO INDEX shows the top terms as "the", "a"
etc. The stopwords work if I specify the option in queries, like
db:optimize('text-index', true(), map{'stopwords': $path}).

True; I think that should be better documented, or we could think
about storing original stopword files along with database (as they
might get lost otherwise).

> How does ft:search() handle phrases that contain stopwords?

To be honest, I never quite understood the rationale behind the XQFT
semantics, and I need to remember every time I use the feature:
Stopwords are not supposed to be considered when comparing texts, but
the original positions will be preserved. This means that – for your
little example and without full-text index – the following query …

//title[. contains text 'Friends of the Library' using stop words at
'stopwords.txt']

… will yield results if "the" and "of" is contained in your stopword
list. The next query will also yields results …

//title[. contains text 'Friends the of Library' using stop words at
'stopwords.txt']

…but the following one won’t:

//title[. contains text 'Friends of Library' using stop words at
'stopwords.txt']

All this accounts for the fact that we’ve never fully embedded support
for stopwords in our own functions (ft:search, ft:contains), and it’s
a good approach to remove stopwords by yourself before indexing and
querying data. If we decide to improve the support in a future
version, we will ignore some rules of the official specification and
make things more intuitive.

> This might not be the most efficient method, but I'm less concerned with the speed of indexing than I am with the speed of searching.

…and if all other performance issues have been settled, you could
optimize this as follows:

declare function local:remove_stopwords($tokens as xs:string*,
$stopwords as xs:string+)  {
  string-join($tokens[not(. = $stopwords)], ' ')
};

Hope this helps,
Christian


--

Tamara Marnell
Program Manager, Systems
Orbis Cascade Alliance (orbiscascade.org)
Pronouns: she/her/hers