The indexes I’m constructing are a where-used index and a topic-to-bundle index.
The strategy I have working for both indexes is a single top-level document per index, containing a flat list of index entry elements, one for each topic, e.g.:
<doc-where-used-index>
  <where-used-entry key="/pce-test-data-01/administer/tablet-mobile-ui/task/list-filter-sorting.dita"
                    tagname="task" class="topic/topic task/task" id="list-filter-sorting">
    <title>Configure sorting capabilities within mobile filters</title>
    <conrefs/>
    <topicrefs/>
    <doc>
      <noderef node-id="2493717" database="pce-test-data-01" tagname="task"
               baseuri="/pce-test-data-01/administer/tablet-mobile-ui/task/list-filter-sorting.dita"/>
    </doc>
    <xrefs>
      <noderef node-id="2476418" database="pce-test-data-01" tagname="xref"
               baseuri="/pce-test-data-01/administer/tablet-mobile-ui/concept/mobile-list-filters.dita"
               href="../task/list-filter-sorting.dita"/>
    </xrefs>
  </where-used-entry>
  …
</doc-where-used-index>
I then have some utility functions to resolve <noderef> elements back to nodes, and the index works great.
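The core of the resolution is roughly this (a sketch, assuming BaseX 9.x, where db:open-id() maps a database name and internal node id back to a node; the function name is just illustrative):

declare function local:resolve-noderef($noderef as element(noderef)) as node()? {
  (: Look up the node by its database name and pre-assigned node id :)
  db:open-id(string($noderef/@database), xs:integer($noderef/@node-id))
};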
By using a single document per index I can use the “construct the index document and then either create the database or replace the existing document in one go” model shown in the custom index example. Otherwise, as far as I can determine, one has to ensure that the database that holds the index already exists, since you can’t create an index database and then separately add to it within a single query. Alternatively, I could construct a very large sequence of individual document nodes and add those to the index database as it’s created; I suspect it comes to the same thing, but I haven’t tried it.
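The create-or-replace pattern I mean looks roughly like this (a sketch; the database name “where-used-index”, the path “where-used.xml”, and the variable $index-doc are placeholders, and db:replace() is the BaseX 9.x function):

(: Replace the index document if the database exists, create it otherwise :)
if (db:exists('where-used-index'))
then db:replace('where-used-index', 'where-used.xml', $index-doc)
else db:create('where-used-index', $index-doc, 'where-used.xml')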
Using the where-used index to calculate the topic-to-bundle index, it takes about 50 ms per topic or map to determine the bundle (on my laptop), which is still 10x slower than I’d like but certainly tolerable (at 50 ms per topic it takes about 7.8 minutes to process 9,400 topics). I’d like to know if there are things I can do to reduce this time, but I can take that up later; the current result is more than good enough for my immediate purposes (which is to report data about the topics grouped by bundle, thus the need for the topic-to-bundle index).
From the topic-to-bundle index I can generate a JSON representation almost instantly, by generating the JSON XML representation and then serializing it (this JSON is then consumed by an XSLT running elsewhere, at least for now).
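The approach is roughly this (a sketch using the W3C JSON XML vocabulary and fn:xml-to-json(); the entry shown is made up):

let $json-xml :=
  <map xmlns="http://www.w3.org/2005/xpath-functions">
    <string key="some-topic.dita">some-bundle</string>
  </map>
(: Serialize the XML representation to a JSON string :)
return fn:xml-to-json($json-xml)

That query returns {"some-topic.dita":"some-bundle"}.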
What I haven’t done yet is implement updating these indexes to reflect file changes from git repo updates. That should be a relatively simple application of XQuery Update, but I’m not sure what the performance implications are of modifying individual nodes within a single document, as opposed to replacing entire documents (i.e., if I made each index entry a separate document).
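The single-entry update I have in mind would be something like this sketch (assuming $changed-key and $new-entry are computed from the git change set; the database name is a placeholder):

declare variable $changed-key as xs:string external;
declare variable $new-entry as element(where-used-entry) external;

(: Swap out the stale entry for the recomputed one :)
for $entry in db:open('where-used-index')//where-used-entry[@key = $changed-key]
return replace node $entry with $new-entry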
I also determined that constructing an XQuery map from the index data is very slow (clearly a “don’t do that” kind of thing), while constructing a JSON XML representation of the index is very fast. Not a surprising result, but worth confirming.
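For clarity, the slow pattern is building one big map over all the entries, along these lines (a sketch; the database name is a placeholder):

(: Build a single map keyed by topic URI; slow at this scale :)
map:merge(
  for $entry in db:open('where-used-index')//where-used-entry
  return map:entry(string($entry/@key), $entry)
)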
I’ll be refining this system as I hammer it into a server-based web application for reporting information about our entire corpus of topics as they change over time.
Cheers,
E.
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
From: Christian Grün <christian.gruen@gmail.com>
Date: Monday, January 24, 2022 at 6:57 AM
To: Eliot Kimber <eliot.kimber@servicenow.com>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Strategy for Persisting Maps that Contain Nodes: db:node-id()
> My approach is to create a separate element for each index entry, rather than creating a single element that then contains all the index entries as shown in the index construction example in the docs.
You mean you don’t group the nodes by the index key, as shown in the
docs? That should be fine as well. If the entries are grouped, a
single element may get larger, but the overall number of nodes to be
added or replaced will be smaller. If single entries need to be
updated in your scenario (e.g. because the key changes), grouping
might not be the solution, though.
There are usually various solutions for achieving the same goal. The
presented example is fairly simple indeed (most of our index
structures in real-world applications are certainly more complex). I
guess that 16 GB should be more than sufficient for a 70 MB index
database, but feel free to share your experiences.