The indexes I’m constructing are a where-used index and a topic-to-bundle index.
The strategy I have working for both indexes is a single top-level document per index, containing a flat list of index entry elements, one for each topic, e.g.:
<doc-where-used-index>
  <where-used-entry key="/pce-test-data-01/administer/tablet-mobile-ui/task/list-filter-sorting.dita"
                    tagname="task" class="topic/topic task/task" id="list-filter-sorting">
    <title>Configure sorting capabilities within mobile filters</title>
    <conrefs/>
    <topicrefs/>
    <doc>
      <noderef node-id="2493717" database="pce-test-data-01" tagname="task"
               baseuri="/pce-test-data-01/administer/tablet-mobile-ui/task/list-filter-sorting.dita"/>
    </doc>
    <xrefs>
      <noderef node-id="2476418" database="pce-test-data-01" tagname="xref"
               baseuri="/pce-test-data-01/administer/tablet-mobile-ui/concept/mobile-list-filters.dita"
               href="../task/list-filter-sorting.dita"/>
    </xrefs>
  </where-used-entry>
  …
</doc-where-used-index>
I then have some utility functions to resolve <noderef> elements back to nodes, and the index works great.
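The core of the resolution is roughly this (a sketch, assuming BaseX 9.x, where db:open-id() maps a database name and internal node id back to a node; the function name is just illustrative):

declare function local:resolve-noderef($noderef as element(noderef)) as node()? {
  (: Look up the node by its database name and pre-assigned node id :)
  db:open-id(string($noderef/@database), xs:integer($noderef/@node-id))
};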
By using a single document per index I can use the “construct the index document and then either create the database or replace the existing document in one go” model shown in the custom index example. Otherwise, as far as I can determine, one has to ensure that the database that holds the index already exists, since you can’t create an index database and then separately add to it within a single query. Alternatively, I could construct a very large sequence of individual document nodes and add those to the index database as it’s created; I suspect it comes to the same thing, but I haven’t tried it.
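The create-or-replace pattern I mean looks roughly like this (a sketch; the database name “where-used-index”, the path “where-used.xml”, and the variable $index-doc are placeholders, and db:replace() is the BaseX 9.x function):

(: Replace the index document if the database exists, create it otherwise :)
if (db:exists('where-used-index'))
then db:replace('where-used-index', 'where-used.xml', $index-doc)
else db:create('where-used-index', $index-doc, 'where-used.xml')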
Using the where-used index to calculate the topic-to-bundle index, it takes about 50 ms per topic or map to determine the bundle (on my laptop), which is still 10x slower than I’d like but certainly tolerable (at 50 ms per topic it takes about 7.8 minutes to process 9,400 topics). I’d like to know if there are things I can do to reduce this time, but I can take that up later; the current result is more than good enough for my immediate purposes (which is to report data about the topics grouped by bundle, thus the need for the topic-to-bundle index).
From the topic-to-bundle index I can generate a JSON representation almost instantly, by generating the JSON XML representation and then serializing it (this JSON is then consumed by an XSLT running elsewhere, at least for now).
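The approach is roughly this (a sketch using the W3C JSON XML vocabulary and fn:xml-to-json(); the entry shown is made up):

let $json-xml :=
  <map xmlns="http://www.w3.org/2005/xpath-functions">
    <string key="some-topic.dita">some-bundle</string>
  </map>
(: Serialize the XML representation to a JSON string :)
return fn:xml-to-json($json-xml)

That query returns {"some-topic.dita":"some-bundle"}.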
What I haven’t done yet is implement updating these indexes to reflect file changes from git repo updates. That should be a relatively simple application of XQuery Update, but I’m not sure what the performance implications are of modifying individual nodes within a single document, as opposed to replacing entire documents (i.e., if I made each index entry a separate document).
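The single-entry update I have in mind would be something like this sketch (assuming $changed-key and $new-entry are computed from the git change set; the database name is a placeholder):

declare variable $changed-key as xs:string external;
declare variable $new-entry as element(where-used-entry) external;

(: Swap out the stale entry for the recomputed one :)
for $entry in db:open('where-used-index')//where-used-entry[@key = $changed-key]
return replace node $entry with $new-entry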
I also determined that constructing an XQuery map from the index data is very slow (clearly a “don’t do that” kind of thing), while constructing a JSON XML representation of the index is very fast. Not a surprising result, but worth confirming.
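For clarity, the slow pattern is building one big map over all the entries, along these lines (a sketch; the database name is a placeholder):

(: Build a single map keyed by topic URI; slow at this scale :)
map:merge(
  for $entry in db:open('where-used-index')//where-used-entry
  return map:entry(string($entry/@key), $entry)
)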
I’ll be refining this system as I hammer it into a server-based web application for reporting information about our entire corpus of topics as they change over time.
Cheers,
E.
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
From: Christian Grün <christian.gruen@gmail.com>
Date: Monday, January 24, 2022 at 6:57 AM
To: Eliot Kimber <eliot.kimber@servicenow.com>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Strategy for Persisting Maps that Contain Nodes: db:node-id()
> My approach is to create a separate element for each index entry, rather than creating a single element that then contains all the index entries as shown in the index construction example in the docs.
You mean you don’t group the nodes by the index key, as shown in the
docs? That should be fine as well. If the entries are grouped, a
single element may get larger, but the overall number of nodes to be
added or replaced will be smaller. If single entries need to be
updated in your scenario (e.g. because the key changes), grouping
might not be the solution, though.
There are usually various solutions for achieving the same goal. The
presented example is fairly simple indeed (most of our index
structures in real-world applications are certainly more complex). I
guess that 16 GB should be more than sufficient for a 70 MB index
database, but feel free to share your experiences.