In the context of my 40K-topic DITA corpus, I’m trying to build a “where used” report that finds, for each topic, the other topics that directly refer to it. I can do this by looking for the target topic’s filename in the values of @href attributes in other topics (taking advantage of a local rule that all topic filenames should be unique).
My current naive approach is simply:
$topics//*[tokenize(@href, '/') = $filename]
Where $topics is the 40K topics.
Based on profiling, tokenize() is slightly faster than either matches() or contains(), but all forms take about 0.5 seconds per target topic, which is far too slow to be practical.
So I’m trying to work out what my performance optimization strategies are in BaseX.
In MarkLogic I would set up an index so I could do fast lookups of tokens in @href values or something similar (it’s been long enough since I had to optimize MarkLogic queries that I don’t remember the details, but MarkLogic essentially has indexes for everything).
I know I could do a one-time construction of the where-used table and then use it for quick lookups in subsequent queries, but I’m trying to find a solution better suited to my current “create a new database with the latest files from git and run some queries quickly to get a report” mode.
I suspect that full-text indexing may be a solution here, but I’m wondering what other performance optimization options I have for this kind of lookup.
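(For reference, a sketch of what index-assisted lookup might look like in BaseX — assuming a database named 'topics' and BaseX 9.x function names. As I understand the docs, the token index only covers whitespace-delimited tokens, so it would not directly accelerate tokenize(@href, '/'); the attribute index gives exact-match lookup only:)

```xquery
(: Sketch only, assuming a database named 'topics' (BaseX 9.x).
   Build the database with the token index enabled:
     SET TOKENINDEX true
     CREATE DB topics /path/to/dita/sources
   The token index accelerates whitespace-token predicates such as
     //*[contains-token(@class, 'hidden')]
   but not '/'-delimited tokenization of @href values. :)

(: Exact-match lookup via the attribute index; this finds only @href
   values that are exactly the bare filename, so it misses hrefs that
   include directory paths or fragment identifiers: :)
db:attribute('topics', $filename, 'href')
```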
Thinking about it now, I definitely need to see whether building the where-used table would actually be slower. That is: find every @href, resolve it, and construct a map from each topic to the href elements that point to it. Hmm.
Anyway, any guidance on this challenge would be appreciated.
Cheers,
Eliot
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368 | M: 512 554 9368
servicenow.com (https://www.servicenow.com)
LinkedIn (https://www.linkedin.com/company/servicenow) | Twitter (https://twitter.com/servicenow) | YouTube (https://www.youtube.com/user/servicenowinc) | Facebook (https://www.facebook.com/servicenow)
On Fri, 2022-01-14 at 15:41 +0000, Eliot Kimber wrote:
$topics//*[tokenize(@href, '/') = $filename]
Is this really ends-with(@href, $filename)?
It can’t be ends-with() because there might be a fragment identifier in the @href value.
Cheers,
E.
From: Liam R. E. Quin <liam@fromoldbooks.org>
Date: Friday, January 14, 2022 at 10:14 AM
To: Eliot Kimber <eliot.kimber@servicenow.com>, basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Where-Used: Performance Improvement Strategies?
On Fri, 2022-01-14 at 15:41 +0000, Eliot Kimber wrote:
$topics//*[tokenize(@href, '/') = $filename]
Is this really ends-with(@href, $filename)?
--
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org
But it can be ends-with(@href,concat('/',$filename))
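(Sketching the combination of those two points — strip any fragment identifier first, then compare the trailing path component. Untested, and it still misses hrefs that are a bare filename with no leading slash:)

```xquery
(: Append '#' so substring-before() also works for @href values that
   have no fragment identifier, then compare the tail of the path. :)
$topics//*[ends-with(substring-before(@href || '#', '#'),
                     '/' || $filename)]
```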
Although the slash should probably be the system file separator if we actually mean a file path rather than a URI.
Hi Eliot,
in similar cases, I've learned that building temporary maps is really fast.
So, instead of doing the retrieval and filtering in one step, I just construct a map with a convenient key.
In the example, I want a list of categories for articles that could exist in multiple sections (of a web site).
In a later step, I will just consult the map for the categories.
let $category-map := map:merge(
  for $a in $all-sections//ProductItem
  let $guid := $a/@Guid
  group by $guid
  return map:entry($guid,
    <categories>{
      let $cats :=
        for $s in $a/parent::*/parent::Section
        return $s/ShopCategoryId/text()
      for $cat in distinct-values($cats)
      return <_><id>{$cat}</id></_>
    }</categories>
  )
)
Best, Max
On Fri, Jan 14, 2022 at 16:41, Eliot Kimber eliot.kimber@servicenow.com wrote:
Maximilian,
That’s exactly the solution I arrived at:
To create a where-used table over all the topics in my corpus, I process each DITA map or topic that refers to a topic. For each reference I construct a map entry keyed on the URI of the target document; the value is a map with an entry for the target topic and an entry for the list of references to it. (DITA maps are simply collections of links to topics, other DITA maps, or non-DITA resources, while topics may have cross references (xref) or content references (conref) to other topics.)
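A rough sketch of that first pass (local:resolve-ref() is a made-up placeholder here; my real code does proper DITA reference resolution):

```xquery
(: Sketch of the first pass: one map entry per reference, keyed on the
   resolved target URI. Merging with 'duplicates': 'combine' accumulates
   all entries for the same target into a sequence of small maps.
   local:resolve-ref() is a hypothetical placeholder. :)
let $baseMap := map:merge(
  for $ref in $topics//*[@href]
  let $targetUri := local:resolve-ref($ref)
  return map:entry($targetUri, map { 'refs' : $ref }),
  map { 'duplicates' : 'combine' }
)
return $baseMap
```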
That results in a map where each entry’s value is a sequence of maps, each with one item in its “refs” field. I then iterate over the where-used map to replace each entry’s sequence of maps with a single map that has one entry per kind of reference:
let $pointsToMeMap := map:merge(
  for $key in map:keys($baseMap)
  let $entry := $baseMap($key)
  let $newEntry := map {
    'topic'     : $entry?topic,
    'topicrefs' : $entry?topicrefs,
    'xrefs'     : $entry?xrefs,
    'conrefs'   : $entry?conrefs
  }
  return map { $key : $newEntry }
)
That process takes about 45 seconds for my corpus, which is pretty good (each of the three kinds of reference takes about 15 seconds to collect).
For the references I do a proper resolution of each reference to its target document, so the result is 100% correct, unlike my earlier approach, which depended on filenames being unique. (That is supposed to be true in my corpus per our local policy, but it definitely isn’t, and it is not a DITA requirement.)
Lookup in this where-used map should obviously be very fast (I haven’t yet had a chance to measure using the map for things like constructing the link graph that extends from some starting topic).
My next challenge is how best to persist this map in my database so I can run multiple ad-hoc queries from the BaseX GUI without having to rebuild the table.
Obviously doing this through a more persistent application would be the normal solution, but for now I’m just doing ad-hoc queries and don’t have the scope to build a more complete application.
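(One low-tech sketch, assuming BaseX 9.x and invented element/attribute names: flatten the map to an XML document, store it in the database with db:replace, and query that document in later sessions:)

```xquery
(: Sketch: persist the where-used data as XML so later ad-hoc queries
   can read it back without rebuilding the map. Element and attribute
   names here are made up; assumes BaseX 9.x (db:replace; newer
   versions use db:put). :)
let $doc :=
  <where-used>{
    for $uri in map:keys($pointsToMeMap)
    let $entry := $pointsToMeMap($uri)
    return
      <target uri="{$uri}">{
        for $kind in ('topicrefs', 'xrefs', 'conrefs')
        for $ref in $entry($kind)
        return <ref kind="{$kind}" source="{base-uri($ref)}"/>
      }</target>
  }</where-used>
return db:replace('topics', 'where-used.xml', $doc)
```

Later sessions could then query something like doc('topics/where-used.xml')//target[@uri = $targetUri]/ref without rebuilding anything.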
Cheers,
E.
From: Maximilian Gärber <mgaerber@arcor.de> (via BaseX-Talk)
Date: Tuesday, January 18, 2022 at 6:31 AM
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Where-Used: Performance Improvement Strategies?