In the context of our Project Mirabel system that manages DITA content, I need to be able to answer the question “for topic X, what other topics link to it directly or indirectly?”
That is, say Topic A links to Topic B that links to Topic C.
Asking the question “What topics ultimately link to topic C?” I would like to get the answer “Topic A, Topic B”.
Getting the answer for direct references is easy—I already build a where-used index that captures, for each DITA map or topic, what other maps and topics link directly to it.
But to get the Topic A part of the answer I need some kind of link graph index and I’m not sure how best to go about calculating this or capturing it in some index or set of indexes.
In our content, the fan-out from a single Topic C to the set of topics that ultimately reference it could be tens of thousands of topics. We have about 45K topics in the content for each version of the ServiceNow Platform, and a number of those topics are used by a large number of other topics, so the explosion can be quite large. That suggests that a simple index from each topic to all the topics that ultimately reference it would be very inefficient, in that the entry for any given topic could potentially have 45K – 1 entries (we don’t care that a topic references itself).
On the other hand, working backwards through chains of direct references can also be expensive and is probably too slow, so maybe the brute-force index is the best option?
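As a rough illustration of the "working backwards" option, here is a minimal Python sketch of a breadth-first walk over a direct where-used index. The dictionary shape and the topic IDs are assumptions made for the example, not the actual Project Mirabel index format.

```python
from collections import deque


def transitive_referrers(where_used, target):
    """Return every topic that links to `target` directly or indirectly.

    `where_used` is assumed to map a topic ID to the set of topic IDs
    that link directly to it (a hypothetical shape for the where-used index).
    """
    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for referrer in where_used.get(current, ()):
            # Skip the target itself and anything already visited, so
            # cycles in the link graph cannot cause an endless loop.
            if referrer != target and referrer not in seen:
                seen.add(referrer)
                queue.append(referrer)
    return seen


# Hypothetical data: A links to B, B links to C, and C links back to A.
where_used = {
    "topic-c": {"topic-b"},
    "topic-b": {"topic-a"},
    "topic-a": {"topic-c"},
}
print(sorted(transitive_referrers(where_used, "topic-c")))
```

Each topic and each direct link reachable from the target is visited at most once, so the work grows with the size of the subgraph that leads to the target rather than with the number of distinct paths through it. Whether that is fast enough against a real where-used index would need measuring.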
At the same time, I would like to be able to quickly visualize the link graph extending from or ending in any given topic or simply the link graph for the entire information set, which requires capturing the nodes and edges.
My question: does anyone either have experience or insight into this kind of link graph challenge or know of relevant papers or general discussion of graph processing I might look at?
Thanks,
Eliot
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
servicenow.com https://www.servicenow.com
LinkedIn https://www.linkedin.com/company/servicenow | Twitter https://twitter.com/servicenow | YouTube https://www.youtube.com/user/servicenowinc | Facebook https://www.facebook.com/servicenow
Hi Eliot -
I've wondered (but never tested or explored) about leveraging some form of JSON-LD [1] (or serialized Turtle, or something similar) and passing those values to Apache Jena [2] (or another SPARQL processor) to use as an inference engine. I'm deep in Speculation Territory here - I don't know what anything would look like - but you're describing an interesting problem, and it seems doable. Martynas Jusevicius (and his colleagues) have a project, LinkedDataHub [3], that may provide another avenue for exploring this -- I haven't used AtomGraph's applications, but he's active on the xml.com Slack, and it looks like there are some interesting visualization capabilities with their work.
Our listserv friend and neighbor, Tim Thompson of Yale, may have some ideas along these lines, too. Sorry that I can't provide anything concrete, but I hope some of this is somewhat helpful.
Best,
Bridger
[1] https://json-ld.org/
[2] https://jena.apache.org/
[3] https://github.com/AtomGraph/LinkedDataHub
On Thu, Jun 23, 2022 at 10:35 AM Eliot Kimber eliot.kimber@servicenow.com wrote:
[...]
Thanks, Bridger. I agree this seems like a use case for graph technologies (RDF/SPARQL or labeled property graphs). SPARQL 1.1 includes property paths, which make it possible to query on transitive properties (e.g., A contains B, B contains C). One example from Wikidata: https://twitter.com/andre_ourednik/status/1427264453217763336.
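To make the property-path idea concrete, here is a small Python sketch that serializes direct links as N-Triples and builds a SPARQL 1.1 query using the one-or-more path operator (`+`). The `example.org` IRIs, the `linksTo` predicate, and the function names are placeholders invented for this example, and topic IDs are assumed to be safe to embed in an IRI as-is.

```python
# Hypothetical IRIs, chosen only for illustration.
LINKS_TO = "http://example.org/mirabel/linksTo"
BASE = "http://example.org/mirabel/topic/"


def to_ntriples(edges):
    """Serialize (source, target) link pairs as N-Triples, one triple per line."""
    return "\n".join(
        f"<{BASE}{src}> <{LINKS_TO}> <{BASE}{dst}> ."
        for src, dst in edges
    )


def referrers_query(topic_id):
    """Build a SPARQL 1.1 query for everything that reaches `topic_id`.

    The `+` after the predicate is the property-path operator meaning
    "one or more linksTo steps".
    """
    return (
        "SELECT DISTINCT ?source WHERE {\n"
        f"  ?source <{LINKS_TO}>+ <{BASE}{topic_id}> .\n"
        "}"
    )


edges = [("topic-a", "topic-b"), ("topic-b", "topic-c")]
print(to_ntriples(edges))
print(referrers_query("topic-c"))
```

The triples could be loaded into any SPARQL 1.1 store and the query run there. How well such a path query performs on a graph of 45K topics would depend on the store and has not been tested here.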
There are also document-based representations, as Bridger mentions: JSON-LD, RDF/XML, and TriX are supported by RDF tools; there's also GraphML and GEXF, supported by the Gephi platform[1] for network visualization and also Python tools like NetworkX[2].
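As a sketch of what a document-based export might look like, the following Python writes a list of direct links as a minimal GraphML document using only the standard library. The edge list and the `to_graphml` name are assumptions for the example; a real export would probably also carry topic titles or paths as GraphML data attributes.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(edges):
    """Return a minimal GraphML string for a list of (source, target) links."""
    # Serialize GraphML elements in the default namespace (no prefix).
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(f"{{{GRAPHML_NS}}}graphml")
    graph = ET.SubElement(
        root, f"{{{GRAPHML_NS}}}graph", id="links", edgedefault="directed"
    )
    # Every topic that appears at either end of a link becomes one node.
    for node in sorted({n for edge in edges for n in edge}):
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}node", id=node)
    for src, dst in edges:
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}edge", source=src, target=dst)
    return ET.tostring(root, encoding="unicode")


# Hypothetical links: A links to B, B links to C.
print(to_graphml([("topic-a", "topic-b"), ("topic-b", "topic-c")]))
```

Saved to a `.graphml` file, output of this shape should be readable by tools that accept GraphML, such as Gephi or NetworkX.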
Would be interesting to test how a recursive approach using db:attribute, etc., over a link index would scale in BaseX.
Tim
[1] https://gephi.org/ [2] https://networkx.org/
And there are graph databases not based on RDF: Neo4j, TinkerPop, and others.
For gathering metrics from a DITA CCMS, I used Blazegraph and queried it using SPARQL. You can express just that sort of thing: what links to this.
About that last paragraph: I've also wondered how representing mostly hierarchical data scales in a graph database.
Kendall
On Thursday, June 23, 2022 1:02:45 PM (-07:00), Tim Thompson wrote:
[...]
-- Tim A. Thompson (he, him) Librarian for Applied Metadata Research Yale University Library
On Thu, Jun 23, 2022 at 1:09 PM Bridger Dyson-Smith bdysonsmith@gmail.com wrote:
[...]
On Thu, Jun 23, 2022 at 10:35 AM Eliot Kimber eliot.kimber@servicenow.com wrote:
[...]
basex-talk@mailman.uni-konstanz.de