Link Graph Construction: Anything I Can Crib or Learn From?

Eliot Kimber

23 Jun 2022

10:35 a.m.

In the context of our Project Mirabel system that manages DITA content, I need to be able to answer the question “for topic X, what other topics link to it directly or indirectly?”

That is, say Topic A links to Topic B, which links to Topic C.

Asking the question “What topics ultimately link to topic C?” I would like to get the answer “Topic A, Topic B”.

Getting the answer for direct references is easy—I already build a where-used index that captures, for each DITA map or topic, what other maps and topics link directly to it.

But to get the Topic A part of the answer I need some kind of link graph index and I’m not sure how best to go about calculating this or capturing it in some index or set of indexes.

In our content, the fan-out from a single Topic C to the set of topics that ultimately reference it could be tens of thousands of topics. We have about 45K topics in the content for each version of the ServiceNow Platform, and a number of topics are used by a large number of other topics, so the explosion can be quite large. That suggests that a simple index from each topic to all the topics that ultimately reference it would be very inefficient, in that the entry for any given topic could potentially have 45K – 1 entries (we don’t care that a topic references itself).
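To put a rough number on that worry, here is a back-of-the-envelope sketch of the worst case. The 45,000 figure is the approximate topic count given above; the 8 bytes per entry is purely an assumption for illustration (one 64-bit node id per entry), not a measurement of any real index.

```python
def worst_case_entries(topic_count: int) -> int:
    """Upper bound on entries in a topic -> all-ultimate-referrers index.

    Each topic can list at most every other topic (self-references excluded).
    """
    return topic_count * (topic_count - 1)


TOPICS = 45_000      # approximate figure from the post
BYTES_PER_ENTRY = 8  # assumption: one 64-bit node id per entry

entries = worst_case_entries(TOPICS)
print(f"{entries:,} entries")                       # 2,024,955,000 entries
print(f"{entries * BYTES_PER_ENTRY / 1e9:.1f} GB")  # 16.2 GB
```

This is only the theoretical ceiling; a real link graph is likely to be far sparser, so the actual index would need to be measured.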

On the other hand, working backwards through chains of direct references can also be expensive and is probably too slow, so maybe the brute-force index is the best option?
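For comparison, walking backwards through the direct references can be sketched as a breadth-first traversal over the where-used index. The dictionary shape and the function name below are hypothetical stand-ins, not the actual Project Mirabel index. Because of the `seen` set, each topic is expanded at most once, so the work is bounded by the number of topics and links that can reach the target; whether that is fast enough inside BaseX would still need to be measured.

```python
from collections import deque


def ultimate_referrers(where_used: dict[str, set[str]], target: str) -> set[str]:
    """Return every topic that links to `target` directly or indirectly.

    `where_used` maps a topic id to the ids of topics that link directly to it.
    """
    seen: set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for referrer in where_used.get(current, ()):
            if referrer not in seen:
                seen.add(referrer)
                queue.append(referrer)
    seen.discard(target)  # a cycle may lead back to the target; ignore it
    return seen


# Topic A links to Topic B, which links to Topic C.
where_used = {"C": {"B"}, "B": {"A"}}
print(sorted(ultimate_referrers(where_used, "C")))  # ['A', 'B']
```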

At the same time, I would like to be able to quickly visualize the link graph extending from or ending in any given topic or simply the link graph for the entire information set, which requires capturing the nodes and edges.

My question: does anyone either have experience or insight into this kind of link graph challenge or know of relevant papers or general discussion of graph processing I might look at?

Thanks,

Eliot

_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
servicenow.com https://www.servicenow.com
LinkedIn https://www.linkedin.com/company/servicenow | Twitter https://twitter.com/servicenow | YouTube https://www.youtube.com/user/servicenowinc | Facebook https://www.facebook.com/servicenow


Bridger Dyson-Smith

23 Jun

1:08 p.m.

Hi Eliot -

I've wondered (but never tested/explored) about leveraging some semblance of JSON-LD [1] (or serialized Turtle, or something similar) and passing those values to Apache Jena [2] (or another SPARQL processor) to use that as an inference engine. I'm deep in Speculation Territory here - I don't know what anything would look like - but you're describing an interesting problem, and it seems doable. Martynas Jusevicius (and his colleagues) have a project, LinkedDataHub [3], that may provide another avenue for exploring this -- I haven't used AtomGraph's applications, but he's active on the xml.com Slack, and it looks like there are some interesting visualization capabilities with their work.

Our listserv friend and neighbor, Tim Thompson of Yale, may have some ideas along these lines, too. Sorry that I can't provide anything concrete, but I hope some of this is somewhat helpful.

Best,
Bridger

[1] https://json-ld.org/
[2] https://jena.apache.org/
[3] https://github.com/AtomGraph/LinkedDataHub

Tim Thompson

4:02 p.m.

Thanks, Bridger. I agree this seems like a use case for graph technologies (RDF/SPARQL or labeled property graphs). SPARQL 1.1 includes property paths, which make it possible to query on transitive properties (e.g., A contains B, B contains C). One example from Wikidata: https://twitter.com/andre_ourednik/status/1427264453217763336.
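As a rough illustration of the property-path idea, the sketch below builds a SPARQL 1.1 query that uses the `+` (one or more) path operator to find everything that reaches a target through a chain of link edges. The helper name and both IRIs are made up for the example; how the links would actually be modeled as RDF is an open assumption, and the helper does no validation of the IRIs it is given.

```python
def referrers_query(target_iri: str, links_to_iri: str) -> str:
    """Build a SPARQL 1.1 query for everything that reaches `target_iri`
    through one or more `links_to_iri` edges (the `+` path operator)."""
    return (
        "SELECT DISTINCT ?source WHERE {\n"
        f"  ?source <{links_to_iri}>+ <{target_iri}> .\n"
        "}"
    )


# Both IRIs are hypothetical, chosen only for illustration.
print(referrers_query("http://example.org/topic/C",
                      "http://example.org/vocab/linksTo"))
```

The resulting query text could then be handed to whichever SPARQL processor is in use.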

There are also document-based representations, as Bridger mentions: JSON-LD, RDF/XML, and TriX are supported by RDF tools; there's also GraphML and GEXF, supported by the Gephi platform[1] for network visualization and also Python tools like NetworkX[2].
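A minimal sketch of the document-based route, assuming the link index can be dumped as (source, target) pairs: the function below writes those pairs as a directed GraphML document using only the Python standard library. The function name and the graph id are hypothetical; the output should be loadable by GraphML-aware tools, though that has not been verified against any particular one here.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(edges: list[tuple[str, str]]) -> str:
    """Serialize (source, target) link pairs as a directed GraphML graph."""
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(f"{{{GRAPHML_NS}}}graphml")
    graph = ET.SubElement(
        root, f"{{{GRAPHML_NS}}}graph", id="links", edgedefault="directed"
    )
    # Emit each node once, in first-seen order, ahead of all edges.
    nodes = dict.fromkeys(topic for edge in edges for topic in edge)
    for topic in nodes:
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}node", id=topic)
    for source, target in edges:
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}edge", source=source, target=target)
    return ET.tostring(root, encoding="unicode")


# Topic A links to Topic B, which links to Topic C.
print(to_graphml([("A", "B"), ("B", "C")]))
```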

Would be interesting to test how a recursive approach using db:attribute, etc., over a link index would scale in BaseX.

Tim

[1] https://gephi.org/ [2] https://networkx.org/

--
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research
Yale University Library

Kendall Shaw

26 Jun

2:03 a.m.

And there are graph databases not based on RDF: Neo4j, TinkerPop, and others.

For gathering metrics from a DITA CCMS, I used Blazegraph and queried it using SPARQL. You can express just that sort of thing: what links to this.

About that last paragraph: I've also wondered how representing mostly hierarchical data scales in a graph database.

Kendall

On Thursday, June 23, 2022 1:02:45 PM (-07:00), Tim Thompson wrote:

Would be interesting to test how a recursive approach using db:attribute, etc., over a link index would scale in BaseX.

basex-talk@mailman.uni-konstanz.de