Applying Graph Algorithms on Knowledge Graphs:

Finding Influencers in a Real-World Social Network

As an engineer at Cambridge Semantics over the past seven years, I’ve had numerous opportunities with my colleagues to implement graph analytics on enterprise-scale knowledge graphs.

Introduction

Finding the most influential members in a network continues to be an important need, and a variety of graph algorithms can identify them. The classical method is to define a graph structure, essentially a Labeled Property Graph (LPG) schema, and conform the data to that model before executing an algorithm. At Cambridge Semantics, with our product Anzo, we take a different approach, detailed below. Anzo creates an in-memory Resource Description Framework (RDF) knowledge graph based on the original data sources. Users can then create in-memory “derived graphs” on which to run algorithms.

Overview — A Real World Social Network

Consider an RDF knowledge graph based on the Linked Data Benchmark Council’s (LDBC) benchmark, which describes a social network.

Fig 1: Social network in a real-world model.
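To make the model concrete, here is a minimal Turtle sketch of the kind of triples such a knowledge graph might contain. The prefix and class names are illustrative assumptions; the :hasCreator and :replyOf properties mirror the ones used in the query later in this post.

@prefix : <http://example.org/ldbc/> .

# Two people, a post by one, and a reply by the other (illustrative data).
:person1  a :Person ;  :firstName "Alice" .
:person2  a :Person ;  :firstName "Bob" .
:message1 a :Post ;    :hasCreator :person1 .
:comment1 a :Comment ; :hasCreator :person2 ;
          :replyOf    :message1 .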

Note that the model does not indicate the strength of relationships between individuals, or even which people are friends. That is because RDF-based knowledge graphs represent knowledge by linking information originating from siloed data sources. Crucially, they represent data as it actually exists, not in an idealized structure custom-built for a specific analysis. This is a key difference from LPG databases, where experts define a specific graph structure and create custom rules to shape incoming data to that structure.

Graph algorithms can still be run against an RDF graph to determine who the most influential people in this model are; it just takes a couple of simple preliminary steps. In this example, we use Betweenness Centrality to detect key influencers in the social network.

Betweenness Centrality — What is it?

Betweenness centrality measures how often a node lies on the shortest paths between other pairs of nodes. A node with high betweenness centrality acts as a bridge: information flowing through the network tends to pass through it, which makes the metric a natural fit for identifying influencers in a social network.

Fig 2: Example network graph. The red nodes have the highest betweenness centrality.
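For reference, the standard definition (covered in source 1 below) is, in LaTeX notation:

C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}

where \sigma_{st} is the number of shortest paths from node s to node t, and \sigma_{st}(v) is the number of those paths that pass through v. In a weighted variant, path length can take edge weights into account, typically by treating stronger connections as shorter distances.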

The Real World Mechanics

Betweenness centrality operates on weighted edges between people, and our knowledge graph does not yet contain any. The first step is therefore to define that edge. Using RDF-star, we can state that one person is connected to another and annotate the statement itself with a weight:

<< <person1> :isConnectedTo <person2> >> :weight <value> .
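For instance, with two hypothetical individuals and a computed weight filled in, a single derived edge might look like this:

<< :person1 :isConnectedTo :person2 >> :weight 17 .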

Next, we need a heuristic that instantiates this relationship between individuals in our graph. Studying the model, one logical pattern indicating such a relationship is one individual posting a message and another individual replying to it. We can use the number of messages (posts and comments/replies) exchanged between two individuals to compute a weight for this edge: the more messages they exchange, the stronger the connection. As depicted below, we apply a SPARQL query to the original knowledge graph, and the inserted statements produce a derived graph specifically designed to answer our original question with the chosen algorithm. All that’s left is to actually execute the algorithm, by far the easiest part.

INSERT {
  << ?person1 :isConnectedTo ?person2 >> :weight ?numMessages .
}
WHERE {
  {
    SELECT ?person1 ?person2 (COUNT(?comment) AS ?numMessages)
    WHERE {
      # ?person1 creates a message; ?person2 replies to it.
      ?message :hasCreator ?person1 .
      ?comment :replyOf    ?message ;
               :hasCreator ?person2 .
      # Ignore people replying to their own messages.
      FILTER(?person1 != ?person2)
    }
    GROUP BY ?person1 ?person2
  }
}
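Once the derived layer exists, a quick sanity check is to read the strongest connections back out of it. A minimal sketch, assuming the same namespace as the queries above:

SELECT ?person1 ?person2 ?weight
WHERE {
  << ?person1 :isConnectedTo ?person2 >> :weight ?weight .
}
ORDER BY DESC(?weight)
LIMIT 10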

How does this scale?

Our method of using data layers brings two major benefits to executing graph algorithms. First, in the real world, users will want to rapidly test and apply different combinations of logic to create derived graphs and then execute algorithms. For example, the query above could be modified to take advantage of additional information the knowledge graph contains, such as where people work or went to school (a sketch of one such variation follows below). Data layers make testing heuristics and algorithms iterative: to try a change, all that is necessary is an in-memory refresh of the layer, which happens in moments. Second, the ontology and data structures do not need to be modified for each use case. If, in addition to finding influencers, the desired output were clusters of individuals based on forum activity, that analysis could be prototyped in a new layer, independent of the first. This approach lets users rapidly develop new use cases once the baseline knowledge graph has been established.
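As a hypothetical illustration of such a variation, the layer below boosts the weight of connections between people who also work at the same organisation. The :worksAt property and the :adjustedWeight annotation are assumptions for this sketch, not part of the model shown earlier, and the fixed bonus of 10 is an arbitrary choice.

INSERT {
  << ?person1 :isConnectedTo ?person2 >> :adjustedWeight ?boosted .
}
WHERE {
  # Start from the message-based weight computed in the first layer.
  << ?person1 :isConnectedTo ?person2 >> :weight ?w .
  # Hypothetical pattern: both individuals work at the same organisation.
  ?person1 :worksAt ?org .
  ?person2 :worksAt ?org .
  BIND(?w + 10 AS ?boosted)
}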

I’ve described one way that a specific graph algorithm can be applied to a sample RDF knowledge graph. Our team at Cambridge Semantics has decades of experience developing and applying knowledge graphs in real-world scenarios. If your interest is piqued and you’d like to have a conversation, please message me at greg@cambridgesemantics.com or visit our website: www.cambridgesemantics.com.

Sources:

  1. If you are interested in the technical details of different centrality algorithms, see this slide deck: https://ocw.mit.edu/courses/civil-and-environmental-engineering/1-022-introduction-to-network-models-fall-2018/lecture-notes/MIT1_022F18_lec4.pdf
  2. Kieran Healy has a very technical blog post discussing how centrality algorithms would have identified Paul Revere as a person of interest in colonial Boston: https://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/
  3. Figure 2 is an excellent visual of betweenness centrality, again from a highly technical source: https://cs.hse.ru/data/2015/05/14/1098547089/4._Centrality_Metrics.pdf

Principal Presales Engineer, Cambridge Semantics Inc. | Graph Technology Expert | Intrepid Adventurer