Applying Graph Algorithms on Knowledge Graphs:

Finding Influencers in a Real-World Social Network

As an engineer at Cambridge Semantics over the past 7 years, I’ve had numerous opportunities with my colleagues to implement graph analytics on an enterprise scale knowledge graph.

Introduction

Growing up outside of Boston, I was always fascinated by the rich history of the area. One of the most famous examples was the ride of Paul Revere, who at the onset of the American Revolution warned colonists of the impending British Army’s march to seize the militia’s gunpowder and arms in outlying towns. His success carrying the message was due to the fact that he was one of the most socially connected people in Boston.

Fig 1: Social network in a real-world model.

Betweenness Centrality — What is it?

Betweenness Centrality measures nodes in a network that hold the most information transmission power. Most social networks will break down into clusters, with key leaders at the focal points. A small subset of individuals will be members of several clusters. These individuals are said to have high betweenness centrality, as they can quickly transmit key information to leaders of several groups. Paul Revere had a high degree of betweenness centrality in colonial Boston, as he had personal connections to leaders of major revolutionary groups.

Fig 2: Example network graph. The red nodes have the highest betweenness centrality.

The Real World Mechanics

To run Betweenness Centrality, a couple of preparatory steps are necessary in which we generate a derived graph from our real-world graph model. First, we need to extend the model to establish links between people that the algorithm will use. From an ontological perspective, we need to add a new predicate to the model. We can even leverage RDF* to give this connection a weight that indicates the strength of that relationship.

<<<person1> :isConnectedTo <person2>>> :weight <value>.
INSERT {
<<?person1 :isConnectedTo ?person2>> :weight ?numMessages.
}
WHERE {
{
SELECT ?person ?person2 (COUNT(?message) AS ?numMessages)
WHERE {
?message :hasCreator ?person1.
?comment :replyOf ?message.
?message :hasCreator ?person2.
}
GROUP BY ?person ?person2
}
}

How does this scale?

The benefit of our approach is only realized if it can be applied to very large knowledge graphs comprising many sources. Fortunately, Anzo’s massively parallel processing (MPP) architecture allows users to combine data coming from numerous siloed sources to build and process knowledge graphs at unparalleled scale. A key feature of Anzo is that each data source is activated in the RDF graph engine as an in-memory data layer, on top of which additional layers can be added that logically connect, extend, and transform the knowledge graph. These additional layers are often implemented as a set SPARQL queries that describe the instructions that perform in-memory transformations and create, as in our example, derived graphs.

Sources:

  1. The original inspiration for the intro is Malcolm Gladwell’s book Tipping Point. For a little more background, see this blog: https://medium.com/better-marketing/paul-revere-a-brilliant-genius-of-viral-marketing-8ac62edc459d
  2. If you are interested on the technical details for different centrality algorithms, see this slide deck: https://ocw.mit.edu/courses/civil-and-environmental-engineering/1-022-introduction-to-network-models-fall-2018/lecture-notes/MIT1_022F18_lec4.pdf
  3. Kieran Healy has this very technical blog discussing how centrality algorithms would have identified Paul Revere as a person of interest in colonial Boston: https://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/
  4. Figure two is an excellent visual of betweenness centrality. Again, from a highly technical source: https://cs.hse.ru/data/2015/05/14/1098547089/4._Centrality_Metrics.pdf

Principal Presales Engineer, Cambridge Semantics Inc. | Graph Technology Expert | Intrepid Adventurer