Applying Graph Algorithms on Knowledge Graphs:
As an engineer at Cambridge Semantics over the past 7 years, I’ve had numerous opportunities with my colleagues to implement graph analytics on an enterprise-scale knowledge graph.
Growing up outside of Boston, I was always fascinated by the rich history of the area. One of the most famous examples was the ride of Paul Revere, who at the onset of the American Revolution warned colonists of the impending British Army’s march to seize the militia’s gunpowder and arms in outlying towns. His success carrying the message was due to the fact that he was one of the most socially connected people in Boston.
Finding the most influential members in a network continues to be an important need, and a variety of graph algorithms can identify them. The classical method is to define a graph structure up front, essentially a Labeled Property Graph (LPG) schema, and conform the data to that LPG model before the algorithm runs. At Cambridge Semantics, with our product Anzo, we take a novel approach that is detailed below. Anzo creates an in-memory Resource Description Framework (RDF) knowledge graph based on the original data sources. Users can then create in-memory “derived graphs” on which to run algorithms.
Overview — A Real World Social Network
Consider an RDF knowledge graph based on the Linked Data Benchmark Council’s (LDBC) benchmark, which describes a social network.
Note that it does not indicate the strength of relationships between individuals or even which people are friends. That is because RDF-based knowledge graphs represent knowledge by linking information originating from siloed data sources. Crucially, they represent data as it actually exists, not in an idealized structure, custom-built for a specific analysis. This is a key difference from LPG databases, where experts define a specific graph structure and create custom rules to shape incoming data to that structure.
With a couple of simple preliminary steps, we can run graph algorithms on this RDF model to determine who the most influential people are. In this example, we use Betweenness Centrality to detect key influencers in a social network.
Betweenness Centrality — What is it?
Betweenness Centrality identifies the nodes in a network that hold the most information transmission power, measured by how often a node lies on the shortest paths between other nodes. Most social networks break down into clusters, with key leaders at the focal points. A small subset of individuals will be members of several clusters. These individuals are said to have high betweenness centrality, as they can quickly transmit key information to the leaders of several groups. Paul Revere had high betweenness centrality in colonial Boston, as he had personal connections to the leaders of the major revolutionary groups.
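To make the idea concrete, here is a minimal pure-Python sketch of Brandes’ algorithm for betweenness centrality on a tiny made-up network (the names and adjacency data below are illustrative only, not from the LDBC dataset or Anzo):

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for betweenness centrality on an
    unweighted, undirected graph given as an adjacency dict."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Single-source BFS, tracking shortest-path counts (sigma)
        # and predecessors along shortest paths.
        stack, queue = [], deque([s])
        pred = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Undirected graph: every pair was counted twice.
    return {v: score / 2 for v, score in bc.items()}

# Toy network: "revere" bridges two otherwise separate groups.
adj = {
    "revere":  ["warren", "dawes", "church", "hancock"],
    "warren":  ["revere", "dawes"],
    "dawes":   ["revere", "warren"],
    "church":  ["revere", "hancock"],
    "hancock": ["revere", "church"],
}
scores = betweenness(adj)
# scores["revere"] == 4.0; everyone else scores 0.0
```

Every shortest path between the two groups passes through the bridge node, so it alone accumulates a nonzero score.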
The Real World Mechanics
To run Betweenness Centrality, a couple of preparatory steps are necessary in which we generate a derived graph from our real-world graph model. First, we need to extend the model to establish links between people that the algorithm will use. From an ontological perspective, we need to add a new predicate to the model. We can even leverage RDF* to give this connection a weight that indicates the strength of that relationship.
<<<person1> :isConnectedTo <person2>>> :weight <value>.
Next, we need a heuristic that instantiates this relationship between individuals in our graph. Studying the model, one logical pattern indicating such a relationship is one individual posting a message and another individual replying to it. We can use the number of messages (posts and comments/replies) exchanged between two individuals to compute the weight of this edge: the more messages they exchange, the stronger the connection. As depicted in the figure below, we apply a SPARQL query to the original knowledge graph, and the inserted statements produce a derived graph specifically designed to answer our original question using the chosen algorithm. All that’s left is to execute the algorithm, by far the easiest part.
INSERT {
    <<?person1 :isConnectedTo ?person2>> :weight ?numMessages .
}
WHERE {
    SELECT ?person1 ?person2 (COUNT(?message) AS ?numMessages)
    WHERE {
        ?message :hasCreator ?person1 .
        ?comment :replyOf ?message .
        ?comment :hasCreator ?person2 .
    }
    GROUP BY ?person1 ?person2
}
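For readers without an RDF engine at hand, the same aggregation can be sketched in plain Python over a toy triple store. The data and names below are illustrative only, not from the LDBC dataset:

```python
from collections import Counter

# Toy triple store: (subject, predicate, object). Data is made up.
triples = [
    ("msg1", "hasCreator", "alice"),
    ("c1",   "replyOf",    "msg1"),
    ("c1",   "hasCreator", "bob"),
    ("c2",   "replyOf",    "msg1"),
    ("c2",   "hasCreator", "bob"),
    ("msg2", "hasCreator", "bob"),
    ("c3",   "replyOf",    "msg2"),
    ("c3",   "hasCreator", "alice"),
]

# Map each message or comment to its creator (the :hasCreator triples).
creator = {s: o for s, p, o in triples if p == "hasCreator"}

# Mirror of the SPARQL aggregation: for every reply, credit one message
# to the (author, replier) pair; the count becomes the edge weight.
weights = Counter()
for s, p, o in triples:
    if p == "replyOf":
        author, replier = creator[o], creator[s]
        weights[(author, replier)] += 1

# weights[("alice", "bob")] == 2; weights[("bob", "alice")] == 1
```

The resulting weighted pairs are exactly the derived :isConnectedTo edges the query materializes, ready to feed into a weighted centrality algorithm.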
How does this scale?
The benefit of our approach is only realized if it can be applied to very large knowledge graphs comprising many sources. Fortunately, Anzo’s massively parallel processing (MPP) architecture allows users to combine data from numerous siloed sources to build and process knowledge graphs at unparalleled scale. A key feature of Anzo is that each data source is activated in the RDF graph engine as an in-memory data layer, on top of which additional layers can be added that logically connect, extend, and transform the knowledge graph. These additional layers are often implemented as a set of SPARQL queries that perform in-memory transformations and create, as in our example, derived graphs.
Our data-layer method enables two major benefits for executing graph algorithms. First, in the real world, users will want to rapidly test and apply different combinations of logic to create derived graphs and then execute algorithms. For example, the query above could be modified to take advantage of additional information the knowledge graph contains, such as where people work or went to school. Data layers make testing heuristics and algorithms an iterative process: to test a change, all that is necessary is an in-memory refresh of the layer, which takes moments. Second, the ontology and data structures do not need to be modified for each use case. If, in addition to finding influencers, the desired output were clusters of individuals based on forum activity, that analysis could be prototyped in a new layer, independent of the first. This approach allows users to rapidly develop new use cases once the baseline knowledge graph has been established.
I’ve described one way that a specific graph algorithm can be applied in a sample RDF knowledge graph. Our team at Cambridge Semantics has decades of experience developing and applying knowledge graphs in real-world scenarios. If your interest is piqued and you’d like to have a conversation, please message me at firstname.lastname@example.org or visit our website: www.cambridgesemantics.com.
- The original inspiration for the intro is Malcolm Gladwell’s book The Tipping Point. For a little more background, see this blog post: https://medium.com/better-marketing/paul-revere-a-brilliant-genius-of-viral-marketing-8ac62edc459d
- If you are interested in the technical details of different centrality algorithms, see this slide deck: https://ocw.mit.edu/courses/civil-and-environmental-engineering/1-022-introduction-to-network-models-fall-2018/lecture-notes/MIT1_022F18_lec4.pdf
- Kieran Healy wrote a very technical blog post discussing how centrality algorithms would have identified Paul Revere as a person of interest in colonial Boston: https://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/
- Figure two is an excellent visual of betweenness centrality. Again, from a highly technical source: https://cs.hse.ru/data/2015/05/14/1098547089/4._Centrality_Metrics.pdf