ArticleRank
This post dives deep into ArticleRank, a variant of PageRank that I used in the Delphi project for predicting high-impact research papers. While Delphi was about building a machine learning system to predict which papers would become influential, ArticleRank was one of the key algorithmic innovations that made it work.
Weaknesses and Biases of PageRank
While PageRank works brilliantly for web pages, it has some problematic biases when applied to academic citation networks:
- The “Rich Get Richer” Problem: Papers with many citations accumulate PageRank faster, creating a snowball effect that can overshadow newer, potentially groundbreaking work.
- Field Size Bias: Large fields (like machine learning) naturally have more papers and citations, inflating PageRank scores compared to smaller but equally impactful fields.
- Citation Practice Differences: Some fields cite extensively (50+ references), while others are sparse (10-15 references). PageRank doesn’t account for these cultural differences.
- The Review Paper Problem: Review papers sit at unusual points in the citation graph: they attract large numbers of citations, and the rank they accumulate can create massive PageRank distortions, especially when a heavily cited review passes that rank to only a few foundational papers.
For academic citation graphs, the general shape and framing of the PageRank algorithm is still useful, but it needs a few tweaks, and those tweaks are what give us ArticleRank.
The problem with PageRank on citation graphs
These biases are tolerable in the original context of web pages, but they make vanilla PageRank a poor fit for citation graphs. The most problematic case occurs when a highly cited paper cites very few others: it essentially becomes a PageRank “firehose” that can artificially inflate the importance of its references.
ArticleRank: A Smarter Approach
ArticleRank addresses these biases by modifying the PageRank formula to account for the average citation behavior in the network. The key insight is to normalize the influence a paper can pass on based on how many papers it cites relative to the average.
The Mathematical Fix
ArticleRank modifies the denominator of the PageRank formula:
\[AR(p) = \frac{1-d}{N} + d \times \sum_{q \in M(p)} \frac{AR(q)}{C(q) + N_{avg}}\]

Here \(d\) is the damping factor, \(N\) is the number of papers, \(M(p)\) is the set of papers that cite \(p\), and \(C(q)\) is the number of references in paper \(q\). The crucial difference is adding \(N_{avg}\) (the average out-degree of the network) to the denominator. This seemingly small change has profound effects!
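To make the formula concrete, here’s a minimal Python sketch of ArticleRank as a power iteration. The graph representation, function name, and defaults are illustrative choices for this post, not code from the Delphi project:

```python
def article_rank(citations, d=0.85, iterations=50):
    """Minimal ArticleRank sketch. `citations` maps each paper to the
    list of papers it cites; every node must appear as a key."""
    papers = list(citations)
    n = len(papers)
    # C(q): out-degree (number of references) of each paper.
    out_degree = {p: len(refs) for p, refs in citations.items()}
    # N_avg: average out-degree across the whole network.
    n_avg = sum(out_degree.values()) / n
    # M(p): reverse index from each paper to the papers that cite it.
    cited_by = {p: [] for p in papers}
    for p, refs in citations.items():
        for q in refs:
            cited_by[q].append(p)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        rank = {
            p: (1 - d) / n
            + d * sum(rank[q] / (out_degree[q] + n_avg) for q in cited_by[p])
            for p in papers
        }
    return rank
```

One pleasant side effect of the \(N_{avg}\) term: papers with zero references never cause a division by zero, since the denominator is always at least \(N_{avg}\).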
Why This Works
Adding the average out-degree of the graph to the denominator partially normalizes against outliers that have very few outgoing citations. This prevents the “firehose effect,” where a highly cited paper with few references disproportionately boosts those references.
Here’s a concrete example:
- Paper A: A groundbreaking review paper with 1000 citations, but it only cites 1 paper (Paper B)
- Paper B: The foundational work that Paper A builds upon
- Average out-degree (\(N_{avg}\)): Let’s say it’s 20 (typical for many fields)
With PageRank: Paper A would pass ALL its accumulated rank to Paper B: \(\text{Rank passed} = \frac{PR(A)}{1} = PR(A)\)
With ArticleRank: Paper A passes a normalized amount: \(\text{Rank passed} = \frac{AR(A)}{1 + 20} = \frac{AR(A)}{21}\)
This reduces the distortion by ~95%!
This normalization is exactly what we needed for citation networks: it respects the importance of highly cited papers while preventing them from creating artificial importance spikes in their references.
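Using the sketch above, we can reproduce the firehose scenario in miniature. The toy graph here is entirely hypothetical: fifty papers cite Paper A, and Paper A cites only Paper B.

```python
# Hypothetical toy graph: 50 papers cite A; A cites only B.
graph = {"A": ["B"], "B": []}
for i in range(50):
    graph[f"paper_{i}"] = ["A"]

ranks = article_rank(graph)
# A's accumulated rank reaches B divided by C(A) + N_avg,
# rather than by C(A) = 1 alone as in plain PageRank.
print(f"A: {ranks['A']:.4f}  B: {ranks['B']:.4f}")
```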
Squeezing out a time dimension
One of the most interesting applications of ArticleRank is incorporating time. In the Delphi project, we needed to understand not just which papers were important, but how their importance evolved over time. This led to creating time-windowed ArticleRank scores.
Time-Sensitive ArticleRank
Instead of computing ArticleRank on the entire citation graph, we computed it on temporal snapshots:
- ArticleRank at 1 year after publication
- ArticleRank at 3 years after publication
- ArticleRank at 5 years after publication
This gave us a trajectory of importance rather than a single number. Papers that showed rapid ArticleRank growth in their early years often became field-defining works later on.
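Here’s a rough sketch of how such snapshots might be computed, assuming each citation edge carries the year the citation was made. The edge format and helper function are illustrative, not the actual Delphi pipeline:

```python
# Hypothetical edge format: (citing_paper, cited_paper, citation_year).
def article_rank_at(edges, papers, publication_year, window_years, **kwargs):
    """ArticleRank over the citation graph as it existed `window_years`
    after a target paper's publication. `papers` must list every node;
    a fuller version would also drop papers not yet published."""
    cutoff = publication_year + window_years
    snapshot = {p: [] for p in papers}
    for citing, cited, year in edges:
        if year <= cutoff:
            snapshot[citing].append(cited)
    return article_rank(snapshot, **kwargs)

# Trajectory of importance for a paper published in 2015:
# [article_rank_at(edges, papers, 2015, w)["that_paper"] for w in (1, 3, 5)]
```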
Implementation and Impact
I’m particularly proud that this work led to ArticleRank being added to the Neo4j graph algorithms library. Working with the Neo4j team to implement and test the algorithm was a fantastic open-source experience.
Neo4j ArticleRank: https://neo4j.com/docs/graph-data-science/current/algorithms/article-rank/
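If you want to try it yourself, here’s a rough sketch of calling the algorithm through Neo4j’s official graphdatascience Python client. The connection details and the Paper/CITES schema are placeholders; check the linked docs for the current API:

```python
from graphdatascience import GraphDataScience

# Placeholder connection details.
gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))

# Project a citation graph (Paper nodes, CITES relationships) into
# the GDS catalog, then stream ArticleRank scores back as a DataFrame.
G, _ = gds.graph.project("citations", "Paper", "CITES")
scores = gds.articleRank.stream(G, maxIterations=20, dampingFactor=0.85)
print(scores.sort_values("score", ascending=False).head())
```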
Conclusion
ArticleRank demonstrates how small, thoughtful modifications to existing algorithms can make them suitable for entirely new domains. By adding just one term to PageRank’s formula, we created an algorithm that better captures the nuances of academic influence.
In the broader context of the Delphi project, ArticleRank became one of many features that helped us predict which papers would have lasting impact on their fields. But that’s a story for the Delphi post!