This post explores PageRank, the foundational algorithm that powers Google’s search rankings. While originally designed for web pages, understanding PageRank’s mechanics and mathematical foundations provides crucial context for its variants like ArticleRank, which I developed for the Delphi project to better handle academic citation networks.

PageRank

The point of PageRank is to take a directed graph and determine the importance of each node. It does this by iteratively passing importance values along each edge in its direction (never in reverse). Since Brin and Page originally invented PageRank to rank web pages, let’s use web pages linking to one another as our directed graph.

There are two main heuristics that drive the updates each iteration:

  • The more important a node is, the more likely it is to be linked to by other nodes. For web pages, if I add a link to Google, this would, in theory, boost Google’s PageRank a tiny bit.

  • A website is more important if more important websites link to it, all else equal. If Google put a link to timholds.com on their homepage, my PageRank would get an astronomical boost.

The Algorithm in Practice

Here’s a simplified view of how PageRank actually works:

Initial state: Every page starts with equal importance (1/N)
Each iteration:
  - Each page divides its current PageRank by the number of outgoing links
  - It sends this fraction to each page it links to
  - Each page's new PageRank ~= sum of all incoming fractions

The equation is:

\[PR(p) = \frac{1-d}{N} + d \times \sum_{q \in M(p)} \frac{PR(q)}{C(q)}\]

Where:

  • \(PR(p)\) is the PageRank of page \(p\)
  • \(d\) is the damping factor (usually 0.85)
  • \(N\) is the total number of pages
  • \(M(p)\) is the set of pages that link to page \(p\)
  • \(C(q)\) is the number of outgoing links from page \(q\)
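To make the update concrete, here’s a minimal sketch of the iteration in Python. The four-page web, its page names, and the iteration count are all made up for illustration; this is just the loop and the formula above, not production code:

```python
damping = 0.85

# Toy graph (hypothetical): links[p] = pages that p links out to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

pages = list(links)
N = len(pages)
pr = {p: 1.0 / N for p in pages}  # initial state: equal importance

for _ in range(50):  # 50 iterations is plenty for a graph this small
    new_pr = {}
    for p in pages:
        # sum of PR(q)/C(q) over every page q that links to p
        incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
        new_pr[p] = (1 - damping) / N + damping * incoming
    pr = new_pr

print({p: round(r, 4) for p, r in pr.items()})
```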

Mathematical Motivation

TL;DR: PageRank has both an iterative and an analytical solution.

Graphs and Matrices

[Figure: Graph → Matrices]

Graphs can be represented as matrices!

Every graph, directed or undirected, can be represented as an adjacency matrix. Linear algebra is an incredibly powerful tool for reasoning about matrices, so by converting a graph into a matrix we get its full firepower for reasoning about the graph.

If we normalize each column of the adjacency matrix by that node’s out-degree, we get the transition matrix: a stochastic matrix where each column sums to 1. Intuitively, entry \((i, j)\) of the transition matrix is the probability of hopping from node \(j\) to node \(i\) in one step.
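As a sketch, here’s that normalization on the same toy graph from above, using the column-as-source convention (an assumption; a row-based convention is equally common):

```python
import numpy as np

# A[i, j] = 1 means page j links to page i (columns are sources).
# Node order A, B, C, D is an assumption for illustration.
A = np.array([
    [0, 0, 1, 0],  # A is linked to by C
    [1, 0, 0, 0],  # B is linked to by A
    [1, 1, 0, 1],  # C is linked to by A, B, and D
    [0, 0, 0, 0],  # D has no incoming links
], dtype=float)

P = A / A.sum(axis=0)  # divide each column by its out-degree
print(P.sum(axis=0))   # -> [1. 1. 1. 1.], each column sums to 1
```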

Random Surfer

The random surfer model asks: “If a user randomly clicks links on the web, what fraction of time would they spend on each page in the long run?” This long-run probability is just the PageRank! In other words, the PageRank of a page is the probability of finding the random surfer on that page at any given moment.

[Figure: A surfer traverses a graph]

Mechanically, the random surfer starts at a random node and traverses edges according to the transition matrix. The damping factor (usually 0.85) means the surfer follows a link 85% of the time and jumps to a completely random page the other 15% of the time. This prevents rank sinks and ensures convergence: the transition matrix is modified to include a uniform distribution over all pages, so every page has a chance of being visited even if it has no incoming links.

Here’s the modified transition matrix:

\[T = d \cdot P + (1 - d) \cdot U\]

Where:

  • \(T\) is the modified transition matrix
  • \(d\) is the damping factor (usually around 0.85)
  • \(P\) is the original transition matrix (normalized adjacency matrix)
  • \(U\) is the uniform distribution matrix, where each entry is \(\frac{1}{N}\) (with \(N\) being the total number of pages)

This modification ensures that the random surfer can always jump to any page, preventing rank sinks and ensuring the PageRank algorithm converges. In other words, it prevents a node (or cluster of nodes) with no outgoing links from absorbing all of the probability mass.
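Here’s a quick sketch of that construction with NumPy, reusing the toy transition matrix from the earlier example:

```python
import numpy as np

d = 0.85
N = 4

# P: the column-stochastic transition matrix from the previous sketch
P = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
])

U = np.full((N, N), 1.0 / N)  # uniform jump: every page equally likely
T = d * P + (1 - d) * U       # the modified transition matrix

print(T.sum(axis=0))  # still column-stochastic: [1. 1. 1. 1.]
```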

The Punchline

Hopping from page to connected page probabilistically maps really nicely onto the beloved Markov chain. If we let the random surfer hop around according to the transition matrix, the probability of landing on each node eventually converges. Linear algebra already knew that! The dominant eigenvector of this transition matrix (the one corresponding to the largest eigenvalue, which is 1 for a stochastic matrix) represents the steady-state distribution, and those are exactly the values PageRank converges to.

All three of these concepts are getting at the same thing:

  • The steady-state probability of the random surfer being on a page
  • The dominant eigenvector of the modified transition matrix
  • The converged values from the iterative PageRank formula
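As a sanity check, here’s a short sketch verifying that equivalence on the toy graph: power iteration with the modified transition matrix lands on its dominant eigenvector. The graph and iteration count are illustrative assumptions:

```python
import numpy as np

d, N = 0.85, 4
P = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
])
T = d * P + (1 - d) * np.full((N, N), 1.0 / N)

# 1. Iterative: repeatedly apply T to a uniform starting distribution
v = np.full(N, 1.0 / N)
for _ in range(100):
    v = T @ v

# 2. Analytical: dominant eigenvector of T, normalized to sum to 1
eigvals, eigvecs = np.linalg.eig(T)
dominant = eigvecs[:, np.argmax(eigvals.real)].real
dominant /= dominant.sum()

print(np.allclose(v, dominant))  # -> True
```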