The Search That Changed The World

In 1996, two graduate students at Stanford University conceived of an algorithm so powerful and so far-reaching that it completely reshaped our understanding of the technological landscape that is the World Wide Web. They called this algorithm PageRank, and with it, the two students, Larry Page and Sergey Brin, went on to create the most successful and widely used search engine to this day - Google. Prior to Google, the search engines that dominated at the time all used algorithms that ultimately ranked pages based on how frequently the search term(s) appeared in a page's content. The key insight of Larry Page and Sergey Brin was to see the Web for what it is - a network (a graph) - and to realize that the value of a page (a node within the graph) is best estimated NOT by the appearance of the query term(s) within the page's content, but by the number and respective value of the other pages (other nodes) that link to it. That is - the relative importance of a page depends on the number and quality of the other pages that link to it.

A visual example of a small Web network, with each node's size and percentage label proportional to its ranking via PageRank. Notice that page C has a higher ranking than page E even though page C has only one incoming link - but that one link comes from page B, which has a high ranking (the highest in this case). In this way, both the quality and the number of links are factored in when determining rank in the PageRank algorithm. [Credit to Wikipedia for the image]

This algorithm proved so effective that it remains a foundational idea in Web search to this day. In this article, I will dive into the mathematics behind the PageRank algorithm. As a forewarning, I will not sugarcoat any of it, because I truly believe this is something worth understanding. However, it does require some exposure to elementary matrix algebra and probability theory to be appreciated. If understanding the algorithm's inner workings in detail is of no interest to you, you may stop here. I would like to point out, though, that this algorithm is actually simple enough that any self-respecting introductory course/book on Linear Algebra will cover it as a real-world example of applied matrix computation.

PageRank views the act of Web surfing as a random process - specifically, a Markov process. The transition matrix A for the Markov process is an n x n matrix, where n is the total number of pages being searched. The (i, j)-th entry of the matrix, that is A[ i ][ j ], is the probability that a random Web user will go from webpage j to webpage i. The model assumes that a user will follow a link on the current page with a certain probability, and will otherwise jump to a page chosen at random from among all n pages, whether or not it is linked from the current page - the essence of a Markov chain.

As an example, say the current webpage j has links to four other pages, and the user will follow one of these four links 75% of the time. The other 25% of the time, the user will randomly surf to a page that may or may not be linked from the current page j. In this case, if the next page the user goes to is numbered i, and there is no link on the current page j to page i, then

A[ i ][ j ] = 0.25 · (1/n)

where n is, as defined before, the total number of pages searched. That is, there is a probability of 0.25 divided by n that an arbitrary user goes from the current webpage j to a webpage i that has no direct link on page j (the 25% chance of a random jump is divided evenly among all n pages) - make sense so far?
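
To put a number on it, here is a tiny sketch in Python (my own illustration, with a made-up total of n = 100 pages) showing how that 25% random-jump chance gets split evenly across the whole Web:

n = 100                         # hypothetical total number of pages
per_page = 0.25 * (1 / n)       # random-jump probability assigned to each individual page
print(per_page, per_page * n)   # roughly 0.0025 per page; the n shares add back up to 0.25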

Now, if the current webpage j DOES have a direct link to webpage i, then the user could either follow that link, or arrive at webpage i via a random surf. Hence,

A[ i ][ j ] = 0.75 · (1/4) + 0.25 · (1/n)

If instead the current webpage j has no links to any other pages (termed a 'dangling page'), then

A[ i ][ j ] = 1/n      for   1 ≤ i ≤ n
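
As a quick sanity check (a small numeric sketch of my own, with a made-up n = 10 and the four-link example above), the probabilities of leaving page j should account for every possibility, i.e. sum to 1:

n = 10
linked   = 4 * (0.75 * (1/4) + 0.25 * (1/n))   # the four pages linked from page j
unlinked = (n - 4) * (0.25 * (1/n))            # every other page, reachable only by a random jump
print(linked + unlinked)                       # 1.0 (up to floating-point rounding)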

We now decompose the matrix A into two components. Let L( j ) denote the number of links from webpage j to other pages on the Web. If L( j ) ≠ 0, and we assume for the moment that a user always follows one of the links on the current page, then the probability of going from page j to page i is given by the n x n matrix M whose (i, j)-th entry is

M[ i ][ j ] = 1/L( j )     if there exists a link on page j to page i

otherwise,   M[ i ][ j ] = 0

If page j is a 'dangling page' (has no links on the page to any other pages) then

M[ i ][ j ] = 1/n      for   1 ≤ i ≤ n
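
As a sketch of how M might be built in practice - my own toy example, not anything from the original algorithm's specification - consider a hypothetical four-page Web where links[ j ] lists the pages that page j links to, and page 3 is a dangling page:

import numpy as np

n = 4
links = {0: [1, 2], 1: [2], 2: [0], 3: []}    # hypothetical link structure; page 3 is dangling

M = np.zeros((n, n))
for j, outgoing in links.items():
    if outgoing:                              # L(j) != 0: split evenly among the linked pages
        for i in outgoing:
            M[i, j] = 1 / len(outgoing)
    else:                                     # dangling page: equal chance of landing anywhere
        M[:, j] = 1 / n

print(M.sum(axis=0))                          # every column sums to 1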

Let's now assume that a Web surfer will follow a link on the current page j with probability p, and will otherwise, with probability 1 - p, jump to a page chosen uniformly at random from all n pages. Then the probability of going from page j to page i, the (i, j)-th entry of the transition matrix A, is given by

A[ i ][ j ] = p · M[ i ][ j ] + (1 - p) · (1/n)
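
Continuing the toy example (the matrix below is simply the M produced by the previous sketch, written out by hand, and p = 0.85 is the damping factor discussed further below), the full transition matrix A follows directly from this formula:

import numpy as np

n, p = 4, 0.85                     # p: probability of following a link on the current page
M = np.array([[0.0, 0.0, 1.0, 0.25],
              [0.5, 0.0, 0.0, 0.25],
              [0.5, 1.0, 0.0, 0.25],
              [0.0, 0.0, 0.0, 0.25]])

A = p * M + (1 - p) * (1 / n) * np.ones((n, n))
print(A.sum(axis=0))               # each column still sums to 1, so A is a valid transition matrix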

Due to the assumption of random surfing, every entry in the transition matrix A is strictly positive, and so it follows from the Perron-Frobenius theorem that the Markov process will converge to a unique steady-state n x 1 column vector r, where r[ i ] is the page ranking of webpage i for 1 ≤ i ≤ n    ( r[ i ][0] for you programmers ).

r[ i ] can equivalently be viewed as the long-run probability that a random surfer is on webpage i at any given moment. Also, by definition, the elements of r must add up to 1.

r[ 1 ] + r[ 2 ] + ... + r[ n ] = 1

The process to find the PageRank vector r is as follows:

Begin with r[ i ] = 1/n  for 1 ≤ i ≤ n, that is, we assume every page has equal ranking in the beginning.

The formula

r(t+1) = A · r(t)      where A is the transition matrix defined above

is run iteratively for time steps t = 0, 1, 2, ... until

| r(t+1) - r(t) | < ε

for some small ε (at which point the approximation to the steady state is deemed sufficiently accurate).
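
Putting the iteration into code, here is a minimal sketch of the whole procedure in Python (reusing the toy four-page example from the sketches above; the function name, tolerance, and example values are my own choices):

import numpy as np

def pagerank(A, eps=1e-8):
    n = A.shape[0]
    r = np.full(n, 1 / n)            # start with equal ranking for every page
    while True:
        r_next = A @ r               # r(t+1) = A · r(t)
        if np.abs(r_next - r).sum() < eps:
            return r_next            # change is below epsilon: effectively at steady state
        r = r_next

# Transition matrix A for the toy four-page example built above
M = np.array([[0.0, 0.0, 1.0, 0.25],
              [0.5, 0.0, 0.0, 0.25],
              [0.5, 1.0, 0.0, 0.25],
              [0.0, 0.0, 0.0, 0.25]])
n, p = 4, 0.85
A = p * M + (1 - p) * (1 / n) * np.ones((n, n))

r = pagerank(A)
print(r, r.sum())                    # the page rankings, which sum to 1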

Notice that at any time step (the state between successive iterations - or, in terms of the model, the moment between a surfer's page changes)

r[ i ] = p · ( sum of r[ j ]/L( j ) over all pages j that link to page i ) + (1 - p) · (1/n)

(with a dangling page j counted as linking to every page, contributing r[ j ]/n to the sum). That is, the page ranking of webpage i, r[ i ], at an arbitrary user's next surf depends in large part on the page rankings of the pages that link to it.

The probability p of following a link on the current page, which appears in our formula, is what's called the damping factor. It can equivalently be viewed as the probability that a user will keep clicking links at any given time step rather than jumping to a random page. The damping factor is generally set to around 0.85 - that is, it is assumed a random surfer will continue following links 85% of the time.

Finally, putting it all together. When a Web search is conducted using PageRank, the search engine first finds all sites that match the query criteria. It then lists them in decreasing order of their page ranks.
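
That final step is just a sort. As a small sketch (with made-up rank values and made-up query matches, continuing the four-page example), it might look like this:

ranks   = {0: 0.33, 1: 0.20, 2: 0.35, 3: 0.12}    # page -> PageRank score (made-up values)
matches = [1, 2, 3]                                # pages that contain the query term(s)
results = sorted(matches, key=lambda page: ranks[page], reverse=True)
print(results)                                     # matching pages in decreasing order of rank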

With this algorithm, Google solidified itself as the leader of search, the principal gateway to the World Wide Web's utility. Today, Google's ranking algorithm considers many other factors, but PageRank continues to serve as a basis for its search algorithms across its various tools.

I hope some of you actually made it this far. If you did, thank you and congratulations. I wrote this as much as a refresher for myself as out of a desire to share it with you all. As always, stay passionate, stay curious, and take care of yourselves.


Credit to Wikipedia for the pictures and explanations. Credit also to Steven J. Leon for his book Linear Algebra With Applications, which I referred to extensively while writing this post.