PageRank Explained with Javascript

See http://williamcotton.com/pagerank-explained-with-javascript from which this was adapted.

    Damping Factor:

The above graphic describes a network of web pages. Each page lists its links to other pages. You can load example networks based on images from the Wikipedia article on PageRank or you can click and then add your own pages and links.

Page 7

0.016

To add a page, click on any empty space in the above page space. This creates a page represented by the page icon you see to the left.

To link from one page to another, click on the page icon that you want to create a link from, followed by the page icon you want to create a link to.

Pages are draggable, so feel free to move them around.

Double-clicking on a page will remove it from the graph.

You'll notice that there are numbers at the bottom of each page icon. This represents the PageRank of each page. It is updated in real time to reflect the changing state of the network. In addition, the rest of the article updates to reflect the changing state of the graph.

If you scroll down the page the bottom portion of the graph remains and sticks to the top of the browser viewport. This allows you to quickly show the graph at any point by clicking . This button only appears once the graph has scrolled out of view.

Keep this dynamic functionality in mind while reading through the article as it may help you visualize what is going on.

What is PageRank?

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important".

Before we start...

We will need to familiarize ourselves with some of the language used herein.

Here is the resulting x square matrix:

This grid is what we will call the adjacency matrix of the graph that describes our network of web pages.

Now that we have described this network in the language of linear algebra, we need to further translate it in to a computer programming language. Luckily, the majority of languages have linear algebra libraries. In this example we're using a Javascript library called Sylvester. It has a number of classes that allow us to model matrices and vectors in any number of dimensions.

In programming terms, a matrix is similar to an array of arrays.

For the remainder of this article, we will be adding functions to Sylvester's Matrix class.

Row Stochastic Matrix

Since PageRank can also be thought of as a representation of how likely it is that a person randomly clicking on links well end up on a certain page, we need to convert our matrix in to a representation that represents that probability. "Stochastic" and "probability" are interchangeable terms within the realms of our discussion.

For example, since Page has on it, there is a 1 in chance that we will randomly click on any of one of the links.

This type of matrix is known as a stochastic matrix, and in our case, we are looking for a row stochastic matrix. That is, we want every row to add up to 1. We are turning our representation of a network of links to a form that takes in to account the probabilities that a random user will browse to a certain page.

However, what happens if there is a page that acts as a sink, that is, it has no outbound links? How do these dangling pages affect the calculation?

If we randomly stumble across the rare web page that has no links to other pages we will assume that we have equal probability to end up on any of the other web pages in the network. That is, we will assume that there are links to every page from a page that has no links.

In order to be fair to the other pages that do have links, these random transitions are added to all pages in our network, with an additional residual probability called a damping factor.

This damping factor is included to increase the probability that we'll end up at a page with more network centrality and counter some of the abuses that spammers could use to game this algorithm.

Neither adjustments made for dangling pages or damping factors are necessary for computing network centrality, but they are indeed properties of PageRank and are what differentiates and makes the algorithm unique.

Here is our function that computes the row stochastic matrix:

				Matrix.prototype.row_stochastic = function(damping_factor) {

					var row_length = this.elements[0].length;
					var d = (1 - damping_factor) / row_length;

					var row_total = [];

					for (var x = 0; x < row_length; x++) {
						row_total.push(0);
						for (y = 0; y < row_length; y++) {
							row_total[x] += this.elements[x][y];
						}
					}

					var a1 = this.elements.clone();

					for (var x = 0; x < row_length; x++) {
						for (var y = 0; y < row_length; y++) {
							if (row_total[x] > 0) {
								a1[x][y] = a1[x][y]/row_total[x] + d;
							}
							else {
								a1[x][y] = (1/row_length) + d;
							}
						}
					}

					return $M(a1);

				}
			

As we can see on line 4, we compute our additional probabilities of moving from one page to another as a variable called d, which is based on a damping factor and the total number of web pages.

Next, we iterate through each row and store an array of the sum of all elements in the row, as shown on lines 8 - 13.

Finally, we again iterate through each row, this time dividing each element by its row's total, fulfilling our stochastic requirements. We also add our variable d to not only pages that have no outbound links (and therefore are assumed to link to all pages), but to every page, in order to be fair to the pages that do have links. This is done on lines 17 - 26.

When using a damping factor of 0.825 we get the resulting matrix:

Things have very suddenly become very abstract. We've gone from one representation, our graph of a network of web pages, through to a matrix representation of that graph, and ended up with a matrix that takes in to account all sort of peculiar probabilities related to the likelihood of a user ending up on a given page.

It gets confusing because we're representing something that we can think of in a number of ways, as a graph, a matrix, or a set of probabilities. In turn, our use of language jumps between graph language, matrix language, and probability language quite rapidly.

This aspect of mathematics has always been the most difficult to deal with. A lot of the times the principles that are being discussed are easy to understand, but the language that mathematicians use to talk about these principles are arcane to say the least. Don't feel bad. A lot of mathematicians are only fluent in whatever specific branch they are familiar with and have just as hard a time figuring out the languages of branches they are unfamiliar with!

Eigenvector Centrality

This is where things start to get magical. It as also the hardest part to visualize. Every step so far has made a good amount of sense.

PageRank is a measure of network centrality with some tweaks made that are specific to web browsing, as we've discussed above. Since we are pretty much only dealing with matrices at this point we'll refer to this as eigenvector centrality.

This eigenvector centrality is a property of our initial graph of the network of web pages. Our previous steps, along with quick transposition of our matrix, lead us to our penultimate step towards computing PageRank.

What is a vector?

At the most abstract level, a vector is nothing more than a member of a vector space.

A vector in 2D vector space, called a euclidean vector, is a geometric object that has a magnitude and a direction. In the case of 2D computer graphics, the relationship of pixels to one another in an image can be considered vectors.

Vectors are acted upon by matrices. We can also think of a matrix as being somewhat like a function that we can apply to a vector space.

Here is the same 2x2 matrix that we saw above. It is an example of a shear mapping.

Matrices are used for a lot of different purposes. They are heavily used in the realms of 2D and 3D graphics.

Italic fonts are also an example of a shear mapping!

One way that matrices are used is for performing linear transformations.

If you've ever used the transform tool in Photoshop, you are watching linear algebra in operation.

This is the result of applying the above 2-dimensional matrix to a set of pixels in a vector space. Matrices are always applied to vector spaces.

What is an eigenvector?

When we apply our matrix to a set of pixels, it scales all of the pixels to the right by a factor of 1.25. Notice that the horizontal relationships of the pixels do not change at all. Or, in other words, the image is scaled by 1.25 along the x-axis and left the same in the y-axis.

Eigenvectors are the vectors in a space that when acted upon by a matrix have only their magnitude affected, and not their direction. In the case of our sheer mapping, vectors that are perfectly horizontal are eigenvectors, as their direction is not changed. Again, in terms of pixels, the relationship between pixels are not affected in a horizontal direction, but are in all other directions.

Technically, an eigenvector is a vector that satisfies the equation . It is a vector x, that when acted upon by a matrix A, results in the vector x multiplied by a scalar λ. The vector x is the eigenvector and the scalar λ is the associated eigenvalue.

There are an infinite amount of eigenvectors in this case, as all horizontal vectors are eigenvectors. For the remainder of the article we will refer to all eigenvectors with the same direction as just a single eigenvector, or a unit eigenvector. We are interested in finding the unit eigenvectors, of which there will sometimes be many.

However, the matrix that we've constructed is not a 2x2 matrix and it doesn't exist in a 2D vector space. In inhabits a vector space that is purely abstract, so the concept of an eigenvector is harder to visualize.

A vector in a stochastic vector space is called a stochastic or probability vector. It is a vector that adds up to 1 and describes the likelihood of certain events taking place. The eigenvector that we are going to compute is a probability vector. It describes the probabilities of ending up at certain web pages based on the link structure of our network, no matter what page we start on. The concept of direction that we discussed when referring to the above linear transformation loses it's ordinary meaning.

Instead of, as in a 2D space where applying a matrix to a set of vectors alters the literal direction that they point in, we can think of this matrix as "directing" people clicking on links towards certain pages. The eigenprobability of our modified adjacency matrix is the PageRank.

Phew!

Wait, what? What just happened here? Why do we need to use this language? Seriously, why all the talk of graphs and vectors?

Mathematics is made up of a large number of languages. Each language carves out it's own priorities and it's own set of rules. We've got algebra, trigonometry, differentials and integrals, logarithms, and these are just a few of the familiar languages.

The amazing thing is that they can be used with one another. We can find the integral of a function that comes from trigonometry. We can solve for a variable described by an exponent. We can mix and match and reframe our question in an infinite number of ways and the entire time, if we're careful, we can be sure that techniques used and the inherent truths of each language still apply.

Mathematics is almost in one way an unraveling of a tautology. Of course there is a 1/3 chance of picking out any random apple from a group of three of them. Of course there is a direct relationship between the surface area of a sphere and the volume contained by that surface.

Starting with these principles, these almost outright obvious truths, we've discovered certain key principles to the systems that we've invented. We've discovered many ways to manipulate our patterns within these systems, promoting or ignoring certain characteristics to meet the desired goals.

Mathematics is our language of truth. It doesn't lie, yet at the same time, it can be made to model just about anything we can throw at it. It is important to understand that.

Wait, isn't PageRank a search algorithm?

No, its not!

All that PageRank does is order pages by the combined authority of all pages on the Internet. It doesn't have anything to do with keyword lookups, text indexing, or anything else that is pretty much a requirement for searching for things in the way we're used to.

PageRank is just an important aspect of Google's search algorithm and is one of the primary methods related to ordering search results.

How can I use this to drive traffic to my website?

If you take anything from this article take with it the spirit of sharing knowledge. Knowledge is the only type of content that is truly valuable and there should be no doubt that search engines now and of the future will try their best to make sure it ends up at the top of their search results.

Thanks