De Econometrist neemt een statistische kijk op de wereld.
How did Google become so successful? What did the first search engine look like? The answer to these questions is related to a pretty basic mathematical concept which was revolutionary in that time. In this article I will tell you about the PageRank algorithm, which is the first and best known algorithm that Google uses.
You may have heard the name earlier, Larry Page, he is one of the founders of Google and also one of the richest men in the world. He studied computer science at the University of Michigan and at Stanford University. After that, he started his PhD in computer science and he was looking for a dissertation theme. In that time, the World Wide Web already existed and it was getting bigger and bigger. But a big difference with the World Wide Web of today is that it were basically only loose pages, which had some links to each other but there wasn’t really a structure. Page decided to focus on that and to explore the mathematical properties of the web, understanding its structure as a big graph. Finally, he developed the PageRank algorithm which is also named after him (and not after the English word page as you may have guessed when you started reading this article).
In this algorithm each page corresponds to a different URL. Because of this, most websites contain a lot of pages. But how can we order all these pages? Let us first take a look at the definition which Google itself gives to us:
So to find the order of a few pages, we need to take a look at how the pages are linked with each other.
To see how the pages are related, we will look at an example. Imagine that we are in a really simple web with just 5 webpages of www.deeconometrist.nl. Some of our articles are linked to other articles and that way you can go to another webpage. Let us use a bit of graph theory and let the nodes of the graph be the webpages and let the links between the pages be the arrows. In the picture below we see for example that it is possible to go from page D to page A but that it is not possible to go back (in one step).
Let us now imagine that there is someone who is reading articles at our simple website. We will assume that this person walks randomly through the graph. When this person is for example at webpage B, he can go to page A or to C and both events have a probability of a . Then we can put these probabilities in a matrix, and this gives us the following:
In this matrix, the probability (which is the element in the ith row of the jth column) corresponds to the probability of moving from station to station . The people who already have some basic linear algebra knowledge may have recognized this already, namely, this matrix is also a Markov matrix. This is because of the fact that the elements of each column sum to 1.
Let us now continue with the random walk of our person. We will first introduce the random variable which represents the page where the person is after steps. From our definition of , we know that this is the probability of going to state conditional on being in state , written down mathematically this means:
What is really interesting about this, is that the probability is independent of , it only depends on the current state! Another interesting insight is that the elements of matrix corresponds to the probabilities of being in state after 2 steps conditional on being in state at the start. So these probabilities are:
We will not go into the proof of this result but let us do a check for a simple case.
Consider element (1,1) of the matrix . This tells us that the probability of starting at webpage A and ending up at webpage A after 2 steps. In the first step we can only go to webpage B so this probability is equal to 1. From webpage B we can go to page A or page C, so the probability of ending up at A again after 2 steps is . And, as we see, this holds!
We are now at the point that it gets even more interesting. Let us compute for some big number . For example, take = 32, this gives the probabilities of being in state after 32 steps conditional on being in state at the start.
But what we see now, is that all the columns are actually the same! And when we compute for some bigger than 32 we get that the matrix is still the same. But that seems pretty strange, doesn’t it? This tells us that the probability of being in for example state A after a big amount of steps is 0.293, no matter at which state we start. Hence, after a big amount of steps, the probability of being in a certain state is independent of state where we started! This means that after some time, we reach a steady state which tells us about the fraction of people who visit a certain state in the long-run. Hence, we can now order our pages! So when somebody puts the name of our website in the search engine, the order in which we give him/her the pages is given by B, A, C, E, D.
The previous example showed how a pretty basic result from linear algebra gives us some really useful insights in the fraction of people who visit a certain webpage. When Larry Page developed this algorithm, it was definitely revolutionary. When other people saw how useful this algorithm was, they also wanted to use it but luckily for Larry Page, he already had a patent on it. You can read more about monopolies in the article of Sjors Keet next week, unfortunately it is not online yet so I can’t link to it to improve its PageRank… An interesting fact is that this patent is actually assigned to Stanford University. To get the exclusive license rights, Google paid 1.8 million shares of Google to Stanford University. In 2005, Stanford University sold these shares for an astonishing amount of $336 million!
Of course, the algorithms that Google is using today are way more advanced than the algorithm that we just discussed. But what I personally like the most about this algorithm is that the mathematics which is being used is still pretty basic, however, it led to a huge breakthrough for search engines and to the birth of a commercial empire. This shows us the power of using mathematics outside of their “normal” context!
Dit artikel is geschreven door Stan Koobs