# PageRank
## Motivation
Imagine a web surfer who moves from page to page by clicking on links randomly with uniform probability. Suppose that the surfer is confined to a set of five pages defined by the following graph, where a link from page $i$ to page $j$ is represented by an arrow pointing from $i$ to $j$.
```{r pagerank-ex-graph, echo = FALSE, message = FALSE, fig.align = 'center'}
library(fs)
library(igraph)
library(Matrix)
library(purrr)
library(readr)
A <- matrix(
c(
0, 1, 1, 1, 0,
0, 0, 0, 1, 0,
0, 0, 0, 0, 1,
1, 0, 1, 0, 1,
0, 0, 0, 1, 0
),
nrow = 5,
byrow = TRUE
)
gr <- igraph::graph_from_adjacency_matrix(A, mode = "directed")
# tight margins and a fixed seed so the layout is consistent across renders
par(mar = rep(0.1, 4))
set.seed(123)
plot(gr)
```
If the surfer is currently on page $i$, we can represent the probability that the surfer moves to page $j$ by the $\left(i,j\right)\text{th}$ entry of the following _transition probability matrix_:
$$
\mathbf{A}=
\begin{bmatrix}
0 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0\\
0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 1\\
\frac{1}{3} & 0 & \frac{1}{3} & 0 & \frac{1}{3}\\
0 & 0 & 0 & 1 & 0
\end{bmatrix}
$$
```{proposition}
Let $\mathbf{A}$ be a transition probability matrix as above.
1. Let $\lambda_{1},\ldots,\lambda_{n}$ be the eigenvalues of $\mathbf{A}^{\mathsf{T}}$. Then, $\max_{i\in\left\{1,\ldots,n\right\}}\left |\lambda_{i}\right |=1$.
2. Let $\mathbf{v}$ be an eigenvector associated with eigenvalue 1. Then, $v_{i}\geq 0$ for all $i$.
3. If a web surfer surfs for a long time, then $P\left(\text{surfer ends on page }i\right)=v_{i}$, assuming $\sum_{i}v_{i}=1$.
```
In Google's PageRank algorithm, described in @page1999, the page rank is determined by the values of $v_{i}$ in decreasing order.
```{definition}
For a given matrix, the eigenvalue with maximum absolute value is called the *dominant eigenvalue,* and the associated eigenvector is the *dominant eigenvector.*
```
To obtain a page's rank, we must compute the dominant eigenvector and eigenvalue.
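As a quick numerical check of the proposition (a sketch, assuming the 0/1 adjacency matrix `A` used to draw the figure above is still in scope), we can row-normalize `A` to obtain the transition probability matrix shown above (called `P` here to avoid clobbering the adjacency matrix) and inspect the eigenvalues of its transpose with base R's `eigen()`:
```{r pagerank-eigen-check}
# Row-normalize the example adjacency matrix to obtain the transition matrix,
# then examine the spectrum of its transpose.
P <- A / rowSums(A)
e <- eigen(t(P))
round(Mod(e$values), 3)  # the dominant eigenvalue has absolute value 1
v <- Re(e$vectors[, 1])
v / sum(v)               # the dominant eigenvector, normalized to sum to 1
```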
## Computing eigenpairs
We now consider the problem of computing eigenvalues and eigenvectors. Recall that a scalar $\lambda$ is an eigenvalue of an $n\times n$ matrix $\mathbf{A}$ if and only if it satisfies the _characteristic equation_ $\det\left(\mathbf{A}-\lambda\mathbf{I}_{n}\right)=0$, where $\mathbf{I}_{n}$ is the $n$-dimensional identity matrix and $\det\left(\mathbf{M}\right)$ denotes the determinant of $\mathbf{M}$. If the characteristic polynomial has degree 4 or less, then an explicit algebraic solution can be obtained. If it has degree 5 or higher, then no general algebraic solution exists, and the eigenvalues must be approximated by numerical methods.
In general, the computational complexity of computing the eigenvectors of an $n\times n$ matrix, e.g., by the _QR algorithm_, is $O\left(n^{3}\right)$, i.e., cubic in the dimension of the matrix. For a network of even modest size, e.g., $n=10^{5}$, this is not tractable. If we only need the dominant eigenvector rather than all eigenvectors, however, we can use a more efficient algorithm.
## Algorithm
We now present _power iteration_. Let $\mathbf{A}$ be an $n\times n$ matrix.
1. Choose $\mathbf{v}^{\left(0\right)}\in\mathbb{R}^{n}$ such that $\mathbf{v}^{\left(0\right)}\neq\mathbf{0}$, and choose a tolerance $\epsilon>0$.
2. For $i=1,2,\ldots$, repeat the following until $\left\Vert \mathbf{v}^{\left(i\right)}-\mathbf{v}^{\left(i-1\right)}\right\Vert <\epsilon$:
    a. $\mathbf{v}^{\left(i\right)}\gets\mathbf{A}\mathbf{v}^{\left(i-1\right)}$
    b. $\mathbf{v}^{\left(i\right)}\gets\mathbf{v}^{\left(i\right)}/\sum_{j}v_{j}^{\left(i\right)}$
When the algorithm terminates, $\mathbf{v}^{\left(i\right)}$ will be (approximately) the dominant eigenvector of $\mathbf{A}$. We can then compute the dominant eigenvalue $\lambda_{\max}$ by the _Rayleigh quotient_, i.e.,
$$
\lambda_{\max}=\dfrac{\left(\mathbf{v}^{\left(i\right)}\right)^{\mathsf{T}}\mathbf{A}\mathbf{v}^{\left(i\right)}}{\left\Vert \mathbf{v}^{\left(i\right)}\right\Vert_{2}^{2}}.
$$
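The following is a minimal sketch of power iteration in R, ending with the Rayleigh quotient; the function name and defaults are our own choices, and this is not yet the implementation we will use for PageRank below.
```{r power-iteration-sketch}
# Bare-bones power iteration: repeatedly multiply by A, renormalize so the
# components sum to 1, and stop when successive iterates are close.
power_iteration_sketch <- function(A, eps = 1e-8, max_iter = 1000) {
  v <- rep(1, nrow(A)) / nrow(A)
  for (i in seq_len(max_iter)) {
    v_new <- as.vector(A %*% v)
    v_new <- v_new / sum(v_new)
    if (sqrt(sum((v_new - v) ^ 2)) < eps) break
    v <- v_new
  }
  # Rayleigh quotient for the dominant eigenvalue
  lambda <- as.numeric(crossprod(v_new, A %*% v_new) / sum(v_new ^ 2))
  list(vector = v_new, value = lambda, niter = i)
}
```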
```{proof}
We now present a proof sketch for power iteration.
Let $\left\{\lambda_{i}\right\}_{i=1}^{n}$ be the eigenvalues of $\mathbf{A}$ such that $\left|\lambda_{1}\right|>\left|\lambda_{2}\right|\geq\cdots\geq\left|\lambda_{n}\right|$, and let $\left\{\mathbf{w}^{\left(i\right)}\right\}_{i=1}^{n}$ be the corresponding eigenvectors.
Suppose that the $\mathbf{w}^{\left(i\right)}$ form a basis for $\mathbb{R}^{n}$, and suppose that we begin power iteration with some $\mathbf{v}^{\left(0\right)}\in\mathbb{R}^{n}$. Because the $\mathbf{w}^{\left(i\right)}$ span $\mathbb{R}^{n}$, we can express $\mathbf{v}^{\left(0\right)}$ as a linear combination of the $\mathbf{w}^{\left(i\right)}$, i.e.,
$$
\mathbf{v}^{\left(0\right)}=\sum_{i=1}^{n}c_{i}\mathbf{w}^{\left(i\right)}
$$
where $\left\{c_{i}\right\}_{i=1}^{n}\in\mathbb{R}^{n}$. At the first iteration of the algorithm, we left-multiply $\mathbf{v}^{\left(0\right)}$ by $\mathbf{A}$, which we can write as
$$
\mathbf{v}^{\left(1\right)}=
\mathbf{A}\mathbf{v}^{\left(0\right)}=
\sum_{i=1}^{n}c_{i}\mathbf{A}\mathbf{w}^{\left(i\right)}=
\sum_{i=1}^{n}c_{i}\lambda_{i}\mathbf{w}^{\left(i\right)}
$$
where the final equality follows because $\mathbf{w}^{\left(i\right)}$ is an eigenvector of $\mathbf{A}$. Observe that if $\mathbf{v}$ is an eigenvector of $\mathbf{A}$, then $c\mathbf{v}$ is also an eigenvector of $\mathbf{A}$ for any nonzero $c\in\mathbb{R}$. The normalization step only rescales $\mathbf{v}^{\left(i\right)}$ by $c=1/\sum_{j}v_{j}^{\left(i\right)}$, so without loss of generality we omit it in what follows.
At the second iteration, we left-multiply $\mathbf{v}^{\left(1\right)}$ by $\mathbf{A}$, which we can write as
$$
\mathbf{v}^{\left(2\right)}=
\mathbf{A}\mathbf{v}^{\left(1\right)}=
\mathbf{A}\mathbf{A}\mathbf{v}^{\left(0\right)}=
\mathbf{A}^{2}\mathbf{v}^{\left(0\right)}=
\sum_{i=1}^{n}c_{i}\lambda_{i}\mathbf{A}\mathbf{w}^{\left(i\right)}=
\sum_{i=1}^{n}c_{i}\lambda_{i}^{2}\mathbf{w}^{\left(i\right)}.
$$
We can see that $\mathbf{v}^{\left(M\right)}$ will have the form
$$
\mathbf{v}^{\left(M\right)}=
\mathbf{A}^{M}\mathbf{v}^{\left(0\right)}=
\sum_{i=1}^{n}c_{i}\lambda_{i}^{M}\mathbf{w}^{\left(i\right)}=
c_{1}\lambda_{1}^{M}\mathbf{w}^{\left(1\right)}+
c_{2}\lambda_{2}^{M}\mathbf{w}^{\left(2\right)}+\cdots+
c_{n}\lambda_{n}^{M}\mathbf{w}^{\left(n\right)}.
$$
We can factor out $\lambda_{1}^{M}$ to give
$$
\mathbf{v}^{\left(M\right)}=
\lambda_{1}^{M}\left(
c_{1}\mathbf{w}^{\left(1\right)}+
c_{2}\left(\dfrac{\lambda_{2}}{\lambda_{1}}\right)^{M}\mathbf{w}^{\left(2\right)}+\cdots+
c_{n}\left(\dfrac{\lambda_{n}}{\lambda_{1}}\right)^{M}\mathbf{w}^{\left(n\right)}
\right)
$$
By assumption, $\lambda_{1}$ is the eigenvalue of $\mathbf{A}$ with the largest absolute value, so that
$$
\left|\dfrac{\lambda_{i}}{\lambda_{1}}\right|<1,\quad i\in\left\{2,\ldots,n\right\}.
$$
Thus,
$$
M\rightarrow\infty\implies\left(\dfrac{\lambda_{i}}{\lambda_{1}}\right)^{M}\rightarrow 0,
$$
so that for some large but finite $M$,
$$
\mathbf{v}^{\left(M\right)}\approx
\lambda_{1}^{M}\left(
c_{1}\mathbf{w}^{\left(1\right)}+
c_{2}\cdot 0\cdot\mathbf{w}^{\left(2\right)}+\cdots+
c_{n}\cdot 0\cdot\mathbf{w}^{\left(n\right)}
\right)=
\lambda_{1}^{M}c_{1}\mathbf{w}^{\left(1\right)}.
$$
Let $\tilde{\mathbf{w}}^{\left(1\right)}=\lambda_{1}^{M}c_{1}\mathbf{w}^{\left(1\right)}$, so that $\mathbf{v}^{\left(M\right)}\approx\tilde{\mathbf{w}}^{\left(1\right)}$. The final step of the algorithm is to normalize $\mathbf{v}^{\left(M\right)}$, i.e.,
$$
\mathbf{v}^{\left(M\right)}\approx
\dfrac{\tilde{\mathbf{w}}^{\left(1\right)}}{\sum_{j}\tilde{w}_{j}^{\left(1\right)}}.
$$
Thus, we see that $\mathbf{v}^{\left(M\right)}$ approximates the dominant eigenvector $\mathbf{w}^{\left(1\right)}$ of $\mathbf{A}$, completing the proof sketch.
```
## Considerations
### Connection to Markov chains
We can model the web surfer's behavior by a Markov chain with transition probability matrix $\mathbf{A}$. Suppose that $\boldsymbol{\pi}$ is the stationary distribution of the chain, so that
$$
\boldsymbol{\pi}^{\mathsf{T}}\mathbf{A}=\boldsymbol{\pi}^\mathsf{T}\implies
\mathbf{A}^\mathsf{T}\boldsymbol{\pi}=\boldsymbol{\pi}.
$$
We can thus view power iteration as finding the stationary distribution of the Markov chain, the $j\text{th}$ element of which can be interpreted as the long-run proportion of time the chain spends in state $j$. Observe also that power iteration normalizes $\mathbf{v}^{\left(i\right)}$ such that its components sum to 1, as required for a probability distribution.
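As a small illustration (assuming the transition matrix `P` and the `power_iteration_sketch()` helper from the sketches above), the stationary distribution of the example chain is the dominant eigenvector of $\mathbf{P}^{\mathsf{T}}$:
```{r stationary-check}
# Power iteration on t(P) approximates the stationary distribution, which
# should match the normalized dominant eigenvector computed earlier.
power_iteration_sketch(t(P))$vector
```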
### Calculating the dominant eigenvalue
We stated above that we can find the eigenvalue corresponding to $\mathbf{w}^{\left(1\right)}$ by the Rayleigh quotient. We have
$$
\dfrac{\left(\mathbf{w}^{\left(1\right)}\right)^{\mathsf{T}}\mathbf{A}\mathbf{w}^{\left(1\right)}}{\left\Vert\mathbf{w}^{\left(1\right)}\right\Vert_{2}^{2}}=
\dfrac{\lambda_{1}\left(\mathbf{w}^{\left(1\right)}\right)^{\mathsf{T}}\mathbf{w}^{\left(1\right)}}{\left\Vert\mathbf{w}^{\left(1\right)}\right\Vert_{2}^{2}}=
\lambda_{1}\dfrac{\left\Vert\mathbf{w}^{\left(1\right)}\right\Vert_{2}^{2}}{\left\Vert\mathbf{w}^{\left(1\right)}\right\Vert_{2}^{2}}=
\lambda_{1},
$$
as desired.
### Computational complexity
At each step of power iteration, we must compute $\mathbf{A}\mathbf{v}^{\left(i-1\right)}$. The first entry of this product is the inner product of the first row of $\mathbf{A}$ with $\mathbf{v}^{\left(i-1\right)}$, which requires $n$ multiplications. Because $\mathbf{A}$ is $n\times n$, we must repeat this process for each of the $n$ rows, so that computing $\mathbf{A}\mathbf{v}^{\left(i-1\right)}$ requires $n^{2}$ multiplications.
Thus, each step of power iteration has $O\left(n^{2}\right)$ complexity. While this is an improvement over cubic complexity, it is still intractable for real-world problems, e.g., if we assume that the Internet has on the order of one billion web pages, i.e., $n=10^{9}$, then power iteration requires $10^{18}$ multiplications _at each step_. If we suppose that a typical laptop computer can execute on the order of one billion operations per second, then one step of power iteration on the Internet would require $10^{18}/10^{9}=10^{9}$ seconds, or roughly 32 years (to say nothing of the memory requirements).
Observe that the quadratic complexity of a step of power iteration arises from our assumption that each component of $\mathbf{A}\mathbf{v}^{\left(i-1\right)}$ must be computed. If we _knew in advance_ the result of certain of these multiplications, we might not have to perform them. In particular, if we knew that the $\left(i,j\right)\text{th}$ entry of $\mathbf{A}$ were zero, then we would also immediately know that the $j\text{th}$ summand in the product of the $i\text{th}$ row of $\mathbf{A}$ and $\mathbf{v}^{\left(i-1\right)}$ is zero, and we could avoid doing that multiplication.
_Sparse matrices_ enable precisely this kind of savings: they represent matrices in such a way as to avoid multiplications by zero. It remains to consider the characteristics of the network of interest, so that we might determine whether a sparse representation would be advantageous. Suppose that each web page links to on the order of 10 other pages. In this case, only 10 entries of each row of $\mathbf{A}$ are nonzero, and the remaining $10^{9}-10$ entries are zero. We must perform just 10 multiplications per row, and $\mathbf{A}$ has $10^{9}$ rows, so that a single step of power iteration requires just $10^{10}$ multiplications, or roughly 10 seconds at $10^{9}$ operations per second.
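To make the savings concrete, the following rough sketch compares a dense and a sparse matrix-vector product using the Matrix package loaded above; the size `n` and the density of roughly 10 nonzero entries per row are arbitrary choices for illustration, not a benchmark.
```{r sparse-vs-dense-sketch}
n <- 2000
dense <- matrix(0, n, n)
# place roughly 10 nonzero entries per row at random positions
idx <- cbind(sample(n, 10 * n, replace = TRUE), sample(n, 10 * n, replace = TRUE))
dense[idx] <- 1
sparse <- Matrix::Matrix(dense, sparse = TRUE)
v <- runif(n)
system.time(for (j in 1:100) dense %*% v)   # repeated dense products
system.time(for (j in 1:100) sparse %*% v)  # repeated sparse products
```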
### Convergence
The approximation $\mathbf{v}^{\left(M\right)}\approx\mathbf{w}^{\left(1\right)}$ depends on the terms involving the other eigenvectors $\left\{\mathbf{w}^{\left(i\right)}\right\}_{i=2}^{n}$ shrinking toward the zero vector. The rate at which the $i\text{th}$ term shrinks is governed by the ratio $\left|\lambda_{i}/\lambda_{1}\right|$. Because the eigenvalues are ordered by absolute value, the largest such ratio is $\left|\lambda_{2}/\lambda_{1}\right|$, and it determines the rate of convergence of power iteration: if this ratio is close to 1, the algorithm may converge slowly.
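For the small example, we can see how quickly the subdominant term decays (a sketch, assuming the transition matrix `P` from above; `eigen()` returns the eigenvalues in decreasing order of modulus):
```{r convergence-ratio-sketch}
# |lambda_2 / lambda_1|^M shrinks geometrically with the iteration count M
lam <- eigen(t(P))$values
ratio <- Mod(lam[2]) / Mod(lam[1])
ratio ^ c(1, 5, 10, 20)
```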
For large matrices, $\left|\lambda_{2}/\lambda_{1}\right|$ is often close to 1. Google dealt with the resulting slow convergence by modifying the web surfer model. First, choose $p\in\left[0,1\right]$ (Google originally chose $p\approx0.15$). Then, assume that with probability $p$ the surfer jumps to a page chosen uniformly at random from the entire network, and with probability $1-p$ the surfer follows a link on the current page chosen uniformly at random, as before. Thus, rather than having behavior governed exclusively by the transition probability matrix, the surfer may also make a random jump to any page in the network. We can represent this new behavior by replacing $\mathbf{A}$ with
$$
\mathbf{M}=\left(1-p\right)\mathbf{A}+p\mathbf{B},
$$
where $\mathbf{B}$ is the matrix with every entry equal to $1/n$, and $n$ is again the number of pages in the network. Observe that when $p=0$ this reduces to the original model.
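For the small example, we can construct $\mathbf{M}$ explicitly and check that it is still a valid transition probability matrix (a sketch, assuming `A` from above and taking $p=0.15$ for illustration):
```{r google-matrix-sketch}
p <- 0.15
B <- matrix(1 / nrow(A), nrow(A), ncol(A))
M <- (1 - p) * (A / rowSums(A)) + p * B
rowSums(M)  # each row still sums to 1
```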
The surfer's behavior under this model can similarly be modeled by a Markov chain whose stationary distribution $\mathbf{v}$ satisfies $\mathbf{M}^{\mathsf{T}}\mathbf{v}=\mathbf{v}$. We can again approximate the stationary distribution by power iteration. We must be careful in implementation, however: while $\mathbf{A}$ is typically sparse, $\mathbf{M}$ never is, because every entry of $\mathbf{B}$ is nonzero. Fortunately, we can decompose $\mathbf{M}^{\mathsf{T}}\mathbf{v}$ as
$$
\mathbf{M}^{\mathsf{T}}\mathbf{v}=
\left(1-p\right)\mathbf{A}^{\mathsf{T}}\mathbf{v}+\dfrac{p}{n}\mathbf{1}\sum_{i=1}^{n}v_{i},
$$
where $\mathbf{1}$ is the $n$-dimensional vector of ones. This decomposition allows us to compute the dominant eigenvector of the sparse matrix $\mathbf{A}^\mathsf{T}$ rather than the dense matrix $\left(\left(1-p\right)\mathbf{A}+p\mathbf{B}\right)^{\mathsf{T}}$, with the accompanying decrease in computational complexity. We now implement power iteration for this model.
```{r}
#' Modified power iteration
#'
#' @param A adjacency matrix of original network
#' @param v0 starting vector
#' @param p probability of jumping to any page in `A`
#' @param tol iteration tolerance
#' @param niter maximum number of iterations
#'
#' @return list containing the number of iterations to converge and the
#' dominant eigenvector of `A`
power_iter <- function(A, v0, p, tol = 1e-4, niter = 1e3) {
mag <- function(x) sqrt(sum(x ^ 2))
# normalize v0 so that convergence can occur in a single iteration
v_old <- v0 / sum(v0)
i <- 0
delta <- tol + 1
# this term does not change from iteration to iteration
u <- (p / length(v0)) * rep(1, length(v0))
while (i < niter && delta > tol) {
v_new <- (1 - p) * crossprod(A, v_old) + sum(v_old) * u
v_new <- v_new / sum(v_new)
delta <- mag(v_new - v_old)
v_old <- v_new
i <- i + 1
}
list(
niter = i,
v = as.vector(v_new)
)
}
```
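As a quick check on the small example network (assuming the adjacency matrix `A` from the first chunk is still in scope), we can run `power_iter()` on it directly:
```{r power-iter-small-example}
power_iter(A, v0 = rep(1, nrow(A)), p = 0.15)
```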
We will test our implementation using the General Relativity and Quantum Cosmology collaboration network (`ca-GrQc`) from the Stanford Network Analysis Project (SNAP).
```{r, message = FALSE}
path <- "data/ca-GrQc.txt.gz"
if (!fs::file_exists(path)) {
download.file(
"http://snap.stanford.edu/data/ca-GrQc.txt.gz",
destfile = path,
quiet = TRUE
)
}
gr <- readr::read_tsv(path, skip = 4, col_names = c("from", "to"))
gr
```
We now implement a function to perform power iteration, measure its runtime, and extract the name of the top-ranked node, i.e., the node $j^{*}=\arg\max_{j\in\left\{1,\ldots,n\right\}}v_{j}$.
```{r}
#' Power iteration runtime and top node
#'
#' @param A adjacency matrix for power iteration
#' @param p probability of jumping to any page in `A`
#' @param ... other arguments passed to `power_iter()`
#'
#' @return list containing `p`, the name of the top node, the number of
#' iterations required to converge, and the runtime
top_node <- function(A, p, ...) {
runtime <- system.time(
pi_result <- power_iter(
A = A,
v0 = rep(1, nrow(A)),
p = p,
...
)
)
# extract the (first) top-rated node
v_max <- (rownames(A))[which.max(pi_result$v)]
list(
p = p,
top_node = v_max,
niter = pi_result$niter,
time = runtime["elapsed"]
)
}
```
Finally, we implement a function to perform power iteration for multiple values of $p$. Note that `gr` is an edgelist.
```{r}
#' Power iteration for multiple jump probabilities with sparse matrices
#'
#' @param el data frame containing a symbolic edge list in the first two
#' columns; passed to `igraph::graph_from_data_frame()`
#' @param p vector of probabilities
#' @param sparse whether to use sparse matrices
#'
#' @return a tibble containing a row for each value of `p`, the name of the
#' top node, the number of iterations required to converge, and the runtime
pagerank <- function(el, p, sparse = TRUE) {
graph <- igraph::graph_from_data_frame(el)
adj <- igraph::as_adjacency_matrix(graph, sparse = sparse)
purrr::map_dfr(p, top_node, A = adj)
}
```
We are now ready to examine the impact of $p$.
```{r, cache = TRUE}
prob <- c(10 ^ -(6:2), 0.15, 0.5, 0.9, 0.99)
pr_dense <- pagerank(gr, p = prob, sparse = FALSE)
pr_dense
```
We see that the top-rated node is 21012, and that 31 iterations were required for most values of $p$. As $p$ becomes large, the number of iterations required to converge decreases. We now repeat the process with sparse matrices.
```{r, cache = TRUE}
pr_sparse <- pagerank(gr, p = prob)
pr_sparse
```
We see that the same number of iterations is required to converge as in the dense case, but that the runtime decreases by roughly two orders of magnitude. Observe also that node 21012 is the node with the largest number of adjacent vertices, i.e., the most connections (note that the _ego graph_ of a vertex includes the vertex itself, so its size is one more than the number of neighbors):
```{r}
graph <- igraph::graph_from_data_frame(gr)
igraph::neighbors(graph, v = "21012")
# the vertex with the largest ego graph, and the size of that ego graph
igraph::V(graph)$name[which.max(igraph::ego_size(graph))]
graph %>% igraph::ego_size() %>% max()
```
Finally, observe that PageRank is a variant of [eigenvector centrality](https://en.wikipedia.org/wiki/Eigenvector_centrality):
```{r}
ec <- igraph::eigen_centrality(graph)
ec$vector[which.max(ec$vector)]
```
In eigenvector centrality, the score of a node is increased more by (inbound) connections from high-scoring nodes than from low-scoring nodes. There are several other important types of centrality, e.g., _betweenness centrality_, in which the score of a node is determined by how often it lies on shortest paths between other pairs of nodes (i.e., how often it acts as a "bridge").
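As a brief sketch of one such alternative (results not discussed here), igraph can compute betweenness centrality for the same collaboration graph:
```{r betweenness-sketch}
bc <- igraph::betweenness(graph)
head(sort(bc, decreasing = TRUE), 3)
```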