机器学习
Web as a directed graph:
- Nodes: Web pages
- Edges: Hyperlinks
Define a rank rj for page j
rj=∑i→jridi
Let M be a n×n matrix. If i→j, Mij=1di, else Mij=0, and let ri of vector r denote rank of page i. The flow equations can be written as:
r=M⋅r
2 Problems
- Dead End (some pages have no out-links)
- Spider Trap (all out-links are within a group)
Teleports
With probability 1−β jump to some random pages. This solves Spider Trap problem. For those Dead Ends, always teleports! Thus, the rank for page j is
rj=∑i→jβridi+(1−β)1N
The corresponding vectorized form is
r=(βM+(1−β)[1N]N×N)⋅r
Practical Implementation
M is too large to hold in memory.
Sparse matrix
We rearrange the pagerank equation
r=βM⋅r+[1−βN]N
Since M is a sparse matrix, there are many tricks for it. In practice, we might need to replace [1−βN]N with [1−∑jrjN]N, since we have dead ends.
Assume enough RAM to fit rnew into memory, and store rold and M on disk. 1 step of power iteration is:
- Initialize rnew=1−βN;
- For each page i
- Read encoded sparse matrix M(1,:) into memory;
- For j=1…di, update rnew.