Goal: In this assignment, we will compute PageRank scores for the web dataset provided by Google in a programming contest in 2002.

Input Format: The datasets are given as txt files. The file format is:

  • Rows from 1 to 4: Metadata. They give information about the dataset and are self-explained.
  • Following rows: each row consists of 2 values and represents a link from the web page in the 1st column to the web page in the 2nd column. For example, the row 0 11342 means there is a directed link from page id 0 to page id 11342.
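
The file format described above can be read with a short generator. This is only a sketch: the function names `read_edges` and `collect_nodes` and the parameter `n_metadata_rows` are illustrative choices, not part of the assignment, and the metadata rows in the sample are placeholders. It skips the first 4 rows as stated above and collects page ids from the edges themselves, so it does not assume the ids are contiguous.

```python
from io import StringIO


def read_edges(lines, n_metadata_rows=4):
    """Yield (source, target) page-id pairs from an iterable of text lines.

    The first `n_metadata_rows` rows are skipped, following the file format
    described in the assignment (rows 1 to 4 are metadata).
    """
    for row_number, line in enumerate(lines):
        if row_number < n_metadata_rows:
            continue
        parts = line.split()
        if len(parts) != 2:
            continue  # tolerate blank or malformed rows
        yield int(parts[0]), int(parts[1])


def collect_nodes(edges):
    """Return the set of page ids that appear in either column."""
    nodes = set()
    for source, target in edges:
        nodes.add(source)
        nodes.add(target)
    return nodes


# Usage with an in-memory sample (the metadata rows here are placeholders).
sample = StringIO("# row 1\n# row 2\n# row 3\n# row 4\n0 11342\n11342 7\n")
edges = list(read_edges(sample))
print(edges)
print(sorted(collect_nodes(edges)))
```

For a real file, the same generator can be fed an open file handle, for example `with open("web-Google_10k.txt") as handle: edges = list(read_edges(handle))`.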

There are two datasets that we will work with in this assignment.

  1. web-Google_10k.txt: This dataset contains 10,000 web pages and 78,323 links. The dataset can be downloaded from here. DO NOT assume that page ids are from 0 to 10,000.
  2. web-Google.txt: This dataset contains 875,713 web pages and 5,105,039 links. The dataset can be downloaded from here. DO NOT assume that page ids are from 0 to 875,713.

Also, it's helpful to test your algorithm with this toy dataset.

Output Format: the output format for each question will be specified below.

There are two questions in this assignment, worth 50 points total.

Question 1 (20 points): Find all dead ends. A node is a dead end if it has no outgoing edges or all its outgoing edges point to dead ends. For example, consider the graph A->B->C->D. All nodes A, B, C, D are dead ends by this definition. D is a dead end because it has no outgoing edge. C is a dead end because its only outgoing neighbor, D, is a dead end. B is a dead end for the same reason, and so is A. Use Python.

  1. (10 points) Find all dead ends of the dataset web-Google_10k.txt. For full score, your algorithm must run in less than 15 seconds. The output must be written to a file named deadends_10k.tsv.
  2. (10 points) Find all dead ends of the dataset web-Google.txt. For full score, your algorithm must run in less than 1 minute. The output must be written to a file named deadends_800k.tsv.
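
The recursive dead-end definition can be evaluated in time linear in the number of edges by peeling nodes off from the sinks backwards. The sketch below is one possible approach, not necessarily the intended solution. The name `find_dead_ends` is illustrative, and the input is assumed to be an iterable of (source, target) integer pairs. It tracks how many out-edges of each node still lead to nodes that are not dead ends, and marks a node as soon as that count reaches zero.

```python
from collections import defaultdict, deque


def find_dead_ends(edges):
    """Return the set of dead-end node ids for a directed edge list."""
    remaining_out = defaultdict(int)   # out-edges not yet known to hit dead ends
    predecessors = defaultdict(list)   # reverse adjacency list
    nodes = set()
    for source, target in edges:
        nodes.add(source)
        nodes.add(target)
        remaining_out[source] += 1
        predecessors[target].append(source)

    # Start from nodes with no outgoing edges at all.
    queue = deque(node for node in nodes if remaining_out[node] == 0)
    dead = set(queue)

    # Propagate backwards: each edge into a dead end is "used up".
    while queue:
        node = queue.popleft()
        for pred in predecessors[node]:
            remaining_out[pred] -= 1
            if remaining_out[pred] == 0 and pred not in dead:
                dead.add(pred)
                queue.append(pred)
    return dead


# The chain A->B->C->D from the text, encoded as 1->2->3->4, plus a
# two-node cycle 5<->6 that also links into the chain.
chain_and_cycle = [(1, 2), (2, 3), (3, 4), (5, 6), (6, 5), (5, 1)]
print(sorted(find_dead_ends(chain_and_cycle)))  # prints [1, 2, 3, 4]
```

Nodes 5 and 6 are not dead ends because each always keeps an outgoing edge to the other. Every edge is visited once while building the graph and at most once during propagation.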

The output format for Question 1 is a single column, where each row is the id of a dead end. See here for a sample output for the toy dataset.

Question 2 (30 points): Implement the PageRank algorithm for both datasets. The taxation parameter for both datasets is β = 0.85 and the number of PageRank iterations is T = 10.

  1. (15 points) Run your algorithm for the web-Google_10k.txt dataset. For full score, your algorithm must run in less than 30 seconds. The output must be written to a file named PR_10k.tsv.
  2. (15 points) Run your algorithm for the web-Google.txt dataset. For full score, your algorithm must run in less than 2 minutes. The output must be written to a file named PR_800k.tsv.
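
A minimal sketch of PageRank with taxation follows, using the common update v' = β M v + (1 − β)/n with β = 0.85 and T = 10 iterations. Several things here are assumptions rather than requirements from the assignment: the function name `pagerank`, the uniform starting vector 1/n, and the handling of dead ends, which the text above does not specify. In this sketch a dead end simply passes no rank forward, so total rank can shrink. The sample scores for the toy dataset sum to more than 1, which suggests the reference solution treats dead ends differently, so this code should not be expected to reproduce them.

```python
from collections import defaultdict


def pagerank(edges, beta=0.85, iterations=10):
    """Return {node_id: score} after a fixed number of taxed iterations.

    `edges` is an iterable of (source, target) pairs. Repeated edges are
    counted as separate links (an assumption of this sketch).
    """
    out_links = defaultdict(list)
    nodes = set()
    for source, target in edges:
        nodes.add(source)
        nodes.add(target)
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # assumed uniform start

    for _ in range(iterations):
        # Every node receives the teleport share (1 - beta) / n.
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        # Each node splits beta * rank evenly among its out-links.
        for source, targets in out_links.items():
            share = beta * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# On a symmetric 3-cycle every page keeps the same score, 1/3.
scores = pagerank([(0, 1), (1, 2), (2, 0)])
print(scores)
```

Dictionary-based loops like these are easy to check on the toy dataset. Whether they meet the time limits on the larger dataset depends on the machine; an array-based formulation may be needed there.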

The output format for Question 2 is two-column:

  • The first column is the PageRank score.
  • The second column is the corresponding web page id.

The output must be sorted by descending order of the PageRank scores. Here is a sample output for the toy dataset above.

PageRank	Ids
0.32454706832136704	0
0.3002013029682813	5
0.24391355866172854	4
0.22515097722621097	3
0.22515097722621097	2
0.22515097722621097	1
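
Writing the two-column output can be sketched as below. The function name `write_pagerank_tsv` is illustrative. Two details are guesses taken from the sample rather than stated requirements: the header row `PageRank` / `Ids`, and breaking ties between equal scores by descending page id.

```python
def write_pagerank_tsv(rank, path):
    """Write {page_id: score} to `path` as tab-separated score/id rows,
    sorted by descending score (ties: descending id, guessed from the sample).
    """
    rows = sorted(rank.items(), key=lambda item: (item[1], item[0]), reverse=True)
    with open(path, "w") as handle:
        handle.write("PageRank\tIds\n")  # header as it appears in the sample
        for page_id, score in rows:
            handle.write(f"{score}\t{page_id}\n")


# Usage with made-up scores (not results from any dataset).
write_pagerank_tsv({1: 0.2, 2: 0.5, 3: 0.2}, "PR_example.tsv")
```

If the grader expects no header row, the `handle.write("PageRank\tIds\n")` line can simply be dropped.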
