Question: Python, using PySpark
Alice is a data scientist at Networkly, a large social networking website where users can follow each other. Alice has a deadline in three hours and needs your help running an analysis. Suppose you are given the Networkly follow graph.

Assumptions you may make:
- The graph has 330 million nodes and 100 billion directed edges.
- The most followed user on Networkly has 100 million followers.
- Your computing cluster has 1,000 computers, each with 64 GB of RAM.

Using Spark, write a parallel program to compute the in-degree distribution of the graph. That is, for each k, your program should compute the number of users who have k followers (unless this number is zero, in which case you do not need to include that count in the final output). Your program should effectively take advantage of parallel processing and not attempt to read 100 billion directed edges on a single machine.

Your input is an RDD, where each entry is an edge: (u, v) is an entry in the RDD if u follows v. Your function should return another RDD, where the entries are of the form (k, number of users who have k followers).

What to submit: Please fill in the code stub provided below; you may choose to fill in the stub for either Python, Java, or Scala. You should write real code (rather than pseudocode). However, because you do not have access to a compiler/interpreter, it is possible to get full credit even if your code does not compile. Specifically, the grading rubric will be:
- (3 points) A correct pseudocode description of the algorithm, without any code.
- (4 points) Real code that partially works and corresponds to a correct high-level algorithm.
- (5 points) Real code that completely works, minus one or two missing parentheses.

Python version:

import sys
from pyspark import SparkConf, SparkContext

# Input: An RDD containing entries of the form (source node, destination node)
# Output: An RDD containing entries of the form
#         (k, number of users who have k followers)
# TODO: Write your answer code in this function.
def degree_distribution(edges):
    return distribution

if __name__ == '__main__':
    conf = SparkConf()
    sc = SparkContext(conf=conf)
    sc.setLogLevel("WARN")

    # Reads input and converts all node ids to integers
    data = sc.textFile(sys.argv[1]).map(lambda line: map(int, line.split()))

    # Computes the degree distribution
    distribution = degree_distribution(data)

    # Writes the output to a file
    distribution.sortByKey().saveAsTextFile(sys.argv[2])
    sc.stop()
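The question itself pins down the algorithm: for each followee, count the edges pointing to it, then count how many users share each follower count. A minimal sketch of one possible way to fill in the degree_distribution stub along those lines is shown below. It assumes the input is the RDD of (source node, destination node) pairs built by the driver code above, and that each entry is indexable (under Python 3, the driver's map(int, line.split()) returns an iterator, so it would need to be wrapped in tuple(...) for edge[1] to work).

# A sketch of one possible answer, assuming edges is an RDD of
# (source node, destination node) pairs as built by the driver code above.
def degree_distribution(edges):
    # Stage 1: in-degree per user. Emit (followee, 1) for every edge and sum
    # the ones per followee; reduceByKey spreads this across the cluster.
    in_degrees = edges.map(lambda edge: (edge[1], 1)) \
                      .reduceByKey(lambda a, b: a + b)

    # Stage 2: distribution. For each in-degree k, count how many users have it.
    distribution = in_degrees.map(lambda pair: (pair[1], 1)) \
                             .reduceByKey(lambda a, b: a + b)

    return distribution

Users with zero followers never appear as a destination node, so k = 0 is dropped automatically, matching the requirement that zero counts be omitted. No single machine ever materializes the full edge list: the first reduceByKey yields at most 330 million (user, in-degree) pairs spread across partitions, and the second yields at most about 100 million (k, count) pairs, since the largest in-degree is 100 million; both fit comfortably in the cluster's combined 64 TB of RAM.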
