
Can someone please help me with this Python assignment?

Part 3. PySpark Orientation (functional Programming Examples and Tasks)

Spark's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Since this is a Python course, we will start the Python shell by running the following in the Spark directory:

3.1 Word searching

./bin/pyspark

>>> textFile = sc.textFile("README.md") # can be some file on your system

>>>linesWithSpark = textFile.filter(lambda line: "Spark" in line)

>>>textFile.filter(lambda line: "Spark" in line).count()

# How many lines contain "Spark"?

Put your answer here
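To see what the `filter` and `count` calls are doing, the same logic can be sketched in plain Python. The sample lines below are hypothetical; run the PySpark commands against your own README.md to get the actual answer.

```python
# Plain-Python sketch of textFile.filter(...).count() from 3.1.
# These sample lines are made up for illustration only.
lines = [
    "# Apache Spark",
    "Spark is a unified analytics engine.",
    "It supports high-level APIs in Python.",
    "Run ./bin/pyspark to start the shell.",
]

# filter keeps only the lines for which the lambda returns True
lines_with_spark = [line for line in lines if "Spark" in line]

# count returns how many elements survived the filter
print(len(lines_with_spark))  # -> 2
```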

3.2 Word counting: Let's find the line with the most words:

>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

Put your answer here
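The same map/reduce pipeline can be mimicked with Python's `functools.reduce`: map each line to its word count, then reduce pairwise, keeping the larger value. The sample lines are hypothetical.

```python
from functools import reduce

lines = ["one", "one two three", "one two"]  # hypothetical sample lines

# The map step: one word count per line
word_counts = [len(line.split()) for line in lines]

# The reduce step: pairwise keep the larger of the two values
longest = reduce(lambda a, b: a if a > b else b, word_counts)
print(longest)  # -> 3
```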

3.3 Define your own max function (same as 3.2)

>>> def max(a, b):
...     if a > b:
...         return a
...     else:
...         return b
...

>>> textFile.map(lambda line: len(line.split())).reduce(max)

Put your answer here
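Note that calling the function `max` shadows Python's built-in `max()`. A plain-Python sketch of the same idea, using a non-shadowing name (`bigger` is my choice, not part of the assignment):

```python
from functools import reduce

# Same logic as the assignment's max(a, b), renamed so the
# built-in max() is not shadowed.
def bigger(a, b):
    if a > b:
        return a
    else:
        return b

word_counts = [1, 3, 2]  # hypothetical per-line word counts
print(reduce(bigger, word_counts))  # -> 3
```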

3.4 Word count MapReduce Example. One common data flow pattern is MapReduce. Here, we combine the flatMap, map and reduceByKey transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the collect action:

>>>wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)

>>>wordCounts.collect()

>>>wordCounts.count()

Put your answer here
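The three transformations can be sketched in plain Python to show what each one contributes (hypothetical input lines; the dict plays the role of `reduceByKey`):

```python
text_lines = ["a b a", "b a"]  # hypothetical input lines

# flatMap: split every line and flatten the results into one word list
words = [word for line in text_lines for word in line.split()]

# map + reduceByKey: pair each word with 1, then sum the 1s per key
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

print(sorted(counts.items()))  # -> [('a', 3), ('b', 2)]
```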

3.5 Word count App

Use the following command to run a simple wordcount app in Spark:

./bin/spark-submit examples/src/main/python/wordcount.py

Put your answer here

3.6 Word count program performance evaluation.

We provide three different ways to do the frequency counting in Python. You need to compare the output and report if they return the same results. If not, try to explain why.

Method A, Using Python's built-in Counter

import re
from collections import Counter

words = re.findall(r'\w+', open('hamlet.txt').read().lower())

Counter(words).most_common(10) # top-ten most common words

Method B Using Python dictionary

import operator

file = open("hamlet.txt", "r")
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
newdict = sorted(wordcount.items(), key=operator.itemgetter(1))
print([i for i in newdict[::-1][:10]]) # reverse the order to get the top ten

Method C Using Spark

./bin/pyspark

from pyspark import SparkContext, SparkConf

count = (sc.textFile('hamlet.txt')
         .flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b))

output = count.map(lambda kv: (kv[1], kv[0])).sortByKey(False).take(10)

print(["%s: %i" % (value, key) for key, value in output])

Put your answer here
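A quick way to see why the three methods can disagree: Method A lowercases the text and tokenizes with the regex `\w+` (which strips punctuation), while Methods B and C tokenize with `split()`, which is case-sensitive and keeps punctuation attached to words. A small sketch on a hypothetical sentence (not from hamlet.txt):

```python
import re
from collections import Counter

text = "Hamlet, Hamlet! hamlet speaks."  # hypothetical sample text

# Method A's tokenization: lowercase + \w+ strips punctuation,
# so all three variants collapse into the single token 'hamlet'
tokens_a = re.findall(r"\w+", text.lower())
print(Counter(tokens_a).most_common(1))  # -> [('hamlet', 3)]

# Methods B and C tokenize with split(): 'Hamlet,', 'Hamlet!' and
# 'hamlet' stay distinct, so no token reaches a count of 3
tokens_bc = text.split()
print(Counter(tokens_bc).most_common(1))
```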
