Question: Can someone please help me with this Python assignment?
Part 3. PySpark Orientation (Functional Programming Examples and Tasks)
Spark's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Since this is a Python course, we will start the Python shell by running the following in the Spark directory:
./bin/pyspark
3.1 Word searching
>>> textFile = sc.textFile("README.md")  # can be any file on your system
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
>>> textFile.filter(lambda line: "Spark" in line).count()
# How many lines contain "Spark"?
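Note that linesWithSpark, defined above, can be reused directly, so the following is equivalent and avoids repeating the filter:
>>> linesWithSpark.count()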
Put your answer here
3.2 Word counting: Let's find the line with the most words:
>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
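This returns only the word count of the longest line. As an optional variation (a sketch, not part of the assignment), you can carry each line along in a pair so the reduce also tells you which line it was:
>>> textFile.map(lambda line: (len(line.split()), line)).reduce(lambda a, b: a if a[0] > b[0] else b)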
Put your answer here
3.3 Define your own max function (same as 3.2)
>>> def max(a, b):
...     if a > b:
...         return a
...     else:
...         return b
...
>>> textFile.map(lambda line: len(line.split())).reduce(max)
Put your answer here
3.4 Word count MapReduce Example. One common data flow pattern is MapReduce. Here, we combine the flatMap, map, and reduceByKey transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the collect action:
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
>>> wordCounts.collect()
>>> wordCounts.count()
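Since collect() returns every (word, count) pair, it can produce a lot of output on a large file. As an optional aside (not required by the assignment), PySpark's takeOrdered action can pull just the most frequent words:
>>> wordCounts.takeOrdered(10, key=lambda kv: -kv[1])  # ten most frequent words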
Put your answer here
3.5 Word count App
Use the following command to run a simple word count app in Spark (wordcount.py expects an input file as its argument; README.md is used here as an example):
./bin/spark-submit examples/src/main/python/wordcount.py README.md
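For reference, a self-contained word count app submitted this way looks roughly like the sketch below. This is a minimal sketch, not the actual examples/src/main/python/wordcount.py shipped with Spark (which may differ in details); the file name wc_sketch.py is hypothetical.

# wc_sketch.py -- hypothetical minimal word count app
# Run with: ./bin/spark-submit wc_sketch.py <input file>
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    counts = (sc.textFile(sys.argv[1])
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    for word, count in counts.collect():
        print("%s: %i" % (word, count))
    sc.stop()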
Put your answer here
3.6 Word count program performance evaluation.
We provide three different ways to do the frequency counting in Python. You need to compare the outputs and report whether they return the same results. If not, try to explain why.
Method A: Using Python's built-in Counter
import re
from collections import Counter
words = re.findall(r'\w+', open('hamlet.txt').read().lower())
Counter(words).most_common(10)  # top ten most common words
Method B: Using a Python dictionary
import operator

file = open("hamlet.txt", "r")
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
newdict = sorted(wordcount.items(), key=operator.itemgetter(1))
print([i for i in newdict[::-1][:10]])  # reverse the order to get the ten most frequent
Method C: Using Spark
./bin/pyspark
>>> from pyspark import SparkContext, SparkConf  # optional in the shell; sc is already defined
>>> count = sc.textFile('hamlet.txt').flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
>>> output = count.map(lambda kv: (kv[1], kv[0])).sortByKey(False).take(10)
>>> print(["%s: %i" % (value, key) for key, value in output])
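A hint for the comparison: the three methods do not tokenize identically. Method A lowercases the whole text and extracts words with the \w+ regex, which strips punctuation, while Methods B and C split on whitespace only, so "Hamlet", "hamlet", and "hamlet," are counted as three different words. The sketch below (an illustration, not part of the assignment) applies Method A's tokenization to the dictionary approach of Method B, after which the two should agree:

import re
from collections import Counter

# Tokenize exactly as Method A does: lowercase, then \w+ word characters only.
words = re.findall(r'\w+', open('hamlet.txt').read().lower())

# Dictionary counting as in Method B, but over the normalized tokens.
wordcount = {}
for word in words:
    wordcount[word] = wordcount.get(word, 0) + 1

# With identical tokenization, both approaches count the same words.
print(sorted(wordcount.items(), key=lambda kv: kv[1], reverse=True)[:10])
print(Counter(words).most_common(10))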
Put your answer here