Question: Hello, please modify this code

Hello, please modify this code:
import pyspark
sc = pyspark.SparkContext()
def NASDAQ(line):
    try:
        fields = line.split(',')
        if len(fields)!=9:
            return False
        #int(fields[2][:4])
        return True
    except:
        return False
def COMPANYLIST(line):
    try:
        fields = line.split('\t')
        if len(fields)!=5 or ("IPOyear" in line and "Symbol" in line):
            return False
        return True
    except:
        return False
#Load files and clean
nasdaq = sc.textFile("/home/kinivera/BigData/Partial2/input/NASDAQsample.csv")
companylist = sc.textFile("/home/kinivera/BigData/Partial2/input/companylist.tsv")
nasdaq = nasdaq.filter(NASDAQ)
companylist = companylist.filter(COMPANYLIST)
nasdaq = nasdaq.map(lambda l: (l.split(',')[1],(l.split(',')[2][:4], int(l.split(',')[7])))) #symbol,(date,n)
companylist = companylist.map(lambda l: (l.split("\t")[0], l.split("\t")[3])) #symbol,sector
joined_rdd = nasdaq.join(companylist) #symbol,((date,n),sector)
#print(joined_rdd.take(10))
features = joined_rdd.map(lambda row: ((row[1][1], row[1][0][0]), row[1][0][1])) #(sector,date), n
# Reduce by key (Year, Sector) by adding the number of operations
sector_counts = features.reduceByKey(lambda x, y: x + y)#[((sector,year),n),....]
#print(sector_counts.take(10))
# Find the sector with the highest number of operations for each year
max_sector_per_year = sector_counts.map(lambda x: (x[0][1],(x[1],x[0][0]))) #[(year,(n,sector)),....]
result = max_sector_per_year.reduceByKey(lambda x, y: x if x[0]> y[0] else y) #[(year,(major n,sector))]
#print(result.take(10))
#order x year
result = result.sortByKey()
# Convert the RDD to a format suitable for saving as text
max_sector_per_year_formatted = result.map(lambda x: (x[1][1],"{},{}".format(x[0], x[1][0])))
# Save the RDD as a text file
max_sector_per_year_formatted = max_sector_per_year_formatted.coalesce(1)
max_sector_per_year_formatted.saveAsTextFile("1_out")
This was the statement given for that exercise: RDD manipulation using transformation and action operations is evaluated, along with performance optimization using RDDs. Execution time is also evaluated.
This point uses the NASDAQ and companylist datasets. Remember the data format:
For NASDAQ: exchange, stock symbol, date, stock opening price, stock high price, stock low price, stock closing price, stock volume, and adjusted closing price of the stock.
For companylist: Symbol, Name, initial public offering year (IPOyear), and industry sector.
1. Calculate, for each year of the dataset given for point 1, which sector had the greatest number of operations. The output must mention the year, the name of the sector, and the overall value of operations. The result should look like:
Finance,1996,20090342
Pharma,1996,12312312
Finance,1997,25612312
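The script above saves tuples of the form (sector, "year,n"). As far as I know, saveAsTextFile writes non-string records using their Python string form, so the output file would most likely contain lines like ('Finance', '1996,20090342') instead of the plain layout shown here. A minimal sketch of one way to build the expected line, assuming records shaped like (year, (n, sector)) as in `result` after sortByKey; `format_record` is a hypothetical helper name, not part of the original script:

```python
def format_record(record):
    """Turn a (year, (n, sector)) record into the line 'sector,year,n'."""
    year, (n, sector) = record
    return "{},{},{}".format(sector, year, n)


# Plain-Python check of the helper; no Spark needed for this part.
print(format_record(("1996", (20090342, "Finance"))))  # Finance,1996,20090342
```

In the script this would replace the last map, for example result.map(format_record).coalesce(1).saveAsTextFile("1_out").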
Deliverable 1: Spark script where RDDs are used to solve the problem, with the name 1_topsectorperyear.py. The fundamental lines of code must be explained within the script.
Deliverable 2: Output file with the results, with the name 1_out.txt
Now we have to solve the following statement of the Big Data exercise; DataFrames cannot be used.
2. Calculate, for each company and business sector, which company grew the most per year, also listing the percentage of growth. The results should be in a format similar to:
Finance,1996,ABCD,46%
Finance,1997,VFER,64%
Deliverable 3: Spark script where RDDs are used to solve the problem, with the name 2_topcompanypersector.py. The fundamental lines of Big Data/Spark code must be explained within the script.
Deliverable 4: Output file with the results, with the name 2_out.txt
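One possible RDD-only approach to point 2 is sketched below. It rests on assumptions the statement does not confirm: "growth" is taken as the change from a company's first closing price of the year to its last, the closing price is field 6 of each NASDAQ row, and dates are written year-first (for example 1996-01-02) so they sort correctly as strings. All function names are hypothetical.

```python
def parse_price(line):
    """Return ((symbol, year), (date, close)), or None for malformed rows."""
    fields = line.split(",")
    if len(fields) != 9:
        return None
    try:
        return ((fields[1], fields[2][:4]), (fields[2], float(fields[6])))
    except ValueError:
        return None


def first_and_last(a, b):
    """Merge two (earliest, latest) pairs of (date, close) tuples."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def growth_pct(first_last):
    """Percentage change from the first close to the last close of a year."""
    (_, first_close), (_, last_close) = first_last
    if first_close == 0:
        return None  # growth is undefined when starting from zero
    return (last_close - first_close) / first_close * 100.0


def top_company_per_sector(nasdaq_lines, symbol_sector):
    """nasdaq_lines: RDD of CSV lines; symbol_sector: RDD of (symbol, sector)."""
    yearly = (nasdaq_lines.map(parse_price)
              .filter(lambda kv: kv is not None)
              .mapValues(lambda v: (v, v))           # seed (earliest, latest)
              .reduceByKey(first_and_last)
              .mapValues(growth_pct)                 # (symbol, year), pct
              .filter(lambda kv: kv[1] is not None)
              .map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))))  # symbol, (year, pct)
    best = (yearly.join(symbol_sector)               # symbol, ((year, pct), sector)
            .map(lambda kv: ((kv[1][1], kv[1][0][0]), (kv[1][0][1], kv[0])))
            .reduceByKey(max)                        # (sector, year), (pct, symbol)
            .sortByKey())
    return best.map(lambda kv: "{},{},{},{:.0f}%".format(
        kv[0][0], kv[0][1], kv[1][1], kv[1][0]))
```

With the RDDs from the first script, this could be used as top_company_per_sector(sc.textFile(...), companylist).coalesce(1).saveAsTextFile("2_out"), where companylist is already mapped to (symbol, sector). Each line is parsed once and the join happens after the per-company reduction, which keeps the shuffled data small.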
This must be done in a Linux virtual machine. The file companylist.tsv has the headers Symbol, Name, IPOyear, Sector, and industry. In IPOyear, some rows contain n/a and others contain a year. The file NASDAQsample.csv has no header row.
Symbol | Name | IPOyear | Sector | industry
FLWS | 1-800 FLOWERS.COM, Inc. | 1999 | Consumer Services | Other Specialty Stores
FCTY | 1st Century Bancshares | n/a | Finance | Major Banks
FCCY | 1st Constitution Bancorp | n/a | Finance | Savings Institutions
SRCE | 1st Source Corporation | n/a | Finance | Major Banks
FUBC | 1st United Bancorp, Inc. | na | Finance | Major Banks
VNET | 21Vianet Group, Inc. | na | Technology | Computer Software:
SSRX | 3SBio Inc. | 2007 | Consumer Durables | Major Pharmaceuticals
JOBS | 51job, Inc. | 2004 | Technology | Diversified Commercial
FGHT | 88lnc | na | Dublic lltiliti | T
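Given that layout, a company row can be parsed in a single pass instead of filtering first and then splitting each line twice. The sketch below assumes tab-separated rows with exactly five fields and a header row whose first field is "Symbol"; `parse_company` is a hypothetical name. The n/a values in IPOyear need no special handling here because that column is not used by either point.

```python
def parse_company(line):
    """Return (symbol, sector) for a data row, or None for the header or bad rows."""
    fields = line.split("\t")
    if len(fields) != 5 or fields[0] == "Symbol":
        return None
    return (fields[0], fields[3])


# Plain-Python check with a row taken from the sample above.
sample = "FLWS\t1-800 FLOWERS.COM, Inc.\t1999\tConsumer Services\tOther Specialty Stores"
print(parse_company(sample))  # ('FLWS', 'Consumer Services')
```

On an RDD this would be applied as companylist.map(parse_company).filter(lambda kv: kv is not None).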