Question: Since this is a big data course, the project needs to demonstrate features of a big data processing system. It could be a simple re
Since this is a big data course, the project needs to demonstrate features of a big data processing system. It could be a simple reimplementation of a machine learning pipeline in scikitlearnPandas but the data loadingprocessing pipeline needs to be within the Apache Spark
or other alternative big data processing systems API call. While the data source could be a
textfile sitting on the Cloud for your project, having the entire pipeline in Spark gives the project a
flexibility to handle other interesting data sources eg streaming sources such as Apache Kafka
In addition, while you can reuse the data discussed in the lecture or in the homework,
you are required to provide additional work to demonstrate your understanding of big data
analytics. For example, if you are using wordcount program as a starting point, think about what
can be counted in the context of data eg bigrams, trigrams,
Step by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
