Question:

Since this is a big data course, the project needs to demonstrate features of a big data processing system. It could be a simple re-implementation of a machine learning pipeline from scikit-learn/pandas, but the data loading and processing pipeline needs to be implemented with the Apache Spark API
(or the API of an alternative big data processing system). While the data source for your project could be a
text file sitting in the cloud, having the entire pipeline in Spark gives the project the
flexibility to handle other interesting data sources (e.g. streaming sources such as Apache Kafka).
In addition, while you can re-use the data discussed in the lecture or in the homework,
you are required to provide additional work to demonstrate your understanding of big data
analytics. For example, if you are using a word-count program as a starting point, think about what
else can be counted in the context of the data (e.g. bigrams, trigrams, ...).
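
One way to read the word-count suggestion above is to count n-grams rather than single words, keeping the loading and counting steps inside Spark. The following is a minimal sketch under stated assumptions, not a reference solution: the function names (tokenize, ngrams, count_ngrams, top_ngrams), the simple regex tokenizer, and the choice to stop n-grams at line boundaries are all illustrative choices that do not come from the assignment. It uses the PySpark RDD calls textFile, flatMap, map, reduceByKey and takeOrdered.

```python
import re
from operator import add


def tokenize(line):
    """Lowercase a line and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())


def ngrams(tokens, n):
    """Return every window of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count n-grams in an RDD of text lines.

    Assumption: n-grams never span two lines, which keeps the job a plain
    flatMap + reduceByKey, the same shape as the classic word count.
    """
    return (
        lines.flatMap(lambda line: ngrams(tokenize(line), n))
             .map(lambda gram: (gram, 1))
             .reduceByKey(add)
    )


def top_ngrams(path, n=2, k=10):
    """Load a text file through Spark and return the k most frequent n-grams."""
    # Imported lazily so the pure-Python helpers above can be tried
    # on a machine without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ngram-count").getOrCreate()
    try:
        lines = spark.sparkContext.textFile(path)
        # Sort by descending count and keep the first k (gram, count) pairs.
        return count_ngrams(lines, n).takeOrdered(k, key=lambda kv: -kv[1])
    finally:
        spark.stop()
```

With PySpark installed, a call such as top_ngrams("corpus.txt", n=3, k=20) would return the most frequent trigrams as (tuple, count) pairs; the file name here is a placeholder. Because the counting logic only depends on an RDD of lines, the same count_ngrams function could in principle be reused with a different source, although a streaming source such as Kafka would need Spark's streaming APIs rather than textFile.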
