Question: Please do everything as described. It is simple, but I need help with it.

Part 1: Spark Setup

In this exercise you will set up an Ubuntu virtual machine and install Spark on it.

Download and install VirtualBox and Ubuntu from the following sites, as we did in class.

https://www.virtualbox.org/wiki/Downloads

https://www.ubuntu.com/download/desktop

Once the installation is complete, you will need to install the latest version of Java. Issue the following commands:

sudo apt-get update

sudo apt-get install default-jre

After the installation is done, check the version using the following command:

java -version

You need to install Scala. Download the archive from

https://downloads.lightbend.com/scala/2.12.3/scala-2.12.3.tgz

It will be saved to the Downloads folder.

From the Downloads folder, decompress the tgz archive using the following command:

tar -xvzf scala-2.12.3.tgz

The archive will be extracted to a folder named scala-2.12.3. Move this folder to /usr/local/scala using the following command:

sudo mv scala-2.12.3 /usr/local/scala

Add the Scala binaries to the PATH environment variable using the following command:

export PATH=$PATH:/usr/local/scala/bin

Test that the installation was successful by checking the version:

scala -version

Now install Spark by downloading it from https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz

Decompress it using

tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz

Then move it to /usr/local/spark using the following command:

sudo mv spark-2.2.0-bin-hadoop2.7 /usr/local/spark

Finally, add the Spark binaries to the PATH variable:

export PATH=$PATH:/usr/local/spark/bin
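
Note that export only changes PATH for the current terminal session. A minimal sketch of one way to make the entries permanent, assuming bash is the login shell and ~/.bashrc is its startup file (the function name add_to_path_file is made up for this illustration):

```shell
# Append "export PATH=$PATH:<dir>" to a startup file, but only if that
# exact line is not already present, so repeated calls do not duplicate it.
# $1 = directory to add, $2 = startup file (assumed default: ~/.bashrc)
add_to_path_file() {
    dir="$1"
    rc_file="${2:-$HOME/.bashrc}"
    # \$PATH is escaped so the literal text $PATH is written to the file.
    line="export PATH=\$PATH:$dir"
    touch "$rc_file"
    if ! grep -qxF "$line" "$rc_file"; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}
```

Calling add_to_path_file /usr/local/scala/bin and add_to_path_file /usr/local/spark/bin once each, then opening a new terminal (or running source ~/.bashrc), should make scala and spark-shell available in later sessions.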

Now issue the following command to check that the installation was successful:

spark-shell

It will take some time, but you should see some startup messages and ASCII art showing Spark version 2.2.0, followed by the scala> prompt.
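
If spark-shell (or one of the earlier version checks) fails with a "command not found" error, a missing PATH entry is a likely cause. A small sketch that reports which of the tools the shell can find; check_tools is a hypothetical helper name, not part of any of the installed packages:

```shell
# Report whether each named command can be found on PATH.
# Returns 0 only if every one of them is found.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For this exercise it could be run as check_tools java scala spark-shell. Any tool reported as missing points back to the corresponding install or export step.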

Part 2: Using Spark to Work with a Dataset

For this exercise, please read Chapter 2 of the textbook and use the dataset available at

http://bit.ly/1Aoywaq

Using the dataset, complete the following tasks:

1. Create a raw RDD for all the CSV files.
2. Remove all headers from the RDD.
3. Convert each record in the RDD to a case class record.
4. Sample 20 records from the RDD.
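
One possible way to approach the four tasks, sketched under several assumptions that the assignment does not state: the archive behind the link has been unzipped into a directory named linkage that holds the CSV files, every file starts with the same header line, and a generic Record case class holding the split fields is acceptable until the real columns have been inspected. The names linkage, Record, part2.scala and write_part2_script are all hypothetical. The shell function below writes the Scala statements to a file so they can be loaded into spark-shell:

```shell
# Write a small Scala script covering the four Part 2 tasks.
# $1 = output file (assumed default: part2.scala)
write_part2_script() {
    out="${1:-part2.scala}"
    # The quoted 'EOF' keeps the shell from expanding anything in the script.
    cat > "$out" <<'EOF'
// 1. Raw RDD over all CSV files (assumed to live in ./linkage).
val raw = sc.textFile("linkage/*.csv")

// 2. Drop header lines (assumes every file repeats the same header).
val header = raw.first()
val noHeader = raw.filter(line => line != header)

// 3. Convert each line to a case class record (generic placeholder type).
case class Record(fields: List[String])
val records = noHeader.map(line => Record(line.split(",", -1).toList))

// 4. Sample 20 records without replacement and print them.
records.takeSample(withReplacement = false, num = 20).foreach(println)
EOF
}
```

After running write_part2_script part2.scala, the script can be loaded with spark-shell -i part2.scala, or the same Scala lines can be pasted at the scala> prompt one at a time. Once the column layout is known, the List[String] field would normally be replaced by typed fields, as the textbook chapter presumably describes.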
