Question: Please do everything as described. It is simple, but I need help with it.
Part 1: Spark Setup
In this exercise you will set up an Ubuntu virtual machine and install Spark on it.
Download and install VirtualBox and Ubuntu from the following sites, as we did in class.
https://www.virtualbox.org/wiki/Downloads
https://www.ubuntu.com/download/desktop
Once the installation is complete, you will need to install the latest version of Java. Issue the following commands:
sudo apt-get update
sudo apt-get install default-jre
After the installation is done, check the version using the following command:
java -version
You need to install Scala. Download it from the following link. The file will be saved in the Downloads folder.
https://downloads.lightbend.com/scala/2.12.3/scala-2.12.3.tgz
Decompress the tgz archive using the following command:
tar -xvzf scala-2.12.3.tgz
The archive will be extracted to the scala-2.12.3 folder. Move this folder to /usr/local/scala using the following command:
sudo mv scala-2.12.3 /usr/local/scala
You need to add the Scala binaries to the PATH environment variable using the following command:
export PATH=$PATH:/usr/local/scala/bin
Test that the installation was successful by checking the version:
scala -version
Now install Spark by downloading it from
https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
Decompress it using the following command:
tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz
Then move it to /usr/local/spark using the following command:
sudo mv spark-2.2.0-bin-hadoop2.7 /usr/local/spark
Finally, add the Spark binaries to the PATH variable:
export PATH=$PATH:/usr/local/spark/bin
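Note that an export typed at the prompt only lasts for the current terminal session. Below is a minimal sketch of one way to make the PATH entries above persistent. It assumes the default bash shell on Ubuntu and that ~/.bashrc is the profile file to edit. persist_path_entry is a hypothetical helper name, not part of Scala, Spark, or Ubuntu.

```shell
# Hypothetical helper: append "export PATH=$PATH:<dir>" to a profile file,
# but only if that exact line is not already present.
# Argument 1: directory to add. Argument 2 (optional): profile file,
# assumed here to default to ~/.bashrc.
persist_path_entry() {
  local dir="$1"
  local profile="${2:-$HOME/.bashrc}"
  # The \$ keeps $PATH literal so it is expanded when the profile is read.
  local line="export PATH=\$PATH:$dir"
  # Make sure the profile file exists before searching it.
  touch "$profile"
  # -x matches whole lines only, -F treats the pattern as a fixed string.
  if ! grep -qxF "$line" "$profile"; then
    printf '%s\n' "$line" >> "$profile"
  fi
}
```

For example, running persist_path_entry /usr/local/scala/bin and persist_path_entry /usr/local/spark/bin, followed by source ~/.bashrc, should make both directories available in the current terminal and in new ones.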
Now issue the following command to check that the installation was successful:
spark-shell
It will take some time, but you should see some startup messages and ASCII art showing Spark version 2.2.0, followed by the scala> prompt.
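If one of the version checks above fails with "command not found", the usual cause is that the tool is not on the PATH. The following is a small sketch for checking all three tools at once. check_tools is a hypothetical helper name, and it only tests whether each command can be found, not whether it is the expected version.

```shell
# Hypothetical helper: report which of the given commands can be found
# on the current PATH. Returns 0 if all are found, 1 otherwise.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name.
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_tools java scala spark-shell after the setup steps should list each tool as found. Any tool reported as missing points to a PATH entry or install step that needs to be repeated.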
Part 2: Using Spark to Work with a Dataset
For this exercise, please read Chapter 2 of the textbook and use the dataset available at
http://bit.ly/1Aoywaq
Using the dataset, complete the following tasks.
1. Please create a raw RDD for all the CSV files
2. Please remove all headers from the RDD
3. Please convert each record in the RDD to a case class record
4. Please sample 20 records from the RDD.
