Question: Part I: Building Linear Regression Models using Spark's ML Library
In this part, you will build a regression model using Spark's ML library and Databricks Community Edition to run Spark.

Steps:

1. Download the Abalone dataset from the LibSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html). If you would like to learn more about this dataset, check its original source. According to the description there, the dataset is used for "Estimating the age of abalone from physical measurements": it supports regression analysis in which an abalone's physical measurements are used to estimate its age. More information can be found on the original dataset page. After you download the dataset, add a .txt extension to the extracted file so that you can open it in a text editor. Do NOT use Notepad; use a more sophisticated editor such as Sublime.

2. Open the dataset in Sublime or any other text editor. Each object in the file represents an abalone, which is described by a set of physical attributes (or features) and an age value. Each line in the file represents one abalone in this format:
   <label> <index>:<attribute value> <index>:<attribute value> ...
   For example:
   1.0708 1:12.3 2:23 3:154.25 4:67.75 5:36.2 6:93.1 7:85.2 8:94.5 9:59.0

3. Go to your Databricks Community Edition account, create a new workspace, and name it LinReg. Upload the abalone dataset text file to Databricks using the Data icon in the left menu. When you upload the file, make sure that you copy its location on the Databricks S3 file system. It should look something like this: /FileStore/tables/abalone.txt
   Save that path in a separate text file for now. You will need to copy it into the linear regression code later.

4. Go to the Regression part of the Spark ML online documentation, which contains sample code for creating a regression model:
   https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#linear-regression

5. Choose the code for the language that you prefer to work with. Copy that code and paste it into the Databricks workspace that you have just created.

6. Modify the code by replacing the file name data/mllib/sample_linear_regression_data.txt with the full path to the dataset file that you saved in step 3.

7. Create a cluster, attach it to your workspace/notebook, and run the code.

8. What is the RMSE of the regression model that the code built? Is it high or low? Research the meaning of RMSE online and compare it to the SSE that was discussed in the lesson slides. To tell whether the RMSE is high or low, you need to compare it to the range of values that you are trying to estimate. For example, in the abalone dataset the age of an abalone can be 7 or 15 or 2. Find the full range of abalone ages and compare the RMSE to that range; then you can decide whether the RMSE is low or high. This works because the RMSE, like the SSE, measures the error in estimating the abalone's age.

9. On the top menu, click your cluster name -> View Spark UI. This shows details about how Spark ran your code. Check all the top tabs in the Spark UI: Jobs, Stages, Storage, etc.

10. How many jobs did Spark submit to run your code? How many blocks was the RDD holding the dataset divided into? Take screenshots of this information once you find it in the various tabs.

11. Take screenshots of the code in your notebook, the obtained results, and the Spark UI tabs containing the answers to the questions in Step 10. Also, do not forget to answer the question in Step 8.
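To make the dataset line format concrete, here is a minimal pure-Python sketch that splits one LibSVM-style line into its label and its index:value pairs. The function name `parse_libsvm_line` is hypothetical and introduced only for illustration; Spark has its own LibSVM reader, so this is not needed for the assignment itself. The sample line is the one quoted in the question.

```python
from typing import Dict, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Split one LibSVM-style line into (label, {feature_index: value}).

    Hypothetical helper for illustration only.
    """
    tokens = line.split()
    # The first token is the value to be estimated (the label).
    label = float(tokens[0])
    features: Dict[int, float] = {}
    # Every remaining token has the form "index:value".
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Sample line quoted in the question text.
example = "1.0708 1:12.3 2:23 3:154.25 4:67.75 5:36.2 6:93.1 7:85.2 8:94.5 9:59.0"
label, features = parse_libsvm_line(example)
print("label:", label)
print("number of features:", len(features))
print("feature 3:", features[3])
```

Feature indices in this format start at 1, which is why the parsed values are stored in a dictionary keyed by index rather than in a zero-based list.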
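The relationship between SSE and RMSE, and the idea of judging RMSE against the range of the target values, can be sketched in a few lines of plain Python. The ages 7, 15, and 2 are the ones mentioned in the question; the predicted values are made up purely for illustration, and the helper names are hypothetical.

```python
import math
from typing import Sequence


def sse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Sum of squared errors: grows with the number of samples."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: sqrt(SSE / n), in the same units as the target."""
    return math.sqrt(sse(actual, predicted) / len(actual))


def rmse_relative_to_range(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE divided by (max - min) of the actual values."""
    value_range = max(actual) - min(actual)
    return rmse(actual, predicted) / value_range


ages = [7.0, 15.0, 2.0]        # example ages from the question text
estimates = [8.0, 13.0, 4.0]   # hypothetical predictions, for illustration only

print("SSE:", sse(ages, estimates))
print("RMSE:", rmse(ages, estimates))
print("RMSE / range:", rmse_relative_to_range(ages, estimates))
```

Because RMSE is expressed in the same units as the age, dividing it by the age range gives a rough sense of scale: a small fraction suggests a low error relative to the spread of ages, a large fraction suggests a high one. Where to draw the line between "low" and "high" remains a judgment call.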
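For readers who pick Python, the load, fit, and evaluate steps might look roughly like the sketch below. It assumes a Databricks notebook where a `spark` session already exists, uses the example path /FileStore/tables/abalone.txt from the question (your actual upload path may differ), and uses hyperparameter values in the style of the linked documentation sample; treat all of these as assumptions. The function name `fit_abalone_regression` is hypothetical.

```python
def fit_abalone_regression(spark, path="/FileStore/tables/abalone.txt"):
    """Fit a linear regression on a LibSVM file and report RMSE against the label range.

    `spark` is an existing SparkSession (predefined in Databricks notebooks).
    `path` defaults to the example location from the question; replace it
    with the path copied after your own upload.
    """
    # Imported inside the function so the sketch can be defined even where
    # PySpark is not installed; in a notebook these usually sit at the top.
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql import functions as F

    # The libsvm reader yields a DataFrame with "label" and "features" columns.
    data = spark.read.format("libsvm").load(path)

    # Hyperparameters are assumptions, not tuned for this dataset.
    lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
    model = lr.fit(data)

    # RMSE on the training data, as reported by the model's training summary.
    rmse = model.summary.rootMeanSquaredError

    # Range of the target values, used to judge whether the RMSE is high or low.
    row = data.agg(F.min("label"), F.max("label")).first()
    min_label, max_label = row[0], row[1]
    label_range = max_label - min_label

    return {
        "rmse": rmse,
        "min_label": min_label,
        "max_label": max_label,
        "rmse_over_range": rmse / label_range if label_range else None,
    }
```

In a notebook cell this would be called as `fit_abalone_regression(spark)` and the returned dictionary printed. Note that the summary RMSE is computed on the same data used for fitting, so it says nothing about performance on unseen abalones.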
