Question: Part I: Building Linear Regression Models Using Spark's ML Library

Part I: Building Linear Regression Models Using Spark's ML Library

In this part, you will build a regression model using Spark's ML library, with Databricks Community Edition to run Spark.

Steps:

1. Download the Abalone dataset from the LibSVM website: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html. If you would like to learn more about this dataset, check its original source. According to the description there, the dataset is used for "Estimating the age of abalone from physical measurements." It is a regression dataset: an abalone's various physical measurements are used to estimate the abalone's age.

2. After you download the dataset, add a .txt extension to the extracted file so that you can open it in a text editor. Do NOT use Notepad; use a more capable editor such as Sublime Text. Each object in the dataset file represents an abalone, described by a set of physical attributes (or features) and an age value. Each line in the file represents one abalone in this format:
   <label> <index>:<attribute value> <index>:<attribute value> ...
   where the label is the age. For example:
   1.0708 1:12.3 2:23 3:154.25 4:67.75 5:36.2 6:93.1 7:85.2 8:94.5 9:59.0

3. Go to your Databricks Community Edition account, create a new workspace, and name it LinReg. Upload the abalone dataset text file to Databricks using the Data icon on the left menu. When you upload the file, make sure that you copy its location on the Databricks S3 file system. It should look something like this: /FileStore/tables/abalone.txt. Save that path in a separate text file for now; you will need to copy it into the linear regression code later.

4. Go to the Regression part of the Spark ML online documentation. It contains sample code showing how to create a regression model: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#linear-regression

5. Choose the code for the language that you prefer to work with. Copy that code and paste it into the Databricks workspace that you have just created.

6. Modify the code by replacing the file name data/mllib/sample_linear_regression_data.txt with the full path to the dataset file that you saved in step 3.

7. Create a cluster, attach it to your workspace/notebook, and run the code.

8. What is the RMSE of the regression model that the code built? Is it high or low? Research online for the meaning of RMSE, and compare it to the SSE that was discussed in the lesson slides. To tell whether the RMSE is high or low, you need to compare it to the range of values that you are trying to estimate. For example, for the abalone dataset, the age of an abalone can be 7 or 15 or 2. Find the full range of abalone ages and compare the RMSE to that range; then you can decide whether the RMSE is low or high. That is because the RMSE, like the SSE, measures the error in estimating the abalone's age.

9. On the top menu, click on your cluster name -> View Spark UI. This shows details about how Spark ran your code. Check all the top tabs in the Spark UI: Jobs, Stages, Storage, and so on.

10. How many jobs did Spark submit to run your code? How many blocks was the RDD holding the dataset divided into? Take screenshots of this information once you find it in the various tabs.

11. Take screenshots of the code in your notebook, the obtained results, and the Spark UI tabs containing the answers to the questions in step 10. Also, do not forget to answer the question in step 8.
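To make the file format described in the steps concrete, here is a minimal Python sketch that splits one LibSVM-style line into its label and its indexed attribute values. The function name `parse_libsvm_line` is a hypothetical helper introduced for illustration; Spark's own reader does this work for you, so this is only a way to inspect a line by hand.

```python
def parse_libsvm_line(line):
    """Split one LibSVM-format line into (label, {index: value})."""
    parts = line.split()
    # The first token is the value to predict (the age, in the assignment).
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        # Each remaining token looks like "<index>:<value>".
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# The example line quoted in the assignment text.
example = "1.0708 1:12.3 2:23 3:154.25 4:67.75 5:36.2 6:93.1 7:85.2 8:94.5 9:59.0"
label, features = parse_libsvm_line(example)
print(label)
print(features)
```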
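The load, fit, and report steps might look roughly like the following in Python. This is a sketch modeled on the linear regression sample in the Spark documentation, not the assignment's official solution: the wrapper function `fit_abalone_regression` is a hypothetical name, the hyperparameters are simply the ones used in the documentation sample, and the path shown in the comment is the example path from the assignment and may differ for your upload.

```python
def fit_abalone_regression(spark, path):
    """Fit a linear regression on a LibSVM file and return (model, rmse)."""
    # Imported inside the function so it can be defined where PySpark is absent.
    from pyspark.ml.regression import LinearRegression

    # The "libsvm" reader produces a DataFrame with "label" and "features" columns.
    training = spark.read.format("libsvm").load(path)

    # Hyperparameters copied from the documentation sample; they are not tuned.
    lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
    model = lr.fit(training)

    # The training summary exposes the RMSE on the training data.
    return model, model.summary.rootMeanSquaredError


# In a Databricks notebook, `spark` is already defined, so a call could be:
# model, rmse = fit_abalone_regression(spark, "/FileStore/tables/abalone.txt")
# print("RMSE:", rmse)
```

Wrapping the steps in a function is only a convenience; pasting the documentation sample directly into a notebook cell and changing the path works equally well.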
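The comparison asked for in the RMSE step can be illustrated with a small pure-Python sketch. The ages and predictions below are hypothetical numbers chosen for illustration, not values from the abalone dataset, and the helper names are assumptions. The sketch shows that SSE grows with the number of records and is in squared units, while RMSE is in the same units as the age, so it can be divided by the range of ages to judge its size.

```python
import math


def sse(actual, predicted):
    """Sum of squared errors: grows with the number of records."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error: in the same units as the target."""
    return math.sqrt(sse(actual, predicted) / len(actual))


def rmse_relative_to_range(actual, predicted):
    """RMSE as a fraction of the range of the values being estimated."""
    value_range = max(actual) - min(actual)
    return rmse(actual, predicted) / value_range


# Hypothetical ages and predictions, NOT taken from the abalone dataset.
actual_ages = [7.0, 15.0, 2.0, 10.0]
predicted_ages = [8.0, 13.0, 4.0, 9.0]

print("SSE:", sse(actual_ages, predicted_ages))
print("RMSE:", rmse(actual_ages, predicted_ages))
print("RMSE / range:", rmse_relative_to_range(actual_ages, predicted_ages))
```

A small ratio suggests the typical error is small compared with the spread of ages; what counts as "small" is a judgment call the assignment leaves to you.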

