Question:

Using Scala.

This exercise is to calculate the correlation coefficient of two numeric vectors (of the same length).

Also: to get familiar with the ideas of Scala as a functional programming language useful within Spark, and with the common use of higher-order functions (HOFs).

I have given code to carry out most of the tasks needed; you only need to add a couple of lines.

Background for the correlation coefficient:

1. First, center the vectors: x and y are centered vectors when the mean has been subtracted from each value (see the code below for how this is done).

2. The dot product (of the centered vectors) is coded for you.

3. The norm of a vector (its Euclidean length) is coded for you.

4. Theta (in radians), the angle between the two vectors, is what you are to calculate, using

x . y = |x| |y| cos(theta)   // definition of the dot product

cos(theta) IS the correlation coefficient; cos^2(theta) is the R^2 you hear about.
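The whole recipe can be sketched in plain Scala (no Spark needed); the helpers here mirror the `dot`, `norm`, `mean`, and `center` definitions given in the code further down:

```scala
type V = Vector[Double]

def dot(v: V, w: V): Double = (v zip w).map { case (a, b) => a * b }.sum
def norm(v: V): Double = math.sqrt(dot(v, v))
def mean(v: V): Double = v.sum / v.size
def center(v: V): V = { val m = mean(v); v.map(_ - m) }

// correlation coefficient = cos(theta) between the centered vectors
def corr(x: V, y: V): Double = {
  val (cx, cy) = (center(x), center(y))
  dot(cx, cy) / (norm(cx) * norm(cy))
}

val r = corr(Vector(1.0, 2.0, 3.0), Vector(-1.0, 4.0, 6.0))
println(s"r = $r, R^2 = ${r * r}")
```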

Example:

val v = Vector(1.0, 2.0, 3.0)
val cv = center(v)   // Vector(-1.0, 0.0, 1.0)

val w = Vector(-1.0, 4.0, 6.0)
val cw = center(w)   // Vector(-4.0, 1.0, 3.0)

cv . cw = (-1 * -4) + (0 * 1) + (1 * 3) = 7

|cv| = sqrt(cv . cv) = sqrt(2)

|cw| = sqrt(cw . cw) = sqrt(26)

cos(theta) = 7 / (sqrt(2) * sqrt(26))   // the correlation coefficient; theta itself is the angle in radians
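The arithmetic above can be checked directly in a Scala REPL or worksheet:

```scala
val cv = Vector(-1.0, 0.0, 1.0)
val cw = Vector(-4.0, 1.0, 3.0)

val d        = (cv zip cw).map { case (a, b) => a * b }.sum   // 7.0
val normCv   = math.sqrt(cv.map(x => x * x).sum)              // sqrt(2)  ~ 1.414
val normCw   = math.sqrt(cw.map(x => x * x).sum)              // sqrt(26) ~ 5.099
val cosTheta = d / (normCv * normCw)                          // ~ 0.9707
val thetaDeg = math.toDegrees(math.acos(cosTheta))            // angle, converted from radians to degrees
```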

If you are on Databricks, the Spark libraries are already loaded for you. If you are in IntelliJ, your build.sbt file needs to include the Spark libraries.
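A build.sbt fragment along these lines is what's meant; the version numbers shown are an assumption, so match them to your own setup:

```scala
// build.sbt (sketch -- the Scala and Spark versions here are assumptions, pick the ones you use)
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.5.1"
)
```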

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.log4j.{Level, Logger}
import scala.math._

Logger.getLogger("org").setLevel(Level.OFF)

val spark = SparkSession
  .builder()
  .appName("dfdsInteractive.sc")
  .master("local[*]")
  .getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "5")
import spark.implicits._

type S = String; type D = Double; type I = Integer
type V = Vector[D]

def dot(v: V, w: V): D = (v zip w).map { case (x, y) => x * y }.sum
def norm(v: V): D = math.sqrt(dot(v, v))
def mean(v: V): D = v.sum / v.size
def center(v: V): V = v.map(_ - mean(v))

case class Data(nr: I, height: D, weight: D)

val clientsDF = Seq(
  (1, 60.0, 120.0),
  (2, 65.0, 130.0),
  (3, 72.0, 169.0),
  (4, 70.0, 150.0),
  (5, 73.0, 140.0)
).toDF("nr", "height", "weight")
clientsDF.show()

val clientsDS = clientsDF.as[Data]
clientsDS.show()

val X = clientsDS.collect().map(c => c.height).toVector
val Y = clientsDS.collect().map(c => c.weight).toVector
val x1 = center(X)
val y1 = center(Y)

// continue on with this code: calculate cos(theta) -- theta is in radians --
// and convert that angle to degrees
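For reference, the "couple of lines" left to add could look like the sketch below. The helpers and the height/weight data are repeated here so the snippet runs standalone, without Spark; in the worksheet above you would only need the last three `val` lines, since `x1` and `y1` already exist there:

```scala
// Self-contained sketch: helpers and data copied from the worksheet above
type V = Vector[Double]
def dot(v: V, w: V): Double = (v zip w).map { case (a, b) => a * b }.sum
def norm(v: V): Double = math.sqrt(dot(v, v))
def mean(v: V): Double = v.sum / v.size
def center(v: V): V = { val m = mean(v); v.map(_ - m) }

val X = Vector(60.0, 65.0, 72.0, 70.0, 73.0)      // heights
val Y = Vector(120.0, 130.0, 169.0, 150.0, 140.0) // weights
val x1 = center(X)
val y1 = center(Y)

val cosTheta = dot(x1, y1) / (norm(x1) * norm(y1)) // this IS the correlation coefficient
val thetaRad = math.acos(cosTheta)                 // the angle theta, in radians
val thetaDeg = math.toDegrees(thetaRad)            // the same angle, in degrees
println(s"cos(theta) = $cosTheta, theta = $thetaDeg degrees")
```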
