Question: Could you please run the code below on Spark using Scala for the assignment below, and correct any mistakes you find. This was done on a MacBook Pro using Terminal. Also, PLEASE POST A SCREENSHOT OF THE OUTPUT for a thumbs up.

// Load the raw file as an RDD of lines (path is relative to where spark-shell was started)
val tf = sc.textFile("desktop/spark/linkage/block_1.csv")

// The header line of block_1.csv contains the field name "id_1"
def isHeader(line: String): Boolean = line.contains("id_1")

val noHeader = tf.filter(x => !isHeader(x))
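A quick way to confirm the filter worked is to inspect the first remaining line (an illustrative check, not part of the assignment):

noHeader.first()  // should now be a data row rather than the header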

def toDouble(s: String): Double = {
  if ("?".equals(s)) Double.NaN else s.toDouble
}

case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)

def parse(line: String) = {
  val pieces = line.split(',')
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  val matched = pieces(11).toBoolean
  val scores = pieces.slice(2, 11).map(toDouble)  // fields 2-10 are the match scores
  MatchData(id1, id2, scores, matched)
}

val parsed = noHeader.map(parse)  // note: the RDD was named noHeader above, not noHeaderRDD
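Since parsed feeds both the matched and unmatched statistics below, caching it avoids re-parsing the file on every pass (a small optional step, not required by the assignment):

parsed.cache()
parsed.count()  // materializes the RDD and reports the number of data rows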

import org.apache.spark.util.StatCounter

class NAStatCounter extends Serializable {
  val stats: StatCounter = new StatCounter()
  var missing: Long = 0

  def add(x: Double): NAStatCounter = {
    if (java.lang.Double.isNaN(x)) {
      missing += 1
    } else {
      stats.merge(x)
    }
    this
  }

  def merge(other: NAStatCounter): NAStatCounter = {
    stats.merge(other.stats)
    missing += other.missing
    this
  }

  override def toString = {
    "stats: " + stats.toString + " NaN: " + missing
  }
}

object NAStatCounter extends Serializable {
  def apply(x: Double) = new NAStatCounter().add(x)
}
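As a quick illustration of how the counter behaves (hypothetical values, just to show the API), NaNs are counted separately from the regular statistics:

val nas1 = NAStatCounter(10.0)
nas1.add(Double.NaN)                  // counted as missing, not merged into stats
val nas2 = NAStatCounter(Double.NaN)
nas1.merge(nas2)
println(nas1)                         // e.g. stats: (count: 1, mean: 10.0, ...) NaN: 2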

import org.apache.spark.rdd.RDD

def statsWithMissing(rdd: RDD[Array[Double]]): Array[NAStatCounter] = {
  val nasRDD = rdd.map(md => {
    md.map(d => NAStatCounter(d))
  })
  // Element-wise merge of the per-record counters into one counter per field
  nasRDD.reduce((n1, n2) => {
    n1.zip(n2).map { case (a, b) => a.merge(b) }
  })
}

val statsm = statsWithMissing(parsed.filter(_.matched).map(_.scores))
val statsn = statsWithMissing(parsed.filter(md => !md.matched).map(_.scores))

// For each score field: total missing count and the difference of means
// between matched and unmatched records
statsm.zip(statsn).map { case (m, n) =>
  (m.missing + n.missing, m.stats.mean - n.stats.mean)
}
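To make the result easier to read, each field's index can be printed next to its missing count and mean difference (a small optional addition, not part of the original code):

statsm.zip(statsn).zipWithIndex.foreach { case ((m, n), idx) =>
  println(s"field $idx: missing = ${m.missing + n.missing}, mean diff = ${m.stats.mean - n.stats.mean}")
}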

Write a Scala program in the Spark shell to load the block_1.csv dataset using spark.read.csv(), accessible from the Software Repository of the D2L course site, and perform the following:

1. Convert the dataset to an RDD.
2. Remove the heading (the first record/line in the dataset).
3. Convert the first two fields to integers.
4. Convert the other fields, except the last one, to doubles; question marks should become NaN. The last field should be converted to a Boolean.
5. Output an array of statistics for the fields of type Double, grouped by the last field, with minimal passes over the data: matched is the mean of the matched records, unmatched is the mean of the unmatched records, and nomissing is the count of missing records. The higher the q, the better the feature contributes to the classification.

Hints: Try to redo the parsing example from the lecture and learn basic Scala programming techniques. This homework is a small extension of that example.
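Note that the assignment asks for spark.read.csv() rather than sc.textFile, so steps 1 through 4 could be redone roughly as sketched below. This is a minimal sketch, not a verified solution: it assumes the same file path as above, reuses toDouble and MatchData from the corrected code, and relies on csv() reading every column as a string when no schema is supplied.

// Step 1: load with the DataFrame reader, then convert to an RDD of Rows
val df = spark.read.csv("desktop/spark/linkage/block_1.csv")
val rows = df.rdd

// Step 2: drop the header record (the first row of the file)
val header = rows.first()
val noHeaderRows = rows.filter(_ != header)

// Steps 3-4: ints, doubles (with "?" as NaN), and a trailing Boolean
val parsedRows = noHeaderRows.map { row =>
  val pieces = row.toSeq.map(_.toString)
  MatchData(
    pieces(0).toInt,
    pieces(1).toInt,
    pieces.slice(2, 11).map(toDouble).toArray,
    pieces(11).toBoolean
  )
}

From here, step 5 is the statsWithMissing pass shown earlier, applied to parsedRows instead of parsed.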
