Question: Introduction MapReduce can often be used to break down a complex problem into a set of smaller sub-problems. In processing geographic data, for example, we

Introduction

MapReduce can often be used to break down a complex problem into a set of smaller sub-problems. In processing geographic data, for example, we may have data which interacts with geographically nearby data, but doesn't interact with data beyond a certain distance. In these cases, we can use Spark MapReduce to subdivide the data on a grid, and process each grid cell separately or in conjunction with nearby grid cells.

In this lab, you'll be given a collection of 2D points on a grid, and a grid cell size. Each 2D point with have an alphanumeric name, an X coordinate, and a Y coordinate (X and Y are real number values; store these as type double). Grid cells will originate from the origin, and will be squares of the specified cell size.

Your program should read the points, and map each point to the cell that contains it, along with the 4 adjacent cells, 3 below and 1 to the right. These cells contain all possible pairs of points which are within the cell size distance of each other.

The output from your program will be a (key, value) pair for each cell which has points mapped to it. The key will be a cell identifier, and the value will be a list of points which are mapped to that cell. You should construct a cell name using the grid cell coordinates.

Submitting the Lab

Submit code to SVN in a directory named Lab3. Submit your .java files and your Makefile. Don't submit input files, output files, or temporary files. Your code must execute correctly on the parcom machines.

Program Arguments

Your Spark Java program must accept 4 command line arguments: master input output cellSize

As in Lab 1, "master" will be "local" to run the program locally, and "spark-yarn" to run on the cluster. The input and output parameters are the Spark input and output specifications. The cellSize parameter should be a real number specifying the cell size (must be > 0).

Input

The input is a set of CSV (Comma-Separated Values) text files. Each line in each file contains the ID of a point, the floating-point X coordinate, and the floating-point Y coordinate, separated by commas. No additional white space is present on the line. Here's an example input line:

Pt04,4.84535,4.05086

You can use the String split() method to separate the three fields, using "," as the separator, and the Double.parseDouble() (Links to an external site.)Links to an external site. method to parse the floating point values.

Once you've parsed the values for each point on initial input, store them in a Java class that you define to hold each data point. The class should implement java.io.Serializable (Links to an external site.)Links to an external site. (since the members will be of type String and double, which are Serializable, you don't need any special methods in the class). Define a toString() method in your class to output the values, separated by spaces (not commas), in the same order as the input. Use cell size 1.0 for the Short input data, and cell sizes 1.0 and 3.0 for the Full input data.

Output

The output should be generated by using the saveAsTextFile method on the final RDD. You should generate this RDD by using groupByKey and sortByKey as the final transformations. The key should be the cell ID (see below), and the value should be a list of points, with point name, X, and Y values separated by spaces (not commas). Spark will use the toString() method you defined above to format the output for each point. Here's how a sample line would appear as written by saveAsTextFile:

(3:1,[Pt00 3.49838 0.662006, Pt16 3.93387 1.98305, Pt19 2.33941 0.867107]) There are three subdirectories in the zip file: out_short_1 uses the Lab3Short input with cell size 1.0, out_full_1 uses the Lab3Full input with cell size 1.0, and out_full_3 uses the Lab3Full input with cell size 3.0.

Method

Defining a Grid of Cells

To reduce this problem to a set of smaller problems, divide the 2D space into a rectangular grid. Make the grid cell size the maximum distance between pairs of desired points (defined by the 4th program argument). This ensures that, for a point in any grid cell, all points within the cell size distance are either in the same grid cell or an adjacent grid cell. Grid cells can be designated by integer x, y coordinate values. For floating point coordinates x, y, the integer cell x, y can be calculated as follows:

int xCell = (int) Math.floor(x / cellSize);

int yCell = (int) Math.floor(y / cellSize);

Note that xCell and yCell may be negative; that's not a problem. Now that the cells are defined, a unique key string can be defined for any cell by concatenating the string representation of xCell, followed by a colon (:), followed by the string representation of yCell. For instance, the key string for xCell=5 and yCell=-1 would be "5:-1".

Mapping to the Grid

In the mapper, for each input line, use flatMapToPair to map each point to the cell containing it (xCell:yCell), the cell to the right (xCell+1:yCell), and the three adjacent points below: (xCell-1:yCell+1), (xCell:yCell+1), and (xCell+1:yCell+1).

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!

Business Research MethodologyQuestion Bank 1 1. When the marketing department of an organization attempts to determine the amount of time the managers in this department spend at their computers in...

CHA P TER 9 Understanding Software: A Primer for Managers 1. INTRODUCTION L E A R N I N G O B J E C T I V E S 1. Recognize the importance of software and its implications for the rm and strategic...

Business Research Methodology- Question Bank 1 1. When the marketing department of an organization attempts to determine the amount of time the managers in this department spend at their computers in...

Hi there! One more question for you. Similar to my last one. I have to have it done by tonight. Let me know if you cannot do it. Here it is: Scenario 3 May I Propose To... You are a partner in a...

Hi guys, I am trying to properly answer an accounting question concerning shares, bonds, and stock. I was given two annual reports, and am asked to make a decision on one of the companies depending...

You are a partner in a prestigious accounting firm, and you just received requests for proposal from two companies; Advanced Micro Devices, and Facebook. The deadline for the proposals is, as is...

Hi there! One more question for you. Similar to my last one. I have to have it done by tonight. Let me know if you cannot do it. Here it is: Scenario 3 May I Propose To... You are a partner in a...

Research Report Students name Professors name Institution Date Contents Title .. 3 Abstract 3 Acknowledgement .3 Introduction .4 Main body of Report/Methodology 5 Various kinds of isolation.6...

There are two problems due this week (each worth 35 points) as follows. Problem 1.6 (page 20) In comprehensive paragraphs, answerrequirements a to e. You will have 5 paragraphs total of four to five...

Topic: Conducting personal job interviews using the star model 1-Design a two-hour training work plan for 10 trainees 2-Determine the quality of trainees 3-Use the training design model Formulate one...

The first and second ionization energies of K are 419 kJ/mol and 3052 kJ/mol, and those of Ca are 590 kJ/mol and 1145 kJ/mol, respectively. Compare their values and comment on the differences.

Fly Corporation issued 10-year bonds on January 1, 2006. On July 1, 2012 Fly Corporation decides to retire the bonds early. The call price is $212,000 and the carrying value of the bonds on the date...

A company has purchased and installed a fixed - pitch propeller. The associated costs are illustrated in the exhibit below. Exhibit : Fixed - pitch propeller purchase and installation costs Purchase...

Help Solve 4 Required information [The following information applies to the questions displayed below.) Part 4 of 4 Indicate the amount (if any) that Josh can deduct as an ordinary and necessary...

2. What is your opinion of Apprenticeship 3.0? Is it an approach that you would consider using as an HR manager, or does it need further development?

LO4 Provide an overview of four challenges facing HR today.

2. Give some examples of ethical issues that you have experienced in jobs and explain how HR did or did not help resolve them.