Question: Introduction MapReduce can often be used to break down a complex problem into a set of smaller sub-problems. In processing geographic data, for example, we

Introduction

MapReduce can often be used to break down a complex problem into a set of smaller sub-problems. In processing geographic data, for example, we may have data which interacts with geographically nearby data, but doesn't interact with data beyond a certain distance. In these cases, we can use Spark MapReduce to subdivide the data on a grid, and process each grid cell separately or in conjunction with nearby grid cells.

In this lab, you'll be given a collection of 2D points on a grid, and a grid cell size. Each 2D point with have an alphanumeric name, an X coordinate, and a Y coordinate (X and Y are real number values; store these as type double). Grid cells will originate from the origin, and will be squares of the specified cell size.

Your program should read the points, and map each point to the cell that contains it, along with the 4 adjacent cells, 3 below and 1 to the right. These cells contain all possible pairs of points which are within the cell size distance of each other.

The output from your program will be a (key, value) pair for each cell which has points mapped to it. The key will be a cell identifier, and the value will be a list of points which are mapped to that cell. You should construct a cell name using the grid cell coordinates.

Submitting the Lab

Submit code to SVN in a directory named Lab3. Submit your .java files and your Makefile. Don't submit input files, output files, or temporary files. Your code must execute correctly on the parcom machines.

Program Arguments

Your Spark Java program must accept 4 command line arguments: master input output cellSize

As in Lab 1, "master" will be "local" to run the program locally, and "spark-yarn" to run on the cluster. The input and output parameters are the Spark input and output specifications. The cellSize parameter should be a real number specifying the cell size (must be > 0).

Input

The input is a set of CSV (Comma-Separated Values) text files. Each line in each file contains the ID of a point, the floating-point X coordinate, and the floating-point Y coordinate, separated by commas. No additional white space is present on the line. Here's an example input line:

Pt04,4.84535,4.05086

You can use the String split() method to separate the three fields, using "," as the separator, and the Double.parseDouble() (Links to an external site.)Links to an external site. method to parse the floating point values.

Once you've parsed the values for each point on initial input, store them in a Java class that you define to hold each data point. The class should implement java.io.Serializable (Links to an external site.)Links to an external site. (since the members will be of type String and double, which are Serializable, you don't need any special methods in the class). Define a toString() method in your class to output the values, separated by spaces (not commas), in the same order as the input. Use cell size 1.0 for the Short input data, and cell sizes 1.0 and 3.0 for the Full input data.

Output

The output should be generated by using the saveAsTextFile method on the final RDD. You should generate this RDD by using groupByKey and sortByKey as the final transformations. The key should be the cell ID (see below), and the value should be a list of points, with point name, X, and Y values separated by spaces (not commas). Spark will use the toString() method you defined above to format the output for each point. Here's how a sample line would appear as written by saveAsTextFile:

(3:1,[Pt00 3.49838 0.662006, Pt16 3.93387 1.98305, Pt19 2.33941 0.867107]) There are three subdirectories in the zip file: out_short_1 uses the Lab3Short input with cell size 1.0, out_full_1 uses the Lab3Full input with cell size 1.0, and out_full_3 uses the Lab3Full input with cell size 3.0.

Method

Defining a Grid of Cells

To reduce this problem to a set of smaller problems, divide the 2D space into a rectangular grid. Make the grid cell size the maximum distance between pairs of desired points (defined by the 4th program argument). This ensures that, for a point in any grid cell, all points within the cell size distance are either in the same grid cell or an adjacent grid cell. Grid cells can be designated by integer x, y coordinate values. For floating point coordinates x, y, the integer cell x, y can be calculated as follows:

int xCell = (int) Math.floor(x / cellSize);
int yCell = (int) Math.floor(y / cellSize);

Note that xCell and yCell may be negative; that's not a problem. Now that the cells are defined, a unique key string can be defined for any cell by concatenating the string representation of xCell, followed by a colon (:), followed by the string representation of yCell. For instance, the key string for xCell=5 and yCell=-1 would be "5:-1".

Mapping to the Grid

In the mapper, for each input line, use flatMapToPair to map each point to the cell containing it (xCell:yCell), the cell to the right (xCell+1:yCell), and the three adjacent points below: (xCell-1:yCell+1), (xCell:yCell+1), and (xCell+1:yCell+1).

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!