Question: MapReduce

You are a data scientist working with the United States Internal Revenue Service. The IRS maintains a registry of all United States individual taxpayers. For each taxpayer, the IRS stores the following attributes (this is not the full list):

  1. First Name,
  2. Middle Name 1,
  3. Middle Name 2,
  4. Last Name,
  5. Street (Address),
  6. City,
  7. State,
  8. Zip

The IRS wants to match its records (the registry of taxpayers) to 200 million DMV records. The DMV records contain the same attributes as the IRS records. The IRS must determine whether a pair of (DMV, IRS) records refers to the same individual. To do this, it must compute a similarity score between every possible pair of DMV and IRS records. If there are 200 million IRS records, each DMV record will have 200 million possible matches and therefore 200 million similarity scores. The IRS would like to end up with a collection of pairs, e.g. (DMVRecord_1, IRSRecord_234345), each representing the highest-scoring match a DMV record had with any IRS record. The final output will have 200 million pairs (the same number as there are DMV records). Assuming that the IRS has given you a function that determines the similarity between the two records in a candidate pair, your job is to design a MapReduce application that generates the final matches.
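
To make the target output concrete, here is a minimal single-machine sketch of the computation described above: score every (DMV, IRS) pair and keep the best IRS match for each DMV record. The `similarity` function is a hypothetical stand-in for the IRS-supplied function (it simply returns the fraction of equal fields), and the record IDs and sample records are illustrative assumptions. With 200 million records on each side this loop would need 4 × 10^16 similarity calls, which is why the work has to be distributed.

```python
from typing import Dict, Tuple

# Assumed layout: (first, middle1, middle2, last, street, city, state, zip)
Record = Tuple[str, ...]


def similarity(a: Record, b: Record) -> float:
    """Hypothetical stand-in for the IRS function: fraction of equal fields."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_matches(
    dmv: Dict[str, Record], irs: Dict[str, Record]
) -> Dict[str, Tuple[str, float]]:
    """For each DMV record ID, return (best IRS record ID, score)."""
    result = {}
    for dmv_id, dmv_rec in dmv.items():
        # Scan all IRS records; on ties, max() keeps the first one seen.
        best_id = max(irs, key=lambda irs_id: similarity(dmv_rec, irs[irs_id]))
        result[dmv_id] = (best_id, similarity(dmv_rec, irs[best_id]))
    return result


# Illustrative toy data, not real records.
dmv = {
    "DMV_1": ("Ann", "", "", "Lee", "1 Main St", "Austin", "TX", "73301"),
    "DMV_2": ("Bob", "J", "", "Ray", "9 Oak Ave", "Boise", "ID", "83702"),
}
irs = {
    "IRS_10": ("Bob", "J", "", "Ray", "9 Oak Ave", "Boise", "ID", "83702"),
    "IRS_11": ("Ann", "", "", "Lee", "1 Main Street", "Austin", "TX", "73301"),
}
matches = best_matches(dmv, irs)
```

The output has exactly one entry per DMV record, which mirrors the "200 million pairs" requirement in the problem statement.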

Please answer the following questions, in your own words, on how you would design the MapReduce job:

1. The Mapper (1)

a. What is the input key and value combination? (Give the data types for the input key and the input value.)

b. What should the map function do to each input key-value pair? Please be detailed and specific.

c. What is the output key-value pair that is sent to the reducer? (Give the data types for the output key and the output value.)

2. The Reducer (1)

a. What are the data types for the key and values submitted by the mapper?

b. What will the reducer do? What type of aggregation is required here?

c. What data types are needed for the key and value output by the reducer?

1. The Mapper (2)

a. What is the input key and value combination? (Give the data types for the input key and the input value.)

b. What should the map function do to each input key-value pair? Please be detailed and specific.

c. What is the output key-value pair that is sent to the reducer? (Give the data types for the output key and the output value.)

2. The Reducer (2)

a. What are the data types for the key and values submitted by the mapper?

b. What will the reducer do? What type of aggregation is required here?

c. What data types are needed for the key and value output by the reducer?
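
As a starting point for thinking about the two jobs, the sketch below simulates one possible design in plain Python. It is not the expected answer, only an illustration under stated assumptions. Job 1 hashes each record into a block and replicates it across a grid of cells so that every (DMV, IRS) pair meets in exactly one reducer, which emits the best local match per DMV record. Job 2 groups those candidates by DMV ID and keeps the maximum. The `run_job` helper, the `N_BLOCKS` setting, the `"DMV"`/`"IRS"` source tags, and the `similarity` function are all hypothetical stand-ins, not part of any MapReduce framework or of the IRS-supplied code.

```python
import zlib
from collections import defaultdict

N_BLOCKS = 2  # assumed number of blocks per side


def similarity(a, b):
    """Hypothetical stand-in for the IRS function: fraction of equal fields."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def block_of(record_id):
    # Deterministic hash so every mapper assigns the same block.
    return zlib.crc32(record_id.encode()) % N_BLOCKS


def run_job(pairs, mapper, reducer):
    """Tiny in-memory stand-in for map -> shuffle -> reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [out for key, values in groups.items() for out in reducer(key, values)]


# Job 1: input (record ID, (source tag, record)); output (DMV ID, (IRS ID, score)).
def map1(record_id, tagged):
    source, record = tagged
    b = block_of(record_id)
    for other in range(N_BLOCKS):
        # DMV records fill a grid row, IRS records fill a grid column.
        cell = (b, other) if source == "DMV" else (other, b)
        yield cell, (source, record_id, record)


def reduce1(cell, values):
    dmv = [(i, r) for s, i, r in values if s == "DMV"]
    irs = [(i, r) for s, i, r in values if s == "IRS"]
    if not irs:
        return
    for dmv_id, dmv_rec in dmv:
        # Best IRS match within this cell only.
        score, irs_id = max((similarity(dmv_rec, r), i) for i, r in irs)
        yield dmv_id, (irs_id, score)


# Job 2: identity map, then a max aggregation per DMV ID.
def map2(dmv_id, candidate):
    yield dmv_id, candidate


def reduce2(dmv_id, candidates):
    yield dmv_id, max(candidates, key=lambda c: c[1])


# Illustrative toy data, not real records.
dmv = {
    "DMV_1": ("Ann", "", "", "Lee", "1 Main St", "Austin", "TX", "73301"),
    "DMV_2": ("Bob", "J", "", "Ray", "9 Oak Ave", "Boise", "ID", "83702"),
}
irs = {
    "IRS_10": ("Bob", "J", "", "Ray", "9 Oak Ave", "Boise", "ID", "83702"),
    "IRS_11": ("Ann", "", "", "Lee", "1 Main Street", "Austin", "TX", "73301"),
}
inputs = [(i, ("DMV", r)) for i, r in dmv.items()]
inputs += [(i, ("IRS", r)) for i, r in irs.items()]

job1_out = run_job(inputs, map1, reduce1)
final = dict(run_job(job1_out, map2, reduce2))
```

Splitting the work this way keeps the expensive pairwise scoring in Job 1 and leaves Job 2 as a simple maximum per key. In this sketch every key is a string or a tuple of integers and every value is a tuple, which is the kind of data-type detail the questions above ask you to state for a real framework.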
