Question:
Use MapReduce to Remove Duplicated Records
For this program, we will have two input files.
file1.txt
2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-7 c
2012-3-3 c
file2.txt
2012-3-1 b
2012-3-2 a
2012-3-3 b
2012-3-4 d
2012-3-5 a
2012-3-6 c
2012-3-7 d
2012-3-3 c
Notice that these two files contain some duplicated records (records that appear more than once). Our goal is to remove the duplicates so that each distinct record is written only once in the output; save the output as file3.txt.
For example, the output would look like:
2012-3-1 a
2012-3-1 b
2012-3-2 a
2012-3-2 b
2012-3-3 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-6 c
2012-3-7 c
2012-3-7 d
The basic idea is to read every record in the mapper and emit the whole record as the key. The shuffle phase groups identical keys together, so the reducer receives each distinct record exactly once as a key and simply writes it out once, no matter how many times it appeared in the input files.
As the first step, create the project using Maven, and create file1.txt and file2.txt with the contents described above.
In the mapper, output the entire record as the key (the value carries no information). In the reducer, output each key once, which guarantees every record is unique in the output. A minimal sketch is shown below.
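The package and class name in this sketch follow the run command used later in this assignment (edu.missouri.hadoop.DeDup); the inner class names DeDupMapper and DeDupReducer and the use of NullWritable as the value type are assumptions made for illustration, not the required implementation.

package edu.missouri.hadoop;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DeDup {

    // Mapper: emit the whole input line as the key; the value carries no information.
    public static class DeDupMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    // Reducer: identical records arrive grouped under one key, so writing the key
    // once removes all duplicates.
    public static class DeDupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "dedup");
        job.setJarByClass(DeDup.class);
        job.setMapperClass(DeDupMapper.class);
        // Using the reducer as a combiner is safe here because the reduce step is
        // idempotent: it only emits each key once.
        job.setCombinerClass(DeDupReducer.class);
        job.setReducerClass(DeDupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}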
After you implement the mapper and reducer, compile the program.
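Assuming the Maven artifactId is DeDup and the version is 1.0 (matching the jar name in the run command below), packaging the project would look like:
$ mvn clean package
which produces target/DeDup-1.0.jar.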
Use scp to upload your file1.txt and file2.txt to the master node, then put them into a directory you create in HDFS.
Use the following as an example; remember to replace my file path and machine address with your own. (If you use EC2, the user id is ubuntu; if you use EMR, the user id is hadoop.)

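For example, with an EMR master node (the key file mykey.pem and the host address are placeholders you must replace with your own):
$ scp -i mykey.pem file1.txt file2.txt hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/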
When you complete your program, upload your jar file to the master node and run it using a command similar to the following (use your own input and output directories):
$ hadoop jar DeDup-1.0.jar edu.missouri.hadoop.DeDup /user/hadoop/dedup_in /user/hadoop/dedup_out
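Once the job finishes, the deduplicated records can be inspected directly from HDFS; assuming the output directory above and a single reducer, the output file name part-r-00000 is the Hadoop default:
$ hadoop fs -cat /user/hadoop/dedup_out/part-r-00000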
Submit your implementation and paste a screenshot of your output in your report.
[hadoop@ip-172-31-24-121 ~]$ hadoop fs -mkdir /user/hadoop/dedup_in
[hadoop@ip-172-31-24-121 ~]$ hadoop fs -put file1.txt /user/hadoop/dedup_in
[hadoop@ip-172-31-24-121 ~]$ hadoop fs -put file2.txt /user/hadoop/dedup_in
[hadoop@ip-172-31-24-121 ~]$ hadoop fs -ls /user/hadoop/dedup_in
Found 2 items
-rw-r--r--   1 hadoop supergroup         88 2014-06-30 06:22 /user/hadoop/dedup_in/file1.txt
-rw-r--r--   1 hadoop supergroup         88 2014-06-30 06:22 /user/hadoop/dedup_in/file2.txt