Question:

package org.myorg;

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WordCount extends Configured implements Tool {

    private static final Logger LOG = Logger.getLogger(WordCount.class);

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new WordCount(), args);
        System.exit(res);
    }

    // Configures and submits the job: args[0] is the input path(s), args[1] the output path.
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), " wordcount ");
        job.setJarByClass(this.getClass());

        FileInputFormat.addInputPaths(job, args[0]);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Emits (word, 1) for every non-empty token of each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

        @Override
        public void map(LongWritable offset, Text lineText, Context context)
                throws IOException, InterruptedException {

            String line = lineText.toString();
            Text currentWord = new Text();

            for (String word : WORD_BOUNDARY.split(line)) {
                if (word.isEmpty()) {
                    continue;
                }
                currentWord = new Text(word);
                context.write(currentWord, one);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
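The mapper's isEmpty() check exists because the WORD_BOUNDARY pattern can match with zero width, so split() may return empty strings between words. The following is a minimal standalone sketch of that tokenization step, without any Hadoop dependency. The class name WordBoundaryDemo and the helper tokens() are made up for illustration and are not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex is taken from the WordCount mapper.
class WordBoundaryDemo {
    // Same pattern as WORD_BOUNDARY in the WordCount mapper.
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

    // Splits a line on word boundaries and drops the empty strings
    // that zero-width matches can leave behind.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        for (String piece : WORD_BOUNDARY.split(line)) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the four words of the example line, in order.
        System.out.println(tokens("Hadoop is yellow Hadoop"));
    }
}
```

Note that this pattern is case-sensitive, so "Hadoop" and "hadoop" would be counted as different words unless the line is lowercased first.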

Second part: Your task!!!

1. You may first need to get Hadoop running and run the example WordCount program. Download the output from the cluster and then submit your results.

2. You should then modify the WordCount program so it outputs the word count for each distinct word in each file. The output of this DocWordCount program should be of the form 'word#####filename count', where '#####' serves as a delimiter between word and filename, and a tab serves as a delimiter between filename and count. Submit your source code in a file named DocWordCount.java. (The input here should be the Canterbury corpus provided in the package.)

Explanation: Consider two simple files, file1.txt and file2.txt.

$ echo "Hadoop is yellow Hadoop" > file1.txt
$ echo "yellow Hadoop is an elephant" > file2.txt

Running DocWordCount.java on these two files will give an output similar to that below, where ##### is a delimiter.

Output of DocWordCount.java
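One way to think about the required change is that the map key becomes word + "#####" + filename instead of the word alone, and the reducer stays the same. The sketch below simulates that keying in plain Java, with no Hadoop dependency, on the two example files. The class DocWordCountSketch and the method countInto() are hypothetical names used only for illustration; this is not the assignment solution. In an actual Hadoop mapper, the filename is commonly obtained with ((FileSplit) context.getInputSplit()).getPath().getName(), using org.apache.hadoop.mapreduce.lib.input.FileSplit, which assumes a file-based input format.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical sketch: simulates the DocWordCount keying in memory.
class DocWordCountSketch {
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");
    static final String DELIMITER = "#####";

    // "Map" step: build the key word#####filename for each non-empty token.
    // "Reduce" step: Map.merge sums the 1s emitted for each key.
    static void countInto(Map<String, Integer> counts, String fileName, String line) {
        for (String word : WORD_BOUNDARY.split(line)) {
            if (word.isEmpty()) {
                continue;
            }
            counts.merge(word + DELIMITER + fileName, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        countInto(counts, "file1.txt", "Hadoop is yellow Hadoop");
        countInto(counts, "file2.txt", "yellow Hadoop is an elephant");
        // One line per key: word#####filename, a tab, then the count.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Because the filename is part of the key, "Hadoop" in file1.txt and "Hadoop" in file2.txt are counted separately, which is the behavior the task asks for.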
