Question: As a data engineer, you are asked to do text analysis to find out the set of words that are frequently used within a file.

For this, you need to write a MapReduce program that identifies all the words whose length is greater than 5 and whose frequency of occurrence is greater than 100.

Input Dataset: The dataset is located at hdfs:///bigdatapgp/common_folder/assignment3/frequence

Constraints:

  • Consider only letters and digits, and ignore any special characters (. , : ; - + etc.) when splitting the text into words.
  • Treat ROMAN, Roman, and roman as the same word (i.e. roman) when calculating the frequency.
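The two constraints above can be sketched as a small tokenizer in plain Java. This is only an illustration: the class name WordTokenizer and the method tokenize are hypothetical, and it assumes "Alphabets and Digits" means ASCII letters and digits. In a Hadoop job, the same logic would typically sit inside the mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordTokenizer {
    // Hypothetical helper: split on any run of characters that are not
    // ASCII letters or digits, then lower-case each token so that
    // ROMAN, Roman and roman all become "roman".
    public static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("[^A-Za-z0-9]+")) {
            // split() yields an empty first token when the line starts
            // with a separator, so skip empty strings.
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [roman, roman, roman, siward]
        System.out.println(tokenize("ROMAN, Roman; roman-siward."));
    }
}
```

Lower-casing with Locale.ROOT avoids locale-specific surprises (for example the Turkish dotless i) when normalizing case.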

Expected Output: List each word along with its frequency, separated by a space. For example,

roman 300 siward 240

....
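To make the filtering and output format concrete, here is a minimal in-memory sketch of the map and reduce phases in plain Java, without the Hadoop API. The class name FrequentWords, the method names, and the one-pair-per-line output are assumptions for illustration. The thresholds are parameters because a strict length > 5 filter excludes five-letter words such as roman.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class FrequentWords {
    // Simulated map phase: emit each lower-cased word whose length
    // is strictly greater than minLength.
    public static List<String> map(String line, int minLength) {
        List<String> emitted = new ArrayList<>();
        for (String token : line.split("[^A-Za-z0-9]+")) {
            if (token.length() > minLength) {
                emitted.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return emitted;
    }

    // Simulated shuffle and reduce phases: sum the counts per word, keep
    // words whose count is strictly greater than minCount, and format
    // each result as "word count".
    public static List<String> run(List<String> lines, int minLength, int minCount) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by word
        for (String line : lines) {
            for (String word : map(line, minLength)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > minCount) {
                output.add(entry.getKey() + " " + entry.getValue());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Siward, SIWARD; siward!", "Roman siward");
        // Small count threshold for the demo; the assignment uses 5 and 100.
        run(lines, 5, 2).forEach(System.out::println); // prints "siward 4"
    }
}
```

In an actual Hadoop job, the body of map would go in the mapper and the summing and count filter in the reducer. Applying the length filter on the map side keeps short words out of the shuffle.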

Expected Solution: You need to paste the MapReduce code, the Hadoop commands, and the path of the final JAR used to produce this output.
