Question:

---------------------------------------------------------------------------------------

#mrex1.py

'''
Implement a MapReduce program that computes the most popular bigram (2-gram) of all time in
the dataset (as determined by the count field).
'''

from mrjob.job import MRJob
from mrjob.step import MRStep

class MyMRJob(MRJob):

    def mapper(self, _, line):
        data = line.split('\t')
        ngram = data[0].strip()
        year = data[1].strip()
        count = data[2].strip()
        # data[3] (page count) and data[4] (book count) are not needed here.
        # Emit key-value pairs where key is "ngram year" and value is the count of the ngram.
        yield ngram + ' ' + year, int(count)

    def reducer(self, key, list_of_values):
        # Send all (count, "ngram year") pairs to the same reducer (key None),
        # so we can easily use Python's max() function.
        yield None, (sum(list_of_values), key)

    def reducer2(self, _, list_of_values):
        # Reducer 2 gets input tuples as follows:
        # None, [(212, 'cloud computing 2006'), (156, 'mobile phones 2003')]
        # max() returns the tuple with the largest count.
        yield None, max(list_of_values)

    def steps(self):
        # MRStep is the current way to declare steps; self.mr() is deprecated.
        return [MRStep(mapper=self.mapper, reducer=self.reducer),
                MRStep(reducer=self.reducer2)]

if __name__ == '__main__':
    MyMRJob.run()
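The two-step flow of mrex1.py can be traced locally without mrjob or Hadoop. The sketch below is only an illustration: the sample rows are made up, the helper names `map_row` and `most_popular` are hypothetical, and a space is placed between the bigram and the year in the key for readability.

```python
from collections import defaultdict

# Made-up sample rows in the dataset's tab-separated layout:
# ngram, year, count, page count, book count
SAMPLE_ROWS = [
    "cloud computing\t2006\t200\t150\t40",
    "cloud computing\t2006\t12\t10\t3",
    "mobile phones\t2003\t156\t120\t30",
    "cloud computing\t2005\t90\t70\t20",
]

def map_row(line):
    """Mirror of the mapper: emit ("ngram year", count)."""
    data = line.split("\t")
    return data[0].strip() + " " + data[1].strip(), int(data[2].strip())

def most_popular(rows):
    # Shuffle phase: group the mapper output by key.
    grouped = defaultdict(list)
    for line in rows:
        key, count = map_row(line)
        grouped[key].append(count)
    # Step 1 reducer: total count per key, all emitted under the single key None.
    totals = [(sum(counts), key) for key, counts in grouped.items()]
    # Step 2 reducer: pick the (total, key) pair with the largest total.
    return max(totals)

print(most_popular(SAMPLE_ROWS))  # (212, 'cloud computing 2006')
```

Because the key combines the bigram with the year, the result is the top (bigram, year) pair. If "of all time" is read as a total across all years, keying on the bigram alone would be the change to make.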

-----------------------------------------------------------------------------------

#mrex2.py

"""
Implement a MapReduce program that computes the most common bigram in each year in the
dataset (as determined by the count field).

Output of the program should include: (year, bigram, count)

Example output: (2001, mobile phone, 5002) means that in the year 2001 the most popular
bigram was 'mobile phone' and it appeared 5002 times in all the books in that year.

Emit such tuples for each year in the dataset.
"""

from mrjob.job import MRJob

class MyMRJob(MRJob):

    def mapper(self, _, line):
        data = line.split('\t')
        ngram = data[0].strip()
        year = data[1].strip()
        count = data[2].strip()
        # Put the count first so that max() in the reducer compares by count.
        yield year, (int(count), ngram)

    def reducer(self, key, list_of_values):
        # key is the year; max() picks the (count, ngram) pair with the largest count.
        yield key, max(list_of_values)

if __name__ == '__main__':
    MyMRJob.run()
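The per-year logic of mrex2.py can be checked the same way on a handful of rows. This is a minimal sketch under stated assumptions: the sample rows are invented and `top_bigram_per_year` is a hypothetical helper that groups by year in memory instead of relying on the MapReduce shuffle.

```python
from collections import defaultdict

# Made-up sample rows: ngram, year, count, page count, book count
SAMPLE_ROWS = [
    "mobile phone\t2001\t5002\t4000\t900",
    "cloud computing\t2001\t120\t100\t30",
    "cloud computing\t2006\t212\t150\t40",
    "mobile phone\t2006\t180\t140\t35",
]

def top_bigram_per_year(rows):
    # Mapper plus shuffle: collect (count, ngram) pairs under each year.
    by_year = defaultdict(list)
    for line in rows:
        data = line.split("\t")
        ngram, year, count = data[0].strip(), data[1].strip(), int(data[2].strip())
        by_year[year].append((count, ngram))
    # Reducer: the largest (count, ngram) pair wins for each year.
    result = []
    for year in sorted(by_year):
        count, ngram = max(by_year[year])
        result.append((year, ngram, count))
    return result

for row in top_bigram_per_year(SAMPLE_ROWS):
    print(row)
```

Since tuples compare element by element, two bigrams with the same count in a year are separated by the alphabetical order of the bigram text.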

--------------------------------------------------------------------------------

#mrjob.conf

runners:
  emr:
    aws_access_key_id:
    aws_secret_access_key:
    ec2_key_pair: mykeypair
    ec2_key_pair_file: /home/ubuntu/mykeypair.pem
    ssh_tunnel_to_job_tracker: true
    ec2_instance_type: m3.xlarge
    num_ec2_instances: 1

Problem 2 - MapReduce for analyzing Google n-gram Dataset

The Google n-gram dataset is a freely available collection of n-grams (fixed-size tuples of words) extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.

For example, for the line 'Python is a high level language', the 2-grams are:

(Python, is) (is, a) (a, high) (high, level) (level, language)

Format

Each row of data contains:

1) the n-gram itself
2) the year in which the n-gram appeared
3) the number of times the n-gram appeared in the books from the corresponding year (count)
4) the number of pages on which the n-gram appeared in this year (page count)
5) the number of distinct books in which the n-gram appeared in this year (book count)
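The sliding-window idea and the five-field row layout can be sketched in a few lines of Python. The names `NgramRow`, `bigrams`, and `parse_row` are hypothetical, and the sketch assumes tab-separated rows (as the `split('\t')` calls in the mappers suggest) and whitespace tokenization, which the real dataset's tokenizer may not match.

```python
from typing import List, NamedTuple, Tuple

class NgramRow(NamedTuple):
    """One row of the dataset, in the field order listed above."""
    ngram: str
    year: int
    count: int
    page_count: int
    book_count: int

def bigrams(text: str) -> List[Tuple[str, str]]:
    # Slide a window of size 2 over the whitespace-separated words.
    words = text.split()
    return list(zip(words, words[1:]))

def parse_row(line: str) -> NgramRow:
    # Assumes exactly five tab-separated fields per row.
    ngram, year, count, pages, books = line.rstrip("\n").split("\t")
    return NgramRow(ngram.strip(), int(year), int(count), int(pages), int(books))

print(bigrams("Python is a high level language"))
print(parse_row("high level\t2001\t5002\t4000\t900"))
```

A named tuple keeps the field order explicit, so a mapper can refer to `row.count` instead of `data[2]`.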
