Question: Using the MapReduce concepts, you will write a Python program that will index a set of documents (provided below) and build a function to search for the frequencies of words in these documents.
Data: Assignment 1 data
How to Start
Index the files
Read each text file provided and convert each word in the file to lowercase.
Create a list with words from each text file.
Remove stop words from each list to get the final list of words for each text file. (List of stopwords: stopwords.txt)
For each word, build a dictionary whose KEY is the document ID (file name) and whose VALUE is the frequency (the number of times the word appears in that particular file).
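One possible way to carry out the reading, lowercasing and stopword-removal steps above is sketched here. The helper names `load_stopwords` and `load_documents` are made up for illustration. The sketch assumes that stopwords.txt holds whitespace-separated words and is stored outside the folder that contains the documents. Punctuation is not stripped.

```python
from pathlib import Path


def load_stopwords(path):
    """Return the stopwords as a lowercase set (assumes whitespace-separated words)."""
    return set(Path(path).read_text(encoding="utf-8").lower().split())


def load_documents(folder, stopwords):
    """Return {file name: list of lowercase words with stopwords removed}."""
    documents = {}
    for file_path in sorted(Path(folder).glob("*.txt")):
        # Lowercase the whole file, then split on whitespace to get the words.
        words = file_path.read_text(encoding="utf-8").lower().split()
        documents[file_path.name] = [w for w in words if w not in stopwords]
    return documents
```

A call such as `load_documents("data", load_stopwords("stopwords.txt"))` would then give one cleaned word list per file, where "data" is a placeholder folder name.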
Fire a query. The query should:
Take a word as input.
Create a list with words from each text file.
Remove stop words.
Score each document by summing the frequency of the input word in the document.
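The query steps above could be sketched as follows. The function name `score_documents` and the nested-dictionary index layout (word, then document ID, then frequency) are assumptions chosen for illustration, not part of the assignment text.

```python
def score_documents(index, query, stopwords=frozenset()):
    """Return {document ID: frequency} for a one-word query, or {} if nothing matches."""
    # Apply the same preprocessing as for the documents: lowercase, split, drop stopwords.
    words = [w for w in query.lower().split() if w not in stopwords]
    if len(words) != 1:
        return {}
    return dict(index.get(words[0], {}))


# Hypothetical index fragment using the assumed layout.
index = {
    "austin": {"02.txt": 1, "04.txt": 1},
    "texas": {"02.txt": 2, "04.txt": 1},
}
scores = score_documents(index, "Austin", stopwords={"the", "of"})
print(sum(scores.values()), scores)  # 2 {'02.txt': 1, '04.txt': 1}
```

Summing the values of the returned dictionary gives the total frequency of the word across all documents.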
Deliverables
Your deliverable for this assignment should be a Jupyter notebook (Python) named "search.ipynb".
An Example
Indexing example
01.txt: A quick brown fox jumps over the lazy dog
02.txt: Austin is the capital of Texas Houston is not the capital of Texas
03.txt: I am going home I shall eat after I reach home
04.txt: I am going to the University of Texas Austin
Step 1: Lowercase conversion
Output
01.txt: a quick brown fox jumps over the lazy dog
02.txt: austin is the capital of texas houston is not the capital of texas
03.txt: i am going home i shall eat after i reach home
04.txt: i am going to the university of texas austin
Step 2: Splitting the text into list of words
Output
01.txt: [a,quick,brown,fox,jumps,over,the,lazy,dog]
02.txt: [austin,is,the,capital,of,texas,houston,is,not,the,capital,of,texas]
03.txt: [i,am,going,home,i,shall,eat,after,i,reach,home]
04.txt: [i,am,going,to,the,university,of,texas,austin]
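Steps 1 and 2 can be reproduced on the four sample documents with a short sketch. The function name `tokenize` is chosen here for illustration, and splitting on whitespace is an assumption that works because the sample text has no punctuation.

```python
documents = {
    "01.txt": "A quick brown fox jumps over the lazy dog",
    "02.txt": "Austin is the capital of Texas Houston is not the capital of Texas",
    "03.txt": "I am going home I shall eat after I reach home",
    "04.txt": "I am going to the University of Texas Austin",
}


def tokenize(text):
    """Step 1 (lowercase) followed by Step 2 (split into a list of words)."""
    return text.lower().split()


tokens = {name: tokenize(text) for name, text in documents.items()}
print(tokens["04.txt"])
# ['i', 'am', 'going', 'to', 'the', 'university', 'of', 'texas', 'austin']
```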
Step 3: Remove stopwords
Output
01.txt: [quick,brown,fox,jumps,lazy,dog] (removed: a, over, the)
02.txt: [austin,capital,texas,houston,capital,texas] (removed: is, the, of, not)
03.txt: [going,home,shall,eat,reach,home] (removed: i, am, after)
04.txt: [going,university,texas,austin] (removed: i, am, to, the, of)
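A minimal sketch of Step 3 follows. The set `EXAMPLE_STOPWORDS` is inferred from the index shown in this example (it holds only the words that do not survive into the final index). The real stopwords.txt is presumably longer, so treat the set as a stand-in.

```python
# Stand-in stopword set inferred from the worked example, not the real stopwords.txt.
EXAMPLE_STOPWORDS = {"a", "over", "the", "is", "of", "not", "i", "am", "after", "to"}


def remove_stopwords(words, stopwords):
    """Keep the words that are not stopwords, preserving order and duplicates."""
    return [w for w in words if w not in stopwords]


words_03 = ["i", "am", "going", "home", "i", "shall", "eat", "after", "i", "reach", "home"]
print(remove_stopwords(words_03, EXAMPLE_STOPWORDS))
# ['going', 'home', 'shall', 'eat', 'reach', 'home']
```

Using a set for the stopwords keeps each membership check fast even when the stopword list is long.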
Step 4: Putting them in the dict (indexing)
Output:
Key          Value (documents it appeared in and number of times)
quick        [(01,1)]
brown        [(01,1)]
fox          [(01,1)]
jumps        [(01,1)]
lazy         [(01,1)]
dog          [(01,1)]
austin       [(02,1),(04,1)]
capital      [(02,2)]
texas        [(02,2),(04,1)]
houston      [(02,1)]
going        [(03,1),(04,1)]
home         [(03,2)]
shall        [(03,1)]
eat          [(03,1)]
reach        [(03,1)]
university   [(04,1)]
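The index above can be produced in a MapReduce style, sketched here in plain Python with no framework. The function names `map_document`, `shuffle` and `reduce_word` are illustrative, and the input is assumed to be the word lists that remain after stopword removal.

```python
from collections import Counter, defaultdict


def map_document(doc_id, words):
    """Map step: emit one (word, doc_id) pair per word occurrence."""
    for word in words:
        yield word, doc_id


def shuffle(pairs):
    """Group the emitted document IDs by word, as a MapReduce framework would."""
    grouped = defaultdict(list)
    for word, doc_id in pairs:
        grouped[word].append(doc_id)
    return grouped


def reduce_word(word, doc_ids):
    """Reduce step: count the occurrences of one word per document."""
    return word, sorted(Counter(doc_ids).items())


# Word lists after stopword removal, keyed by document ID.
filtered = {
    "01": ["quick", "brown", "fox", "jumps", "lazy", "dog"],
    "02": ["austin", "capital", "texas", "houston", "capital", "texas"],
    "03": ["going", "home", "shall", "eat", "reach", "home"],
    "04": ["going", "university", "texas", "austin"],
}

pairs = [pair for doc_id, words in filtered.items() for pair in map_document(doc_id, words)]
index = dict(reduce_word(word, ids) for word, ids in shuffle(pairs).items())
print(index["texas"])  # [('02', 2), ('04', 1)]
```

Keeping the map, shuffle and reduce phases as separate functions mirrors the MapReduce model, even though everything runs in one process here.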
Fire the Query
If a user inputs the word "Austin", the program should output:
"The word Austin is repeated two times,
1 time in doc 2
1 time in doc 4"
(The output can be as fancy as you want it to be as long as it does not deviate from the goal)
Make sure that the user can only input one word at a time. The program should only stop if the user inputs the letter 'q'.
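The input rules can be sketched as a small loop. The function name `run_query_loop`, the prompt text and the output wording are assumptions, and the index is assumed to map each word to a list of (document ID, count) pairs. The `read` and `write` parameters exist only so the loop can be exercised without a keyboard. One consequence of the 'q' rule is that the word "q" itself cannot be searched.

```python
def run_query_loop(index, read=input, write=print):
    """Keep answering one-word queries until the user enters 'q'."""
    while True:
        raw = read("Enter one word (q to quit): ").strip()
        if raw.lower() == "q":
            break
        # Reject empty input and anything containing more than one word.
        if len(raw.split()) != 1:
            write("Please enter exactly one word.")
            continue
        postings = index.get(raw.lower(), [])
        total = sum(count for _, count in postings)
        write(f"The word {raw} is repeated {total} time(s)")
        for doc_id, count in postings:
            write(f"{count} time(s) in doc {doc_id}")
```

Calling `run_query_loop(index)` with the finished index starts the interactive session.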
