Question: Using MapReduce concepts, write a Python program that indexes a set of documents (provided below) and builds a function to search for the frequencies of words in these documents.
Data: Assignment data
How to Start
Index the files
Read each text file provided and convert each word in the file to lowercase.
Create a list with words from each text file.
Remove stop words from each list to get the final list of words for each text file. List of stop words: stopwords.txt
Build a dictionary for each word, with the KEY being the document ID (file name) and the VALUE being the frequency (the number of times the word appears in that particular file).
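The indexing steps above can be sketched as a map phase followed by a reduce phase. This is a minimal sketch, not a reference solution: the function names (`load_stopwords`, `map_document`, `reduce_counts`, `build_index`) are hypothetical, words are assumed to be separated by whitespace with no punctuation handling, and stopwords.txt is assumed to hold whitespace-separated stop words.

```python
from collections import defaultdict
from pathlib import Path


def load_stopwords(path):
    """Read stop words, assuming they are separated by whitespace or newlines."""
    return set(Path(path).read_text(encoding="utf-8").lower().split())


def map_document(doc_id, text, stopwords):
    """Map phase: emit a (word, doc_id, 1) record for every non-stop word."""
    for word in text.lower().split():
        if word not in stopwords:
            yield word, doc_id, 1


def reduce_counts(records):
    """Reduce phase: sum the emitted counts per word and per document."""
    index = defaultdict(dict)
    for word, doc_id, count in records:
        index[word][doc_id] = index[word].get(doc_id, 0) + count
    return dict(index)


def build_index(folder, stopwords_path):
    """Index every .txt file in `folder` (assumed layout: one document per file)."""
    stopwords = load_stopwords(stopwords_path)
    stop_file = Path(stopwords_path).resolve()
    records = []
    for path in sorted(Path(folder).glob("*.txt")):
        if path.resolve() == stop_file:
            continue  # do not index the stop word list itself
        text = path.read_text(encoding="utf-8")
        records.extend(map_document(path.name, text, stopwords))
    return reduce_counts(records)
```

The result maps each word to a dictionary of `{file name: frequency}`, which matches the KEY/VALUE layout described in the last step.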
Fire a query that:
Takes a word as input.
Creates a list with words from each text file.
Removes stop words.
Scores each document by summing the frequency of the input word in the document.
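One way to read the query steps is as a lookup against an index that was already built from the stop-word-filtered word lists. The sketch below assumes such an index (word mapped to `{file name: frequency}`); the function name `search` and the file names in the sample data are hypothetical.

```python
def search(index, query, stopwords=frozenset()):
    """Return (total frequency, per-document scores) for a single query word."""
    word = query.strip().lower()
    if not word or word in stopwords:
        return 0, {}
    per_doc = index.get(word, {})  # score of a document = frequency of the word in it
    return sum(per_doc.values()), dict(per_doc)


# Hypothetical index with made-up file names, for illustration only.
sample_index = {"texas": {"a.txt": 2, "b.txt": 1}}
total, per_doc = search(sample_index, "Texas")
print(total, per_doc)  # 3 {'a.txt': 2, 'b.txt': 1}
```

Lowercasing the query keeps it consistent with the lowercase conversion applied during indexing.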
Deliverables
Your deliverable for this assignment should be a Python file named "search.ipynb"
An Example
Indexing example
txt: A quick brown fox jumps over the lazy dog
txt: Austin is the capital of Texas Houston is not the capital of Texas
txt: I am going home I shall eat after I reach home
txt: I am going to the University of Texas Austin
Step : Lowercase conversion
Output
txt: a quick brown fox jumps over the lazy dog
txt: austin is the capital of texas houston is not the capital of texas
txt: i am going home i shall eat after i reach home
txt: i am going to the university of texas austin
Step : Splitting the text into a list of words
Output
txt: a, quick, brown, fox, jumps, over, the, lazy, dog
txt: austin, is, the, capital, of, texas, houston, is, not, the, capital, of, texas
txt: i, am, going, home, i, shall, eat, after, i, reach, home
txt: i, am, going, to, the, university, of, texas, austin
Step : Remove stopwords
Output
txt: quick, brown, fox, jumps, lazy, dog
txt: austin, capital, texas, houston, capital, texas
txt: going, home, shall, eat, reach, home
txt: going, university, texas, austin
Step : Putting them in the dict (indexing)
Output:
Key
Value (the documents the word appeared in and the number of times in each)
quick
brown
fox
jumps
lazy
dog
austin
capital
texas
houston
going
home
shall
eat
reach
university
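The index for the four example documents can be reproduced with a short script. This is a sketch under two loudly stated assumptions: the document names (`doc_a.txt` to `doc_d.txt`) are placeholders, since the original file names are not shown above, and the stop word set is inferred from the words missing from the key list, so the real stopwords.txt may differ.

```python
from collections import Counter

# Placeholder document names; the original file names are not shown in the example.
docs = {
    "doc_a.txt": "A quick brown fox jumps over the lazy dog",
    "doc_b.txt": "Austin is the capital of Texas Houston is not the capital of Texas",
    "doc_c.txt": "I am going home I shall eat after I reach home",
    "doc_d.txt": "I am going to the University of Texas Austin",
}

# Assumed stop words, inferred from the example; the real stopwords.txt may differ.
stopwords = {"a", "over", "the", "is", "of", "not", "i", "am", "after", "to"}

index = {}
for doc_id, text in docs.items():
    # Lowercase, split on whitespace, drop stop words, then count per document.
    counts = Counter(w for w in text.lower().split() if w not in stopwords)
    for word, n in counts.items():
        index.setdefault(word, {})[doc_id] = n

for word, postings in index.items():
    print(word, postings)
```

Under these assumptions the keys come out in the same order as the list above, and `austin` maps to one occurrence in each of two documents.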
Fire the Query
if a user inputs the word: "Austin"
the program should output: "The word Austin is repeated two times,
time in doc
time in doc
The output can be as fancy as you want it to be as long as it does not deviate from the goal
Make sure that the user can only input one word at a time. The program should only stop if the user inputs the letter q
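The one-word-at-a-time rule and the q exit condition can be sketched as a small input loop. The function name `run_query_loop`, the prompt text, and the exact output wording are assumptions; the `read` and `write` parameters exist only so the loop can be exercised without a keyboard.

```python
def run_query_loop(index, read=input, write=print):
    """Prompt for single words until the user enters q."""
    while True:
        raw = read("Enter one word (q to quit): ").strip()
        if raw.lower() == "q":
            break
        if len(raw.split()) != 1:
            # Rejects empty input and anything containing more than one word.
            write("Please enter exactly one word.")
            continue
        postings = index.get(raw.lower(), {})
        total = sum(postings.values())
        write(f"The word {raw} is repeated {total} time(s).")
        for doc_id, n in postings.items():
            write(f"  {n} time(s) in {doc_id}")


# Typical use, with an index built earlier:
# run_query_loop(index)
```

A side effect of this design is that the word "q" itself cannot be searched, which follows from the stated exit rule.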
