Question:
project for this semester is to simulate a search engine over a collection ("corpus") of
documents. This project will be divided into three phases. The requirements described here are
for Phase 1.
Phase 1 is broken down into several tasks, which will be used in subsequent phases as well. Each
task has to be uploaded to the Blackboard system by the syllabus deadline, in the corresponding
assignment content folder for Phase 1:
IR.P1.Task#1)
You will need to maintain your own corpus of documents for the semester. To do
so, come up with 10 neutral (i.e., non-controversial) queries (for example: Who was the 16th U.S.
President?) that you will submit to a search engine of your choice. Upload these ten queries to
the Blackboard system for
IR.P1.Task#1.
IR.P1.Task#2)
You are then to download the first 20 (non-controversial) web page (HTML)
responses that the search engine returns for each of the 10 queries (this is done manually;
you have to download them one by one; fortunately, you only have to do this once).
There will be a
total of 200 html files. (We will be discussing shortly in class how to process these using the Java
Regex package. You may NOT use 3rd party code. You MUST write your own. You do not
necessarily need regex, but it does provide much more concise code.) Place all 200 html files in a
directory named Corpus and compress/zip the entire directory. Upload this to Blackboard system
for
IR.P1.Task#2.
YOU SHOULD ASSUME (IN GENERAL) THAT NO CREDIT WILL BE
GIVEN FOR SHARED FILES OR LINKS TO FILES. HOWEVER, FOR THIS TASK, if the
Blackboard system limits do not allow you to upload this compressed file, then you can store it
on a cloud and upload to Blackboard system a secure link to that file.
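Once the corpus is downloaded, each HTML file eventually has to be turned into a sequence of words. The following is a minimal sketch of one way to do that with java.util.regex only, not the method the course will present. The class name HtmlTokenizerSketch, the method tokenize, and the choice to treat runs of letters and digits as words are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: strips markup with regular expressions and returns lowercase tokens.
class HtmlTokenizerSketch {
    // Whole <script>...</script> and <style>...</style> blocks (case-insensitive, across lines).
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag, including comments and doctype declarations.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    // Assumption: a "word" is a run of ASCII letters and digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    static List<String> tokenize(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase());
        }
        return tokens;
    }
}
```

This sketch does not decode HTML entities (for example, &amp; would yield the token "amp"), so a real solution would likely need an extra step for those.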
IR.P1.Task#3)
Identify a Stoplist (either download one or compute it in separate code of your own)
and store it in a hash structure. Program code is needed to store the stopword list in a
hash structure and to output your hash structure to an output text file. Upload the .java
files necessary to accomplish this to the Blackboard system for
IR.P1.Task#3.
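As a rough illustration of the two requirements in this task (load the stoplist into a hash structure, then write that structure to a text file), here is a minimal sketch. It assumes the stoplist file holds one word per line and uses a HashSet; the class name StoplistSketch and the methods load and dump are hypothetical, and the course may expect a HashMap or Hashtable instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for loading and dumping a stopword list.
class StoplistSketch {
    // Assumption: one stopword per line. Words are trimmed and lowercased; blank lines are skipped.
    static Set<String> load(Path stoplistFile) throws IOException {
        Set<String> stopwords = new HashSet<>();
        for (String line : Files.readAllLines(stoplistFile)) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    // Writes one word per line, sorted so the output file is easy to inspect.
    static void dump(Set<String> stopwords, Path outputFile) throws IOException {
        Files.write(outputFile, new TreeSet<>(stopwords));
    }

    // Usage (argument order is an assumption): java StoplistSketch stoplist.txt output.txt
    public static void main(String[] args) throws IOException {
        dump(load(Paths.get(args[0])), Paths.get(args[1]));
    }
}
```

A HashSet gives constant-time membership checks on average, which is what the indexing step needs when it decides whether to skip a word.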
IR.P1.Task#4)
In Java code, compute an Inverted Index that collectively stores information for the files that are
part of the corpus. See the following links for an explanation of what an Inverted Index is (and
what it is not, such as a forward index):
https://www.geeksforgeeks.org/inverted-index/
https://www.geeksforgeeks.org/difference-inverted-index-forward-index/
You are to use either Java hashmaps or hashtables for storing the inverted index of your corpus.
(A separate email will provide tutorial links for hashtables.) What information should you store in
the inverted index for each significant (i.e., non-stopword) word found in one of your documents? a) the
word; b) the name of the document it was found in; c) a vector specifying, for each occurrence of the word
in a document, how many words from the beginning of the document it was found (for this count,
include even the stopwords). You need to do this for every word in every document that is not a
stopword. Upload this Java code to the Blackboard system for
IR.P1.Task#4.
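The three pieces of information listed for Task#4 (word, document name, vector of positions) map naturally onto nested hash maps. The sketch below is one possible layout, not a required design: the class name InvertedIndexSketch, its method names, the zero-based position count, and the lowercasing of words are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical layout: word -> (document name -> positions of that word in the document).
class InvertedIndexSketch {
    private final Map<String, Map<String, List<Integer>>> index = new HashMap<>();

    // tokens must be the document's words in order, stopwords included.
    // Positions count every token (assumption: starting at 0), but only
    // non-stopwords are stored in the index.
    void addDocument(String docName, List<String> tokens, Set<String> stopwords) {
        for (int position = 0; position < tokens.size(); position++) {
            String word = tokens.get(position).toLowerCase();
            if (stopwords.contains(word)) {
                continue;
            }
            index.computeIfAbsent(word, w -> new HashMap<>())
                 .computeIfAbsent(docName, d -> new ArrayList<>())
                 .add(position);
        }
    }

    // Returns document -> positions for the word, or an empty map if the word is not indexed.
    Map<String, List<Integer>> postings(String word) {
        return index.getOrDefault(word.toLowerCase(),
                Collections.<String, List<Integer>>emptyMap());
    }
}
```

For the tokens "the cat and the cat" with stopwords "the" and "and", this sketch would store "cat" at positions 1 and 4, because the stopwords still advance the position counter.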
IR.P1.Task#5)
The code for each phase has to be compiled using javac (the JDK compiler) and
executed using the java command (the JDK runtime environment). Important values such as file names will
be provided on the command line of the java command using "flags." Details about the usage of
flags for this phase will be emailed to you separately and are discussed below. Further code issues
will be explained in a separate email on Command Line Parsing. Please note that ALL phases of
this project will be run from the command line only. Upload the Java Code that processes the
project command line and its flags to the Blackboard system for
IR.P1.Task#5.
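The exact flag specification is to be sent separately, so the following is only a sketch under an assumption: that every argument has the form -name=value, as in the -output=OutputFileName example used in this assignment. The class name FlagParserSketch and the method parse are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser for arguments of the assumed form -name=value.
class FlagParserSketch {
    // Splits each argument at the first '=' and stores name -> value (name without the '-').
    // Anything that lacks a leading '-', a name, or an '=' is rejected.
    static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("-") || eq < 2) {
                throw new IllegalArgumentException("Expected -name=value but got: " + arg);
            }
            flags.put(arg.substring(1, eq), arg.substring(eq + 1));
        }
        return flags;
    }

    // Example: java FlagParserSketch -output=out.txt -SEARCH=lincoln
    public static void main(String[] args) {
        System.out.println(parse(args));
    }
}
```

Keeping the flags in a map lets later tasks look up only the names they care about, whatever order the user types them in.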
IR.P1.Task#6)
You will need to demonstrate the ability to "query" your inverted index for such
information as a) does a specific word appear in any document? b) how many documents (and
which) does a given word appear in? c) how many times (frequency) does a word appear in a
given document? The project will need to create and use a -SEARCH flag in conjunction with
an output flag indicating which file the output should go to: -output=OutputFileName
-SEARCH=WORD
-- would search the Inverted Index for the given WORD and return
which documents the word appears in and how many times it appears in each
document.
-SEARCH=DOC
-- would search the Inverted Index for the given document and return all words
found in that DOC, with how many times each appears in that document.
NOTE:
For these commands, you may also need to pass other parameters via the command line
using appropriately named flags.
Upload the Java code files that implement these functions to Blackboard system for
IR.P1.Task#6.
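The three queries and the two -SEARCH modes can all be answered from a nested map of word to document to positions. The sketch below assumes that layout; the class name IndexQuerySketch and its method names are hypothetical, and words are assumed to have been lowercased when the index was built.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical queries over an index shaped as: word -> (document name -> positions).
class IndexQuerySketch {
    // Queries a) and b): the documents containing the word (empty set if none).
    static Set<String> documentsContaining(Map<String, Map<String, List<Integer>>> index, String word) {
        return index.getOrDefault(word.toLowerCase(),
                Collections.<String, List<Integer>>emptyMap()).keySet();
    }

    // Query c): how many times the word occurs in one document.
    static int frequency(Map<String, Map<String, List<Integer>>> index, String word, String doc) {
        return index.getOrDefault(word.toLowerCase(),
                        Collections.<String, List<Integer>>emptyMap())
                .getOrDefault(doc, Collections.<Integer>emptyList())
                .size();
    }

    // Document search: every indexed word in the document with its count, sorted by word.
    static Map<String, Integer> wordsInDocument(Map<String, Map<String, List<Integer>>> index, String doc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Map<String, List<Integer>>> entry : index.entrySet()) {
            List<Integer> positions = entry.getValue().get(doc);
            if (positions != null) {
                counts.put(entry.getKey(), positions.size());
            }
        }
        return counts;
    }
}
```

Note that the document search has to scan every word in the index, since an inverted index is keyed by word rather than by document.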
IR.P1.Task#7)
This task is predicated on Task#6 being completed. Demonstrate one example of
a word search and one example of a doc search. You will upload three files to the Blackboard
system (either individually or in one compressed/zipped file, BUT ALL THREE) for
IR.P1.Task#7.
The
first file is Searches.txt, describing these searches and the actual commands the user needs on the
command line to run your project and achieve these searches. In addition, upload the two output
files corresponding to the two searches. (Use different names for each output file.)
IR.P1.Task#8)
The system should be able to print out the inverted index or other relevant
information pertaining to a given document. The project will need to create and use a -PRINT
flag in conjunction with an output flag indicating which file the output should go to:
-output=OutputFileName
-PRINT_INDEX=WORD
-- would print all the information contained in the Inverted Index for
the given WORD into the output file. The exact format is left up to you, but it must contain all of
the information.
-PRINT_INDEX=DOC
-- would print all the information contained in the Inverted Index for
the given DOC into the output file. The exact format is left up to you, but it must contain all of
the information.
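Since the output format is left open, the following is just one possible shape for the WORD case: the word on the first line, then one indented line per document with its position list. It assumes the index is a nested map of word to document to positions; the class name PrintIndexSketch and the method printWord are hypothetical.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical printer for one word's entry in an index shaped as:
// word -> (document name -> positions).
class PrintIndexSketch {
    static void printWord(Map<String, Map<String, List<Integer>>> index, String word, Path outputFile)
            throws IOException {
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(outputFile))) {
            Map<String, List<Integer>> postings = index.get(word);
            if (postings == null) {
                out.println(word + ": not in index");
                return;
            }
            out.println(word);
            // Sort by document name so repeated runs produce identical files.
            for (Map.Entry<String, List<Integer>> entry : new TreeMap<>(postings).entrySet()) {
                out.println("  " + entry.getKey() + " " + entry.getValue());
            }
        }
    }
}
```

The DOC case could follow the same pattern by scanning all words and printing only the entries whose inner map contains the requested document.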