Question: The project for this semester is to simulate a search engine over a collection ("corpus") of documents. This project will be divided into three phases. The requirements described here are for Phase 1.

Phase 1 is broken down into various tasks, which will be used in subsequent phases as well. Each task has to be uploaded to the Blackboard system by the syllabus deadline, in the corresponding assignment content folder for Phase 1:

IR.P1.Task#1) You will need to maintain your own corpus of documents for the semester. To do so, come up with 10 neutral (i.e., non-controversial) queries (for example: Who was the 16th U.S. President?) that you will submit to a search engine of your choice. Upload these ten queries to the Blackboard system for IR.P1.Task#1.

IR.P1.Task#2) You are then to download the first 20 (non-controversial) webpage (html) responses that the search engine returns for each of the 10 queries (this is done manually; you have to download them one by one; fortunately, you only have to do this once). There will be a total of 200 html files. (We will be discussing shortly in class how to process these using the Java Regex package. You may NOT use 3rd party code. You MUST write your own. You do not necessarily need regex, but it does provide much more concise code.) Place all 200 html files in a directory named Corpus and compress/zip the entire directory. Upload this to the Blackboard system for IR.P1.Task#2. YOU SHOULD ASSUME (IN GENERAL) THAT NO CREDIT WILL BE GIVEN FOR SHARED FILES OR LINKS TO FILES. HOWEVER, FOR THIS TASK, if the Blackboard system limits do not allow you to upload this compressed file, then you can store it on a cloud and upload to the Blackboard system a secure link to that file.

IR.P1.Task#3) Identify a stoplist (either download one or compute it on your own in separate code) and store it in a hash structure. Program code is needed for storing the stopword list in a hash structure and for writing that hash structure to an output text file. Upload the .java files necessary to accomplish this to the Blackboard system for IR.P1.Task#3.
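One way the storage and output steps of Task#3 might look is sketched below. It is only an illustration under stated assumptions: the stoplist file is assumed to hold one word per line, `HashSet` is assumed to be an acceptable "hash structure," and the class and method names (`StoplistSketch`, `load`, `dump`) are made up for this example. Only the standard JDK is used.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: load a stoplist into a hash structure and write it back out.
public class StoplistSketch {

    // Assumes one stopword per line; blank lines are skipped, words are lowercased.
    static Set<String> load(Path stoplistFile) throws IOException {
        Set<String> stopwords = new HashSet<>();
        for (String line : Files.readAllLines(stoplistFile, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    // Writes the set to a text file, one word per line, sorted so the output is stable.
    static void dump(Set<String> stopwords, Path outputFile) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8))) {
            for (String word : new TreeSet<>(stopwords)) {
                out.println(word);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java StoplistSketch <stoplistFile> <outputFile>");
            return;
        }
        Set<String> stopwords = load(Paths.get(args[0]));
        dump(stopwords, Paths.get(args[1]));
    }
}
```

The positional arguments in `main` are a placeholder; the real project is expected to receive file names through flags.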

IR.P1.Task#4) In Java code, compute an Inverted Index collectively storing info for the files that are part of the corpus. See the following links for an explanation of what an Inverted Index is (and what it is not, such as a forward index):

https://www.geeksforgeeks.org/inverted-index/

https://www.geeksforgeeks.org/difference-inverted-index-forward-index/

You are to use either Java hashmaps or hashtables for storing the inverted index of your corpus. (A separate email will provide tutorial links for hashtables.) What information should you store in the inverted index for each significant (i.e., non-stopword) word found in one of your documents? a) the word; b) the name of the document it was found in; c) a vector specifying, for each occurrence of the word in a document, how many words from the beginning of the document it was found (for this count, include even the stopwords). You need to do this for every word in every document that is not a stopword. Upload this Java code to the Blackboard system for IR.P1.Task#4.
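The three pieces of information listed for Task#4 (word, document name, positions) map naturally onto nested hash maps. The sketch below is a minimal illustration, not the required design. It assumes positions are zero-based (the number of words preceding the occurrence), that a word is a run of ASCII letters or digits, and that a crude `<[^>]*>` regex is enough to drop tags. It does not handle script or style contents or HTML entities. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: word -> (document name -> positions of each occurrence).
public class InvertedIndexSketch {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    final Map<String, Map<String, List<Integer>>> index = new HashMap<>();

    void addDocument(String docName, String html, Set<String> stopwords) {
        // Replace each tag with a space so words on either side stay separated.
        String text = TAG.matcher(html).replaceAll(" ");
        Matcher m = WORD.matcher(text);
        int position = 0; // counts every word, stopwords included
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT);
            if (!stopwords.contains(word)) {
                index.computeIfAbsent(word, w -> new HashMap<>())
                     .computeIfAbsent(docName, d -> new ArrayList<>())
                     .add(position);
            }
            position++;
        }
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "and"));
        InvertedIndexSketch sketch = new InvertedIndexSketch();
        sketch.addDocument("a.html", "<p>The cat and the hat</p>", stopwords);
        System.out.println(sketch.index.get("cat")); // {a.html=[1]}
        System.out.println(sketch.index.get("hat")); // {a.html=[4]}
    }
}
```

Because the position counter advances for stopwords too, "hat" is recorded at position 4 even though only two non-stopwords appear in the sample text.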

IR.P1.Task#5) The code for each phase has to be compiled using javac (the JDK compiler) and executed using the java command (the JDK runtime environment). Important names of files etc. will be provided on the command line of the java command using "flags." Details about the usage of flags for this phase will be emailed to you separately and are discussed below. Further code issues will be explained in a separate email on Command Line Parsing. Please note that ALL phases of this project will be run from the command line only. Upload the Java code that processes the project command line and its flags to the Blackboard system for IR.P1.Task#5.
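Since the exact flag details are to arrive by separate email, the following is only a guess at the shape of the parsing code. It assumes every argument has the form `-NAME=VALUE`, as in the `-output=OutputFileName` and `-SEARCH=WORD` examples given in later tasks. The class name `FlagParserSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: turn arguments of the assumed form -NAME=VALUE into a map.
public class FlagParserSketch {

    static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // Require a leading '-', at least one name character, and an '='.
            if (!arg.startsWith("-") || eq < 2) {
                throw new IllegalArgumentException("Expected -NAME=VALUE but got: " + arg);
            }
            flags.put(arg.substring(1, eq), arg.substring(eq + 1));
        }
        return flags;
    }

    public static void main(String[] args) {
        Map<String, String> flags = parse(args);
        for (Map.Entry<String, String> e : flags.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Keeping the flags in a map lets later tasks look up `output` or `SEARCH` by name without depending on argument order.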

IR.P1.Task#6) You will need to demonstrate the ability to "query" your inverted index for such information as a) does a specific word appear in any document? b) how many documents (and which) does a given word appear in? c) how many times (frequency) does a word appear in a given document? The project will need to create and utilize a -SEARCH flag in conjunction with an output flag indicating which file the output should go to: -output=OutputFileName

-SEARCH=WORD -- would search the Inverted Index for the given WORD and return which documents the word appears in and specifically how many times it appears in each document.

-SEARCH=DOC -- would search the Inverted Index for the given document and return all words found in that DOC, with specifically how many times each appears in that document.

NOTE: For these commands, you may also need to pass other parameters via the command line using appropriately named flags.

Upload the Java code files that implement these functions to the Blackboard system for IR.P1.Task#6.
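Queries a) through c) and the two -SEARCH variants could be answered along the following lines. This sketch assumes the index is a nested map of word to document name to position list. That layout is one possible choice, not something the assignment mandates, and all names here are hypothetical. A frequency is simply the length of a position list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: queries over an index of word -> (document -> positions).
public class IndexQuerySketch {

    // a) Does the word appear in any document?
    static boolean appears(Map<String, Map<String, List<Integer>>> index, String word) {
        return index.containsKey(word);
    }

    // b) and -SEARCH=WORD: document name -> number of occurrences of the word.
    static Map<String, Integer> documentsFor(
            Map<String, Map<String, List<Integer>>> index, String word) {
        Map<String, Integer> counts = new TreeMap<>();
        Map<String, List<Integer>> postings = index.get(word);
        if (postings != null) {
            for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
                counts.put(e.getKey(), e.getValue().size());
            }
        }
        return counts;
    }

    // c) How many times does the word appear in the given document?
    static int frequency(
            Map<String, Map<String, List<Integer>>> index, String word, String doc) {
        Map<String, List<Integer>> postings = index.get(word);
        if (postings == null || !postings.containsKey(doc)) {
            return 0;
        }
        return postings.get(doc).size();
    }

    // -SEARCH=DOC: word -> number of occurrences in the given document.
    static Map<String, Integer> wordsIn(
            Map<String, Map<String, List<Integer>>> index, String doc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Map<String, List<Integer>>> e : index.entrySet()) {
            List<Integer> positions = e.getValue().get(doc);
            if (positions != null) {
                counts.put(e.getKey(), positions.size());
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> index = new HashMap<>();
        index.computeIfAbsent("cat", w -> new HashMap<>()).put("a.html", Arrays.asList(1, 7));
        index.computeIfAbsent("cat", w -> new HashMap<>()).put("b.html", Arrays.asList(0));
        index.computeIfAbsent("hat", w -> new HashMap<>()).put("a.html", Arrays.asList(4));
        System.out.println(appears(index, "cat"));             // true
        System.out.println(documentsFor(index, "cat"));        // {a.html=2, b.html=1}
        System.out.println(frequency(index, "hat", "b.html")); // 0
        System.out.println(wordsIn(index, "a.html"));          // {cat=2, hat=1}
    }
}
```

A document search has to scan every word because the index is keyed by word. That is the usual trade-off of an inverted index compared with a forward index.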

IR.P1.Task#7) This task is predicated on Task#6 being completed. Demonstrate one example of a word search and one example of a doc search. You will upload three files to the Blackboard system (either individually or in one compressed/zipped file, but ALL three) for IR.P1.Task#7. The first file is Searches.txt, describing these searches and the actual commands the user needs on the command line to run your project and achieve these searches. In addition, upload the two output files corresponding to the two searches. (Use different names for each output file.)

IR.P1.Task#8) The system should be able to print out the inverted index or other relevant information pertaining to a given document. The project will need to create and utilize a -PRINT flag in conjunction with an output flag indicating which file the output should go to: -output=OutputFileName

-PRINT_INDEX=WORD -- would print all the information contained in the Inverted Index for the given WORD into the output file. The exact format is left up to you, but it must contain all of the information.

-PRINT_INDEX=DOC -- would print all the information contained in the Inverted Index for the given DOC into the output file. The exact format is left up to you, but it must contain all of the information.
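Because the output format is left open, the sketch below picks one arbitrary layout: a tab-separated line of word, document name, and position list per entry. It again assumes a nested map of word to document name to positions. The class and method names are hypothetical. In the real project the `PrintWriter` would presumably be opened on the file named by the output flag; here `main` writes to standard output to stay self-contained.

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: dump index entries for one word or one document.
public class PrintIndexSketch {

    // Every (document, positions) entry stored for the given word.
    static void printWord(Map<String, Map<String, List<Integer>>> index,
                          String word, PrintWriter out) {
        Map<String, List<Integer>> postings = index.get(word);
        if (postings == null) {
            out.println(word + "\t(not in index)");
            return;
        }
        Map<String, List<Integer>> sorted = new TreeMap<>(postings);
        for (Map.Entry<String, List<Integer>> e : sorted.entrySet()) {
            out.println(word + "\t" + e.getKey() + "\t" + e.getValue());
        }
    }

    // Every word that occurs in the given document, with its positions there.
    static void printDoc(Map<String, Map<String, List<Integer>>> index,
                         String doc, PrintWriter out) {
        Map<String, Map<String, List<Integer>>> sorted = new TreeMap<>(index);
        for (Map.Entry<String, Map<String, List<Integer>>> e : sorted.entrySet()) {
            List<Integer> positions = e.getValue().get(doc);
            if (positions != null) {
                out.println(e.getKey() + "\t" + doc + "\t" + positions);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> index = new HashMap<>();
        index.computeIfAbsent("cat", w -> new HashMap<>()).put("a.html", Arrays.asList(1, 7));
        index.computeIfAbsent("cat", w -> new HashMap<>()).put("b.html", Arrays.asList(0));
        index.computeIfAbsent("hat", w -> new HashMap<>()).put("a.html", Arrays.asList(4));
        PrintWriter out = new PrintWriter(System.out, true);
        printWord(index, "cat", out);
        printDoc(index, "a.html", out);
        out.flush();
    }
}
```

Sorting through a `TreeMap` is optional; it only makes the output files reproducible from run to run.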




