Question: The project for this semester is to simulate a search engine over a collection ("corpus") of documents. This project will be divided into three phases. The requirements described here are for Phase 1.

Phase 1 is broken down into tasks whose results will also be used in subsequent phases. Each task has to be uploaded to the Blackboard system by the syllabus deadline, in the corresponding assignment content folder for Phase 1:

IR.P1.Task#1)

You will need to maintain your own corpus of documents for the semester. To do so, come up with 10 neutral (i.e., non-controversial) queries (for example: Who was the 16th U.S. President?) that you will submit to a search engine of your choice. Upload these ten queries to the Blackboard system for IR.P1.Task#1.

IR.P1.Task#2)

You are then to download the first 20 (non-controversial) webpage (HTML) responses that the search engine returns for each of the 10 queries. (This is done manually; you have to download the pages one by one. Fortunately, you only have to do this once.) There will be a total of 200 HTML files. (We will discuss shortly in class how to process these using the Java Regex package. You may NOT use third-party code. You MUST write your own. You do not necessarily need regex, but it does provide much more concise code.) Place all 200 HTML files in a directory named Corpus and compress/zip the entire directory. Upload this to the Blackboard system for IR.P1.Task#2.

YOU SHOULD ASSUME (IN GENERAL) THAT NO CREDIT WILL BE GIVEN FOR SHARED FILES OR LINKS TO FILES. HOWEVER, FOR THIS TASK, if the Blackboard system limits do not allow you to upload this compressed file, then you can store it in the cloud and upload a secure link to that file to the Blackboard system.

IR.P1.Task#3)

Identify a stoplist (either download one or compute it in separate code of your own) and store it in a hash structure. Program code is needed to store the stopword list in a hash structure and to write that hash structure to an output text file. Upload the .java files necessary to accomplish this to the Blackboard system for IR.P1.Task#3.
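As a minimal sketch of the stoplist task, the class below keeps the stopwords in a HashSet and can write them back out to a text file. The class name StopList, the one-word-per-line input format, and the file names passed on the command line are assumptions for illustration; the assignment does not prescribe them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class name; the assignment does not prescribe one.
class StopList {
    // HashSet is backed by a hash table, so lookups take constant time on average.
    private final Set<String> words = new HashSet<>();

    // Adds one word, trimmed and lower-cased; blank entries are ignored.
    public void add(String word) {
        String w = word.trim().toLowerCase(Locale.ROOT);
        if (!w.isEmpty()) {
            words.add(w);
        }
    }

    // Case-insensitive membership test.
    public boolean contains(String word) {
        return words.contains(word.trim().toLowerCase(Locale.ROOT));
    }

    public int size() {
        return words.size();
    }

    // Loads a stoplist file, assuming one word per line.
    public void load(Path file) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                add(line);
            }
        }
    }

    // Writes the hash structure to a text file, sorted so the output is easy to read.
    public void write(Path file) throws IOException {
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file, StandardCharsets.UTF_8))) {
            for (String w : new TreeSet<>(words)) {
                out.println(w);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java StopList <stoplist file> <output file>");
            return;
        }
        StopList stop = new StopList();
        stop.load(Paths.get(args[0]));
        stop.write(Paths.get(args[1]));
    }
}
```

Lower-casing on both insert and lookup means "The" and "the" are treated as the same stopword, which is a design choice rather than a stated requirement.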

IR.P1.Task#4)

In Java code, compute an inverted index that collectively stores information for the files that are part of the corpus. See the following links for an explanation of what an inverted index is (and what it is not, such as a forward index):

https://www.geeksforgeeks.org/inverted-index/

https://www.geeksforgeeks.org/difference-inverted-index-forward-index/

You are to use either Java hashmaps or hashtables for storing the inverted index of your corpus. (A separate email will provide tutorial links for hashtables.) What information should you store in the inverted index for each significant (i.e., non-stopword) word found in one of your documents? a) the word; b) the name of the document it was found in; c) a vector specifying, for each occurrence of the word in a document, how many words from the beginning of the document it was found (for this count, include even the stopwords). You need to do this for every word in every document that is not a stopword. Upload this Java code to the Blackboard system for IR.P1.Task#4.
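One possible shape for such an index is a nested hash map: word -> document name -> list of positions. The sketch below is an illustration under stated assumptions, not the required design: the class name InvertedIndex, the zero-based position count, the definition of a word as a run of ASCII letters or digits, and the crude regex that treats anything between '<' and '>' as markup are all choices made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; the assignment does not prescribe one.
class InvertedIndex {
    // Assumption: a "word" is a run of ASCII letters or digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // word -> (document name -> positions of the word in that document)
    private final Map<String, Map<String, List<Integer>>> index = new HashMap<>();
    private final Set<String> stopWords;

    public InvertedIndex(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Indexes one document. Crude assumption: anything between '<' and '>' is markup.
    // Text inside <script> or <style> elements is NOT removed by this simple regex.
    public void addDocument(String docName, String html) {
        String text = html.replaceAll("<[^>]*>", " ");
        Matcher m = WORD.matcher(text);
        int position = 0; // zero-based; counts every word, stopwords included
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT);
            if (!stopWords.contains(word)) {
                index.computeIfAbsent(word, k -> new HashMap<>())
                     .computeIfAbsent(docName, k -> new ArrayList<>())
                     .add(position);
            }
            position++;
        }
    }

    // Returns document -> positions for a word, or an empty map if the word is not indexed.
    public Map<String, List<Integer>> postings(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "was", "of"));
        InvertedIndex idx = new InvertedIndex(stop);
        idx.addDocument("lincoln.html",
                "<html><body>Lincoln was the 16th president of the USA</body></html>");
        System.out.println(idx.postings("president")); // prints {lincoln.html=[4]}
    }
}
```

In the example, "president" is the fifth word of the text, so its zero-based offset is 4 even though two of the preceding words are stopwords.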

IR.P1.Task#5)

The code for each phase has to be compiled using javac (the JDK compiler) and executed using the java command (the JDK runtime environment). Important names of files etc. will be provided on the command line of the java command using "flags." Details about the usage of flags for this phase will be emailed to you separately and discussed below. Further code issues will be explained in a separate email on command-line parsing. Please note that ALL phases of this project will be run from the command line only. Upload the Java code that processes the project command line and its flags to the Blackboard system for IR.P1.Task#5.
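Since the exact flag details are to be sent separately, the following is only a sketch of one way to read flags of the -NAME=VALUE shape that this assignment uses elsewhere (for example -output=OutputFileName). The class name FlagParser, the choice to upper-case flag names, and the decision to reject any argument that does not match the shape are assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical class name; the assignment does not prescribe one.
class FlagParser {
    // Parses arguments of the assumed form -NAME=VALUE into a map keyed by upper-cased NAME.
    public static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // eq < 2 covers both a missing '=' and an empty flag name such as "-=x".
            if (!arg.startsWith("-") || eq < 2) {
                throw new IllegalArgumentException("Expected -NAME=VALUE but got: " + arg);
            }
            String name = arg.substring(1, eq).toUpperCase(Locale.ROOT);
            String value = arg.substring(eq + 1);
            flags.put(name, value);
        }
        return flags;
    }

    public static void main(String[] args) {
        // Example run: java FlagParser -output=out.txt -SEARCH=WORD
        Map<String, String> flags = parse(args);
        System.out.println("output file: " + flags.get("OUTPUT"));
        System.out.println("search mode: " + flags.get("SEARCH"));
    }
}
```

Upper-casing the names lets -output and -OUTPUT be treated alike; if the emailed details say flags are case-sensitive, that normalization should be dropped.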

IR.P1.Task#6)

You will need to demonstrate the ability to "query" your inverted index for such information as: a) does a specific word appear in any document? b) how many documents (and which) does a given word appear in? c) how many times (frequency) does a word appear in a given document? The project will need to create and utilize a -SEARCH flag in conjunction with an output flag indicating which file the output should go to: -output=OutputFileName

-SEARCH=WORD -- would search the inverted index for the given WORD and return which documents the word appears in and, specifically, how many times it appears in each document.

-SEARCH=DOC -- would search the inverted index for the given document and return all words found in that DOC and, specifically, how many times each appears in that document.

NOTE: For these commands, you may also need to pass other parameters via the command line using appropriately named flags.

Upload the Java code files that implement these functions to the Blackboard system for IR.P1.Task#6.
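The three questions and the two search modes can be sketched as lookups against a nested map of the form word -> document name -> positions. That map layout, the class name IndexQueries, and the method names are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical class name; the assignment does not prescribe one.
class IndexQueries {
    // Assumed layout: word -> (document name -> positions of the word in that document)
    private final Map<String, Map<String, List<Integer>>> index;

    public IndexQueries(Map<String, Map<String, List<Integer>>> index) {
        this.index = index;
    }

    // a) Does the word appear in any document?
    public boolean appears(String word) {
        return index.containsKey(word);
    }

    // b) Which documents contain the word? The size of the set is the document count.
    public Set<String> documentsContaining(String word) {
        return index.getOrDefault(word, Collections.emptyMap()).keySet();
    }

    // c) How many times does the word occur in the given document?
    public int frequency(String word, String doc) {
        return index.getOrDefault(word, Collections.emptyMap())
                    .getOrDefault(doc, Collections.emptyList())
                    .size();
    }

    // DOC search: every indexed word found in the document, with its frequency.
    // This scans the whole index because the index is keyed by word, not by document.
    public Map<String, Integer> wordsIn(String doc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Map<String, List<Integer>>> e : index.entrySet()) {
            List<Integer> positions = e.getValue().get(doc);
            if (positions != null) {
                counts.put(e.getKey(), positions.size());
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> index = new HashMap<>();
        Map<String, List<Integer>> postings = new HashMap<>();
        postings.put("lincoln.html", Arrays.asList(4, 19));
        index.put("president", postings);

        IndexQueries q = new IndexQueries(index);
        System.out.println(q.appears("president"));                   // prints true
        System.out.println(q.frequency("president", "lincoln.html")); // prints 2
        System.out.println(q.wordsIn("lincoln.html"));                // prints {president=2}
    }
}
```

The word search is a single hash lookup, while the document search walks every entry; for a corpus of 200 files that scan is likely acceptable, though a second map keyed by document would avoid it.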

IR.P1.Task#7)

This task is predicated on Task#6 being completed. Demonstrate one example of a word search and one example of a doc search. You will upload three files to the Blackboard system (either individually or in one compressed/zipped file, BUT ALL) for IR.P1.Task#7. The first file is Searches.txt, describing these searches and the actual commands the user needs on the command line to run your project and achieve these searches. In addition, upload the two output files corresponding to the two searches. (Use different names for each output file.)

IR.P1.Task#8)

The system should be able to print out the inverted index or other relevant information pertaining to a given document. The project will need to create and utilize a -PRINT flag in conjunction with an output flag indicating which file the output should go to: -output=OutputFileName

-PRINT_INDEX=WORD -- would print all the information contained in the inverted index for the given WORD into the output file. The exact format is left up to you, but it must contain all of the information.

-PRINT_INDEX=DOC -- would print all the information contained in the inverted index for the given DOC into the output file. The exact format is left up to you, but it must contain all of the information.
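Because the output format is left open, the sketch below shows just one possible layout for the WORD case: a header line followed by one line per document with its count and positions. The class name IndexPrinter, the method name formatWord, and the layout itself are assumptions.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name; the assignment does not prescribe one.
class IndexPrinter {
    // Formats everything stored for one word: the word, then one line per document
    // with the occurrence count and the list of positions.
    public static String formatWord(String word, Map<String, List<Integer>> postings) {
        StringBuilder sb = new StringBuilder();
        sb.append("WORD: ").append(word).append('\n');
        // TreeMap sorts the documents by name so the output is stable between runs.
        for (Map.Entry<String, List<Integer>> e : new TreeMap<>(postings).entrySet()) {
            sb.append("  ").append(e.getKey())
              .append(" count=").append(e.getValue().size())
              .append(" positions=").append(e.getValue())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java IndexPrinter <output file>");
            return;
        }
        // Hand-built postings for illustration only.
        Map<String, List<Integer>> postings = new HashMap<>();
        postings.put("lincoln.html", Arrays.asList(4, 19));
        try (PrintWriter out = new PrintWriter(args[0], "UTF-8")) {
            out.print(formatWord("president", postings));
        }
    }
}
```

The DOC case could reuse the same layout by collecting, for one document, each word together with its position list.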

Step by Step Solution


Below are some suggestions and guidelines for each task. Task 1 (Queries): Generate 10 neutral queries f...
