Question:
Problem Description
Write a Java program that will build an Inverted Index for an Information Retrieval System from a collection of text documents in a given directory. Your program will prompt for two pieces of information: the input directory, and the inverted index output file. When the input directory is received, your program will read the files contained in that directory and build the inverted index. Your program should then write the inverted index to the output file. You must submit your source code. Do not submit any executables or input data documents. This version requires you to write the portion of the program that performs linguistic processing on each token read by the system.
Your program must describe how you address the following issues:
Capitalization
  o Your terms should not have capitalized letters unless the term is an acronym.
Punctuation
  o Your terms should not have punctuation unless it is an integral part of the token (e.g. C++, won't, merry-go-round).
Stemming
  o Your program should attempt some type of stemming.
Stop Words
  o Your program should determine which words are too common to add to the index.
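The four issues above can be sketched as one small normalization routine. This is only a minimal sketch under stated assumptions: the class name TermNormalizer, the short stop list, the "all uppercase means acronym" rule, and the naive suffix-stripping stemmer are all illustrative choices, not something the assignment prescribes. A fuller solution might use a longer stop list and a Porter-style stemmer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; the name, stop list, and suffix rules are illustrative assumptions. */
class TermNormalizer {
    // Small illustrative stop list; a real system would likely use a longer one.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
        "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with"));

    /** Returns the index term for a raw token, or "" if the token should be skipped. */
    static String linguisticProcessing(String token) {
        // Punctuation: trim non-alphanumeric characters from both ends, but keep a
        // trailing '+' so "C++" survives. Internal marks ("won't", "merry-go-round")
        // are left untouched.
        String term = token.replaceAll("^[^\\p{Alnum}]+", "")
                           .replaceAll("[^\\p{Alnum}+]+$", "");
        if (term.isEmpty()) {
            return "";
        }

        // Capitalization: assume a token of two or more uppercase letters is an acronym.
        // Limitation: a word typed in all caps (e.g. "THE") is also kept as-is.
        if (term.matches("[A-Z]{2,}")) {
            return term;
        }
        term = term.toLowerCase(Locale.ROOT);

        // Stop words: very common words are not indexed.
        if (STOP_WORDS.contains(term)) {
            return "";
        }

        // Stemming: naive suffix stripping, not a full Porter stemmer.
        return stem(term);
    }

    /** Strips a few common English suffixes; deliberately crude. */
    static String stem(String w) {
        if (w.length() > 4 && w.endsWith("ies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.length() > 5 && w.endsWith("ing")) {
            return w.substring(0, w.length() - 3);
        }
        if (w.length() > 4 && w.endsWith("ed")) {
            return w.substring(0, w.length() - 2);
        }
        if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }
}
```

With these rules, TermNormalizer.linguisticProcessing("Documents,") yields document, while "The" yields an empty string that the caller can skip. The stemmer will mangle some words (for example, it shortens "speed" to "spe"), which is the usual trade-off of suffix stripping.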
Your program must address each issue. It may not ignore any of these issues.

You must also write the code that adds each term to the index. You must complete the section of the start() method that takes each term and searches for it in the index. If not found, it is added to the index along with its posting list.

You are not required to use the incomplete version and may write your own program. However, your program should adhere to the requirements of this assignment. Your main class must be called HW1. The class creates an instance of another class (IRSystem) and calls its start() method. Modify the program header in that file to complete the requested information (such as name, date, and program description). Your instructor will compile and execute your program expecting HW1 to be the main class. Failure to do so may cause your program to be rejected as a compilation error.

Comments/Format
Points can be deducted from your assignment based on the quality of its presentation. This includes ensuring your documentation is free from grammatical and spelling mistakes. Make sure the version of your program you submit will successfully compile and execute.
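The search-then-add step described above can be sketched as follows. The real InvertedIndex class is not shown in the question, so SketchIndex below is a hypothetical stand-in: the method names find and add come from the skeleton's comments, but their signatures, the record helper, and the TreeMap-of-lists representation are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the assignment's InvertedIndex, whose real API is not shown. */
class SketchIndex {
    // term -> posting list of document IDs, with terms kept in sorted order
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Returns the posting list for the term, or null if the term is not indexed. */
    List<Integer> find(String term) {
        return postings.get(term);
    }

    /** Adds a new term whose posting list holds only docID. */
    void add(String term, int docID) {
        List<Integer> list = new ArrayList<>();
        list.add(docID);
        postings.put(term, list);
    }

    /** The update step: add the term if it is new, otherwise extend its posting list. */
    void record(String term, int docID) {
        List<Integer> list = find(term);
        if (list == null) {
            add(term, docID);              // first time this term is seen
        } else if (!list.contains(docID)) {
            list.add(docID);               // known term, new document
        }                                  // known term, known document: nothing to do
    }
}
```

Inside start(), the body of the if (term.length() >= 1) block would then reduce to a single call such as index.record(term, docID), assuming the real index exposes something equivalent. Because documents are read in increasing docID order, comparing against only the last entry of the posting list would also work and avoids the linear contains scan.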
Source code:
HW1.java:
/**************************************************************************
 *** Name:                                                              ***
 *** Class:       COSC 4315.001                                         ***
 *** Instructor:  Dr. Brown                                             ***
 *** Date:                                                              ***
 *** Description:                                                       ***
 **************************************************************************/
public class HW1 {
    public static void main(String[] args) throws Exception {
        IRSystem ir = new IRSystem();
        ir.start();
    }
}
IRSystem.java:
import java.io.*;
import java.util.*;

class IRSystem {
    private File[] collection;      // documents to index
    private File outfile;           // where the inverted index is written
    private InvertedIndex index;

    public IRSystem() {
        collection = getFiles();
        outfile = getOutfile();
        index = new InvertedIndex();
    }

    // Prompts for a directory name and returns the files it contains.
    public File[] getFiles() {
        File[] files = null;
        try {
            System.out.println();
            System.out.print("Enter name of a directory> ");
            Scanner scan = new Scanner(System.in);
            File dir = new File(scan.nextLine());
            files = dir.listFiles();    // null if the path is not a readable directory
        } catch (Exception e) {
            System.out.println("Caught error in getFiles: " + e.toString());
        }
        return files;
    }

    // Prompts for the name of the file the inverted index is written to.
    public File getOutfile() {
        File f = null;
        try {
            System.out.println();
            System.out.print("Enter name of output file> ");
            Scanner scan = new Scanner(System.in);
            f = new File(scan.nextLine());
        } catch (Exception e) {
            System.out.println("Caught error in getOutfile: " + e.toString());
        }
        return f;
    }

    public String linguisticProcessing(String token) {
        /*****************************************************************
         *** Convert each token into a term to be stored in the index. ***
         *** Describe your process in this comment section.            ***
         *** In particular, describe how you handle                    ***
         ***    capitalization,                                        ***
         ***    stemming,                                              ***
         ***    punctuation (including apostrophes),                   ***
         ***    stop words                                             ***
         *****************************************************************/
        String term = "";

        term = token;   /* I hope it is obvious that more String processing is needed */

        return term;
    }

    public void start() {
        try {
            int docID = 0;
            for (File f : collection) {
                // try-with-resources closes each file's Scanner once the file is read
                try (Scanner sc = new Scanner(f)) {
                    while (sc.hasNextLine()) {
                        StringTokenizer st = new StringTokenizer(sc.nextLine());
                        while (st.hasMoreTokens()) {
                            String term = linguisticProcessing(st.nextToken());

                            // Add term to index if one is returned from linguisticProcessing
                            if (term.length() >= 1) {
                                /*
                                 * Call find method to determine if the term is already in the index.
                                 * If the term is new, call add method to add it to the index along
                                 * with the current document ID (docID).
                                 * Otherwise, get the Posting list for the term.
                                 * Determine if the current document ID is already in the Posting list.
                                 * If not, use the add method to add the documentID to the posting.
                                 */
                            }
                        }
                    }
                }
                docID++;
            }
            index.print(outfile);
        } catch (Exception e) {
            System.out.println("Error in start: " + e.toString());
        }
    }
}
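The skeleton ends by calling index.print(outfile), but the question specifies neither that method nor the output layout. As a rough sketch only, the hypothetical IndexWriter below writes one term per line followed by its document IDs. The class name, the method names, and the "term: id id ..." layout are all assumptions made for illustration.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical writer; the one-term-per-line layout is an assumption, not a required format. */
class IndexWriter {
    /** Builds the text form of the index, e.g. "cat: 0 2" on one line per term. */
    static String format(Map<String, List<Integer>> postings) {
        // Copying into a TreeMap gives natural String ordering of terms
        // regardless of which Map implementation was passed in.
        Map<String, List<Integer>> sorted = new TreeMap<>(postings);
        StringBuilder text = new StringBuilder();
        for (Map.Entry<String, List<Integer>> entry : sorted.entrySet()) {
            text.append(entry.getKey()).append(':');
            for (int docID : entry.getValue()) {
                text.append(' ').append(docID);
            }
            text.append('\n');
        }
        return text.toString();
    }

    /** Writes the formatted index to outfile, replacing any existing content. */
    static void write(Map<String, List<Integer>> postings, File outfile)
            throws FileNotFoundException {
        try (PrintWriter out = new PrintWriter(outfile)) {
            out.print(format(postings));
        }
    }
}
```

Keeping format separate from write makes the layout easy to check without touching the file system. Sorting the terms is optional, but it makes the output file easier to inspect by hand.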
