Question: In this assignment, you will create a simplified UNIX Information Retrieval system, a search engine as a concrete instance of a dictionary, and well use

In this assignment, you will create a simplified UNIX Information Retrieval system, a search engine as a concrete instance of a dictionary, and well use it to look up information about Olympic athletes. Youll search how many medals an athlete won overall the competitions. There are two stages in this project. In each stage you will code a dictionary in the C programming language. A binary search tree will be the underlying data structure for both stages. In this assignment the search keys are not guaranteed to be unique. In this assignment we use variants of the binary search tree designed to handle duplicates, i.e. by either dividing nodes using <= >, or by using < > and a linked list for items with same key. You will use a Makefile to direct the compilation of two separate executable programs, one for Stage 1 and one for Stage 2, each of which uses a different variant of the binary search tree. In both stages of the assignment, you will insert records into the dictionary from a file. You will then look up and output the records (data) contained by the dictionary, counting and outputting the num- ber of key comparisons used in the search. 1 You will report on the number of key comparisons used for search, compare the number of key com- parisons used by each stage, and analyse what would have been expected theoretically. The report should cover each file used to initialize the dictionary. You are not required to implement the delete functionality.

Stage 1: In Stage 1 of this assignment, your Makefile will direct the compilation to produce an executable program called dict1. The program dict1 takes two command line arguments: the first argument is the name of the data file used to build the dictionary; the second argument is the name of the output file, containing the data located in the searches. The file consists of an unspecified number of records, one per line, with the following information: ID - Unique number for each athlete Name - Athletes name Sex - M or F Age - Integer Height - In centimeters Weight - In kilograms Team - Team name NOC - National Olympic Committee 3-letter code Games - Year and season Year - Integer Season - Summer or Winter City - Host city Sport - Sport Event - Event Medal - Gold, Silver, Bronze, or NA The file was originally posted in kaggle. You can read more about the dataset here: Data source. Although it is not needed for this assignment, you can click on the overview and kernel tab on the web page to learn more about your dataset. We cleaned the dataset to remove nicknames embedded bewtween double quotes within the field. The field is an alphabetic string of vary- ing length, containing the full name of an athlete. You may assume that this field contains no more than 128 characters. The other columns can be treated simply as the associated field. Build a data structure of strings to save the associated data collected about the athlete. The maximum size of any string can be 128 characters. Each string is separated by a comma ,. It is a standard csv format where the delimiter used is a comma. The dictionary key consists of the field. The is the information sought during lookup. Do not use the , the first column, as key. The ID is part of the . For the purposes of this assignment, you may assume that the input data is well-formatted, that the input file is not empty, and that the maximum length of an input record (a single full line of the csv file) is 512 characters. This number could help you fixing a reading buffer size. In this first stage of the assignment, you will: Construct a binary search tree to store the information contained in the file specified in the command line argument. Each record should be stored in a separate Node. Search the binary search tree for records, based on their keys. The keys are read in from stdin, i.e. from the screen. 2 For testing, it is often convenient to create a file of keys to be searched, one per line, and redirect the input from this file. Use the UNIX operator < for redirecting input from a file. Examples of use: dict1 datafile outputfile then type in keys; or dict1 datafile outputfile < keyfile Your program will look up each key and output the information (the data found) to the output file specified by the second command line parameter. If the key is not found in the tree, you must output the word NOTFOUND. The number of key comparisons performed during both successful and unsuccessful lookups have to be written to stdout. Remember that the entries in the file do not necessarily have unique keys. Your search must locate all keys matching the search key, and output all the data found. In Stage 1 of the assignment you will locate the duplicates by continuing your search until you reach a leaf node, regardless of whether or not you have already found a match or matches. Example output: output file (information): Cornelia Aalten (-Strannood) >ID: 8 Sex: F || Age: 18 || Height: 168 || Weight: NA || Team: Netherlands || NOC: NED || Games: 1932 Summer || Year: 1932 || Season: Summer || City: Los Angeles || Sport: Athletics || Event: Athletics Womens 100 metres || Medal: NA || Cornelia Aalten (-Strannood) >ID: 8 Sex: F || Age: 18 || Height: 168 || Weight: NA || Team: Netherlands || NOC: NED || Games: 1932 Summer || Year: 1932 || Season: Summer || City: Los Angeles || Sport: Athletics || Event: Athletics Womens 4 x 100 metres Relay || Medal: NA || Nir Lipo > NOTFOUND stdout (comparisons): Cornelia Aalten (-Strannood) > 423 Nir Lipo > 401 Note that the key is output to both the file and to stdout, for identification purposes. Also note that the number of comparisons is only output at the end of the search, so there is only one number for key comparisons per key, even when multiple records have been located for that key. The format need not be exactly as above. Variations in whitespace/tabs are permitted. The number of comparisons above has been made up, do not take it as an example of a correct execution.

Stage 2: In Stage 2, you will code a dictionary where all the duplicate keys in the dictionary are returned, as previously, and additionally where the search is more efficient than in Stage 1. Input and output are as for Stage 1, with the information or NOTFOUND written to a file and the number of comparisons made during the search written to stdout. In Stage 2, however, you will structure your tree so that once a key is found, all duplicate keys can be found without further key comparisons. Note that comparing a key to NULL is not a full (costly) key comparison, and is not counted as a key comparison in Stage 2 of this assignment when building the report.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!