Question: FINDING SIMILARITIES BETWEEN TEXT FILES C++
This assignment focuses on building familiarity with streams, operator overloading, and data structures. There are requirements for how the program must behave, along with a checklist to satisfy, but most of the logic and how you organize your file(s) is up to you. In this assignment, you are to create a simple vocabulary comparison tool. The user will be prompted to list as many files as they like (could be 2, 3, 4, 10, etc.), all separated by spaces. These files will then be compared against each other by the words they use, ignoring how many times each word appears. A rough procedure is below:
Every file should be compared against every other file exactly once. Keep in mind there can be an arbitrary number of files!
When two files are compared, all the words they had in common (intersection) should be documented; also, all words that were present in either file (union) should be documented. In documenting the union and intersection, the word frequency does not matter! A word appearing once or a hundred times between two files, for example, should only be documented once.
The words should be stored as all lowercase, so all uppercase letters should be made lowercase.
The words should be stored without any punctuation marks, which we will assume are limited to the following: . , ! ; ? :
A similarity rating is defined as the percentage of overlap: the number of words common to both files (intersection) divided by the number of words appearing in either file (union), expressed as a percentage.
For each pairwise comparison, a file should be created and saved with the name [FileName]_[OtherFileName]_Sim_Score_[PERCENT].txt where [PERCENT] is the similarity score, rounded to the nearest integer, [FileName] is the name of one file without any .s, and [OtherFileName] is the name of the other file without any .s.
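The text-handling steps of the procedure above can be sketched with a few small helpers. This is only one possible arrangement, not a required design: the names normalizeWord, stripDots, outputFileName, and allPairs are hypothetical, and the underscore-separated file name follows the worked example given with the assignment.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Lowercase a raw token and drop the punctuation marks the assignment lists.
// (Hypothetical helper name; the assignment does not prescribe one.)
std::string normalizeWord(const std::string& raw) {
    const std::string punctuation = ".,!;?:";
    std::string word;
    for (unsigned char c : raw) {
        if (punctuation.find(static_cast<char>(c)) == std::string::npos) {
            word += static_cast<char>(std::tolower(c));
        }
    }
    return word;
}

// Remove every '.' from a file name, e.g. "File1.txt" becomes "File1txt".
std::string stripDots(const std::string& name) {
    std::string out;
    for (char c : name) {
        if (c != '.') out += c;
    }
    return out;
}

// Build the output file name for one comparison from the two input names
// and an already rounded percentage.
std::string outputFileName(const std::string& first,
                           const std::string& second,
                           int percent) {
    assert(percent >= 0 && percent <= 100);
    return stripDots(first) + "_" + stripDots(second) +
           "_Sim_Score_" + std::to_string(percent) + ".txt";
}

// List every unordered pair of indices (i, j) with i < j, so that each
// file is compared against every other file exactly once.
std::vector<std::pair<std::size_t, std::size_t>> allPairs(std::size_t count) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t j = i + 1; j < count; ++j) {
            pairs.emplace_back(i, j);
        }
    }
    return pairs;
}
```

With n files, allPairs(n) yields n(n-1)/2 comparisons, so 4 files give 6 output files. Iterating with j starting at i + 1 is what prevents a file from being compared with itself or with the same partner twice.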
Each saved file should be formatted as below:
Similarity Score: [SIMILARITY SCORE]
Union: [ALL WORDS BETWEEN TWO FILES LISTED ALPHABETICALLY WITH NUMBERS COMING FIRST, SEPARATED BY SPACES]
Intersection: [ALL WORDS THAT ARE COMMON BETWEEN THE TWO FILES LISTED ALPHABETICALLY, SEPARATED BY SPACES]
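The three-line layout above could be produced along the following lines. This is a minimal sketch that assumes the words are already normalized and held in std::set<std::string>; writeWords, writeReport, and reportAsString are hypothetical names. Because std::set<std::string> iterates in character-code order, digits sort before lowercase letters, which gives the alphabetical, numbers-first ordering without an explicit sort.

```cpp
#include <cassert>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

// Write the words of a set separated by single spaces, in the set's own
// (sorted) iteration order, with no trailing space.
void writeWords(std::ostream& out, const std::set<std::string>& words) {
    bool first = true;
    for (const std::string& word : words) {
        if (!first) out << ' ';
        out << word;
        first = false;
    }
}

// Write one comparison report in the three-line layout to any output
// stream (a std::ofstream for the real file, or a string stream in tests).
void writeReport(std::ostream& out,
                 int score,
                 const std::set<std::string>& unionWords,
                 const std::set<std::string>& commonWords) {
    assert(score >= 0 && score <= 100);
    out << "Similarity Score: " << score << '\n';
    out << "Union: ";
    writeWords(out, unionWords);
    out << '\n';
    out << "Intersection: ";
    writeWords(out, commonWords);
    out << '\n';
}

// Convenience wrapper that returns the report as a string.
std::string reportAsString(int score,
                           const std::set<std::string>& unionWords,
                           const std::set<std::string>& commonWords) {
    std::ostringstream out;
    writeReport(out, score, unionWords, commonWords);
    return out.str();
}
```

Taking a std::ostream& rather than a file name keeps the formatting logic independent of where the text ends up, so the same function can write to a std::ofstream opened with the generated file name.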
The requirements:
1. You must overload at least 1 operator, but you may overload more. Hint: if you do overload operator+, the return type need not be a reference.
2. You may use std::vector and std::string, but you must also use at least one other data structure, such as std::set.
3. You must assume the user's files will be in the same folder as the .cpp or .exe file, and the files must be saved to the same folder.
4. You may assume the files will not use digits 0-9 nor will they use any punctuation not appearing in the list previously provided.
5. You may assume punctuation marks will only ever be found immediately preceding a word or immediately after a word, with no white space between the word and the punctuation.
6. The words listed in the files you produce must be listed alphabetically.
7. You must carefully document all functions that you write.
As a simple example with only 2 files, consider:
File1.txt: Today is Thursday the first.
File2.txt: If today is Thursday, tomorrow is Friday!
There should then be a file generated called File1txt_File2txt_Sim_Score_38.txt that reads:
Similarity Score: 38
Union: first friday if is the thursday today tomorrow
Intersection: is thursday today
Explanation:
All of the words are: first, friday, if, is, the, thursday, today, tomorrow (8 words).
The common words are: is, thursday, today (3 words).
The similarity score would be 38 (from 3/8 = 37.5% being rounded to 38%).
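The arithmetic in this example can be checked with a small sketch that also shows one way to meet the operator-overloading requirement. The WordSet wrapper, the choice of operator+ for union and operator& for intersection, and the function similarityScore are all assumptions made for illustration; returning 0 for two empty files is likewise an assumption, since the assignment does not cover that case.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>

// Hypothetical wrapper around the set of distinct words in one file.
struct WordSet {
    std::set<std::string> words;
};

// Union: every word that appears in either operand. Returned by value,
// since the result is a new object rather than either operand.
WordSet operator+(const WordSet& left, const WordSet& right) {
    WordSet result = left;
    result.words.insert(right.words.begin(), right.words.end());
    return result;
}

// Intersection: only the words that appear in both operands.
WordSet operator&(const WordSet& left, const WordSet& right) {
    WordSet result;
    std::set_intersection(left.words.begin(), left.words.end(),
                          right.words.begin(), right.words.end(),
                          std::inserter(result.words, result.words.begin()));
    return result;
}

// Similarity score: |intersection| / |union| as a percentage, rounded to
// the nearest integer. Assumption: two empty files score 0.
int similarityScore(const WordSet& left, const WordSet& right) {
    const std::size_t total = (left + right).words.size();
    if (total == 0) return 0;
    const std::size_t common = (left & right).words.size();
    assert(common <= total);
    return static_cast<int>(std::lround(100.0 * common / total));
}
```

For the two example files, the word sets {first, is, the, thursday, today} and {friday, if, is, thursday, today, tomorrow} give a union of 8 words and an intersection of 3, and std::lround(37.5) rounds half away from zero to 38, matching the expected score.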
