Question: Create a program in C++ that compares text files and calculates their similarity by word. The user will be prompted to list as many text

Create a program in C++ that compares text files and calculates their similarity by word. The user will be prompted to list as many text files as they like (3, 18, 2345, etc.), all separated by spaces. The files are then compared against each other by the words they use, ignoring how many times each word appears.
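As a minimal sketch of the input step, the two helpers below split one line of user input into file names and collect the distinct words from a stream. The names splitFileNames and readWords are hypothetical, not part of the assignment, and the sketch assumes words are compared exactly as written (so "The" and "the" would count as different words).

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split one line of user input into file names separated by whitespace.
// (Hypothetical helper name; file names containing spaces are not supported.)
std::vector<std::string> splitFileNames(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> names;
    std::string name;
    while (in >> name) {
        names.push_back(name);
    }
    return names;
}

// Read every whitespace-separated word from a stream into a set.
// A std::set stores each word once, so multiplicity is dropped automatically.
std::set<std::string> readWords(std::istream& in) {
    std::set<std::string> words;
    std::string word;
    while (in >> word) {
        words.insert(word);
    }
    return words;
}
```

To read from disk, one option is to open each name with std::ifstream (from <fstream>) and pass the stream to readWords, after checking that the file opened successfully.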

Every file should be compared against every other file exactly once, regardless of how many files are given.
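One common way to visit every pair exactly once is a nested loop in which the inner index always starts one past the outer index. The sketch below only produces the index pairs; allPairs is a hypothetical helper name.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Return every unordered pair of indices (i, j) with i < j.
// For n files this yields n * (n - 1) / 2 pairs, so no pair is repeated
// and no file is compared with itself.
std::vector<std::pair<std::size_t, std::size_t>> allPairs(std::size_t n) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            pairs.emplace_back(i, j);
        }
    }
    return pairs;
}
```

With 4 files this gives 6 comparisons; with a single file it gives none.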

When two files are compared, all the words they have in common (the intersection) should be documented, as should all the words present in either file (the union). In documenting the union and intersection, word frequency does not matter: a word that appears once or a hundred times across the two files should be documented only once.
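If each file's words are held in a std::set, the standard algorithms std::set_union and std::set_intersection compute both results directly, because a set is already sorted and free of duplicates. This is a sketch under that assumption; the alias WordSet and the function names are hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

using WordSet = std::set<std::string>;  // hypothetical alias

// All words that appear in either set, each listed once.
WordSet wordUnion(const WordSet& a, const WordSet& b) {
    WordSet result;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(result, result.end()));
    return result;
}

// Only the words that appear in both sets.
WordSet wordIntersection(const WordSet& a, const WordSet& b) {
    WordSet result;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(result, result.end()));
    return result;
}
```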

The similarity rating is the size of the intersection (the set of words common to both files) expressed as a percentage of the size of the union (the set of words appearing in either file).

For each pairwise comparison, a file should be created and saved with the name [FileName]_[OtherFileName]_Sim_Score_[PERCENT].txt, where [PERCENT] is the similarity score rounded to the nearest integer, [FileName] is the name of one file without any periods, and [OtherFileName] is the name of the other file without any periods.
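The naming rule can be sketched as below. This version reads "without any periods" literally and deletes every '.' character, so "notes.txt" becomes "notestxt"; if the intent is instead to drop the extension, stripDots would need to change. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <string>

// Remove every '.' from a file name (assumed reading of the naming rule).
std::string stripDots(std::string name) {
    name.erase(std::remove(name.begin(), name.end(), '.'), name.end());
    return name;
}

// Build [FileName]_[OtherFileName]_Sim_Score_[PERCENT].txt.
// std::lround rounds to the nearest integer, with halves rounded away from zero.
std::string outputName(const std::string& first, const std::string& second,
                       double percent) {
    const long rounded = std::lround(percent);
    return stripDots(first) + "_" + stripDots(second) + "_Sim_Score_" +
           std::to_string(rounded) + ".txt";
}
```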

Each saved file should be formatted as follows:

Similarity Score: [SIMILARITY SCORE]

Union: [ALL WORDS BETWEEN TWO FILES LISTED ALPHABETICALLY WITH NUMBERS COMING FIRST, SEPARATED BY SPACES]

Intersection: [ALL WORDS THAT ARE COMMON BETWEEN THE TWO FILES LISTED ALPHABETICALLY, SEPARATED BY SPACES]

The requirements:

1. You must overload at least 1 operator, but you may overload more. Hint: if you do overload operator+, the return type need not be a reference.

2. You may use std::vector and std::string, but you must also use at least one other data structure. There is one that is very natural for this setting.

3. You must assume the user's files will be in the same folder as the .cpp or .exe file, and the files must be saved to the same folder.

4. You may assume the files will not use digits 0-9 nor will they use any punctuation marks.

5. The words listed in the files you produce must be listed alphabetically.
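Requirements 1, 2, and 5 can be met together by wrapping a std::set<std::string> in a small type and overloading operators on it. The sketch below is one possible design, not the expected answer: the type name FileWords and the choice of operator+ for union and operator* for intersection are assumptions. Both operators return by value, as the hint suggests, because each result is a new object.

```cpp
#include <set>
#include <string>

// Hypothetical wrapper around the distinct words of one file.
// std::set keeps the words unique and sorted, which covers the
// alphabetical-order requirement without an explicit sort.
struct FileWords {
    std::set<std::string> words;
};

// Union: start from a copy of a, then add every word of b.
FileWords operator+(const FileWords& a, const FileWords& b) {
    FileWords result = a;
    result.words.insert(b.words.begin(), b.words.end());
    return result;
}

// Intersection: keep the words of a that also occur in b.
FileWords operator*(const FileWords& a, const FileWords& b) {
    FileWords result;
    for (const std::string& word : a.words) {
        if (b.words.count(word) != 0) {
            result.words.insert(word);
        }
    }
    return result;
}
```

With this in place, a comparison reads as (a + b) for the union and (a * b) for the intersection of two FileWords objects.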
