Question:

You will build your language model from a given set of example texts. Because the model is based on trigram counts, you must count how many times each triple of consecutive words appears in each example text. Words are case-sensitive, meaning "she" and "She" are two different words. Although the example texts may contain punctuation, you should not treat it specially. That is, if the file contains the phrase "he, she, I", then the first word is "he,", the second is "she," and the third is "I". Said another way, treat punctuation marks as ordinary characters that are part of the word they are attached to, so "she" and "she," are two different words.

You must write a C++ program that, when built, produces an executable file named hw7a that takes two command-line arguments. The first argument is the name of a text file containing a list of input filenames.

In order to treat the beginning and end of your example files meaningfully during Part B, you will include in the model you create in Part A the special words "", "" (to indicate the start of each document), and "", "" (to indicate the end of each document). In particular, suppose your example text begins with words a b and ends with words c d. Then you must add into your model the four trigrams

"", "", a

"", a, b

c, d, ""

d, "", ""

And you will need to add four similar trigrams for each example text that you process.
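The padding step can be sketched as follows. Because the spellings of the four special words are not legible in the text above, this sketch takes them as parameters through a hypothetical Sentinels struct rather than hard-coding them; countDocument is likewise an illustrative name.

```cpp
#include <array>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Trigram = std::array<std::string, 3>;
using TrigramCounts = std::map<Trigram, int>;

// Hypothetical holder for the two start words and two end words.
// The caller supplies the spellings required by the assignment.
struct Sentinels {
    std::string start1, start2, end1, end2;
};

// Surrounds one document's words with the special words, then counts every
// run of three consecutive entries in the padded sequence.
void countDocument(const std::vector<std::string>& words,
                   const Sentinels& s, TrigramCounts& counts) {
    std::vector<std::string> padded;
    padded.reserve(words.size() + 4);
    padded.push_back(s.start1);
    padded.push_back(s.start2);
    padded.insert(padded.end(), words.begin(), words.end());
    padded.push_back(s.end1);
    padded.push_back(s.end2);
    for (std::size_t i = 0; i + 2 < padded.size(); ++i) {
        ++counts[Trigram{padded[i], padded[i + 1], padded[i + 2]}];
    }
}
```

Padding each document separately keeps trigrams from spanning the boundary between two example texts, while the shared counts map accumulates totals across all of them.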

Each time your program is run, it should build your trigram-based language model by processing each text file specified in the input filename list. What happens after that will depend on the second argument specified at the command line. The second argument is a single letter, and should be one of "a", "r", or "c". Your program should output to the C++ standard output stream (cout) the language model you created, ordering entries as specified by the argument letter as follows:

a - forward alphabetical order. This means that trigrams are output in alphabetical order by the first word in each trigram, using the alphabetical order of the second and then third word in each trigram to break ties.

r - reverse alphabetical order. This means that trigrams are output in descending alphabetical order by the first word in each trigram, using the descending alphabetical order of the second and then third word in each trigram to break ties.

c - count order. This means that trigrams are output in ascending order by frequency, using forward alphabetical ordering of first words and then second and then third words to break ties.
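The three orderings above can be sketched as one helper, assuming "alphabetical" means the ordinary std::string comparison, which places uppercase letters before lowercase ones (consistent with "I do not" preceding "a lot about" in the expected output). The names Entry and orderedEntries are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Trigram = std::array<std::string, 3>;
using TrigramCounts = std::map<Trigram, int>;
using Entry = std::pair<Trigram, int>;

// Returns the model's entries in the order selected by mode: 'a', 'r', or 'c'.
std::vector<Entry> orderedEntries(const TrigramCounts& counts, char mode) {
    // std::map iterates in ascending key order, which is already mode 'a'.
    std::vector<Entry> entries(counts.begin(), counts.end());
    if (mode == 'r') {
        // Descending on every word is the exact reverse of the forward order.
        std::reverse(entries.begin(), entries.end());
    } else if (mode == 'c') {
        // A stable sort on count alone keeps the forward alphabetical order
        // among entries with equal counts, which is the required tie-break.
        std::stable_sort(entries.begin(), entries.end(),
                         [](const Entry& x, const Entry& y) {
                             return x.second < y.second;
                         });
    }
    return entries;
}
```

Relying on the map's built-in ordering avoids writing a separate comparator for modes a and r. Any mode other than 'r' or 'c' falls through to forward order here, so the argument should be validated before this is called.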

Your output will consist of one trigram with associated count per line. On a given line, the 4 outputs (trigramWord1, trigramWord2, trigramWord3, and count) should be separated by single spaces.

Example

Suppose the list of training texts input for your program resides in a file named tiny_ex.txt, and the contents of the file are names of text files containing excerpts from Dr. Seuss books as follows:

sl.txt

ge.txt

For the command ./hw7a tiny_ex.txt a, the expected output is:

I 1

theyve 1

I do 1

theyve talked 1

Clause. 1

I do not 2

Santa Clause. 1

a lot about 1

about flaws. theyve 1

about gauze. theyve 1

about laws and 1

about old Santa 1

about paws and 1

and theyve talked 2

anywhere 1

do not like 2

flaws. theyve talked 1

gauze. theyve talked 1

here or there 1

laws and theyve 1

like them anywhere 1

like them here 1

lot about old 1

not like them 2

old Santa Clause. 1

or there I 1

paws and theyve 1

quite a lot 1

talked about flaws. 1

talked about gauze. 1

talked about laws 1

talked about paws 1

talked quite a 1

them anywhere 1

them here or 1

there I do 1

theyve talked about 4

theyve talked quite 1

For the command ./hw7a tiny_ex.txt c, the expected output is:

I 1

theyve 1

I do 1

theyve talked 1

Clause. 1

Santa Clause. 1

a lot about 1

about flaws. theyve 1

about gauze. theyve 1

about laws and 1

about old Santa 1

about paws and 1

anywhere 1

flaws. theyve talked 1

gauze. theyve talked 1

here or there 1

laws and theyve 1

like them anywhere 1

like them here 1

lot about old 1

old Santa Clause. 1

or there I 1

paws and theyve 1

quite a lot 1

talked about flaws. 1

talked about gauze. 1

talked about laws 1

talked about paws 1

talked quite a 1

them anywhere 1

them here or 1

there I do 1

theyve talked quite 1

I do not 2

and theyve talked 2

do not like 2

not like them 2

theyve talked about 4
