Question:
Task 1 Analysing n-grams in a sample text (NgramAnalyser)
For this task, you will need to complete the NgramAnalyser class, and add code to the ProjectTest class. The NgramAnalyser class analyses an input string, passed to it in the constructor, and counts all the n-grams of characters that occur in the string. An n-gram is simply a (contiguous) sequence of n items from a piece of text; the items we will be considering for this class are characters. (One could also analyse n-grams of words, syllables, or even sentences.) For instance, a 2-gram (also called a bigram) is a pair of characters, a 3-gram is a triple of characters, and so on.
By way of example, consider the following string:
"the rain in Spain"
The alphabet size is 10 (unique characters including spaces) and the 2-grams in the string are:
"th", "he", "e ", " r", "ra", "ai", "in", "n ", " i", "in", "n ", " S", "Sp", "pa", "ai", "in", "nt"
If we remove duplicates, they are:
"th", "he", "e ", " r", "ra", "ai", "in", "n ", " i", " S", "Sp", "pa", "nt"
And if we count how often each 2-gram appears in the input string, we get the following results:
| 2-gram | Frequency |
|---|---|
| " S" | 1 |
| " i" | 1 |
| " r" | 1 |
| "Sp" | 1 |
| "ai" | 2 |
| "e " | 1 |
| "he" | 1 |
| "in" | 3 |
| "n " | 2 |
| "pa" | 1 |
| "ra" | 1 |
| "th" | 1 |
| "nt" | 1* |
The NgramAnalyser class is given a string as input to its constructor, and optionally an n-gram size n. It should analyse the n-grams in the input string, and record their frequencies in the hash-map ngram. It should also record the total number of distinct characters that appear in the input string (i.e., the alphabet used by the input string), and store this count in the field alphabetSize.
The analyses performed by methods in this class should consider the input string to wrap around: i.e., the first character in the string should be considered to follow the last character. Otherwise, there would be a number of n-grams at the end of the string which are shorter than n, and which would have to be padded out in some way. More precisely: when counting frequencies of distinct n-grams in the input string, the final n-1 n-grams in the input string should have added to them a sequence of contiguous characters from the start of the string in order to ensure they are of length n. For instance, in the example given above: for the purposes of calculating n-gram frequencies, the last 2-gram in the input text would be "nt" (i.e., it is as if the text ended in "Spaint"). And in the string "abbc", the 3-grams would be "abb", "bbc", "bca", and "cab".
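The wrap-around rule can be sketched with modular indexing. The following is a minimal illustration rather than the required implementation; the class name WrapAroundDemo and its two methods are hypothetical and do not appear in the supplied skeleton.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helper; the class and method names are hypothetical, not part of the skeleton. */
final class WrapAroundDemo {
    /** Lists the n-grams of text by starting position, wrapping past the end of the string. */
    static List<String> wrappedNgrams(int n, String text) {
        List<String> grams = new ArrayList<>();
        int len = text.length();
        for (int start = 0; start < len; start++) {
            StringBuilder gram = new StringBuilder(n);
            for (int offset = 0; offset < n; offset++) {
                // The modulo sends indices past the last character back to the front.
                gram.append(text.charAt((start + offset) % len));
            }
            grams.add(gram.toString());
        }
        return grams;
    }

    /** Counts how often each wrapped n-gram occurs. */
    static Map<String, Integer> frequencies(int n, String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String gram : wrappedNgrams(n, text)) {
            // merge inserts 1 for a new key, otherwise adds 1 to the stored count.
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wrappedNgrams(3, "abbc")); // prints [abb, bbc, bca, cab]
        System.out.println(frequencies(2, "the rain in Spain").get("in")); // prints 3
    }
}
```

One n-gram starts at every character position, so no n-gram is ever shorter than n and no separate padding step is needed.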
Given this background information, complete the following sub-tasks. You may wish to complete sub-task (h) before commencing sub-task (g).
Fill in the @author and @version fields in the header comments with your name, student number and the date.
Complete the code for the constructor NgramAnalyser(int n, String inp). In addition to counting frequencies of n-grams, the constructor should record the number of distinct characters encountered, and store this in the alphabetSize field.
You should not change the case (upper or lower) of the input string, and should process all characters including spaces and punctuation. The constructor should throw an unchecked IllegalArgumentException if the input fields are unsuitable. The following cases are considered unsuitable: the input string is the empty string, the input string is null, the given n-gram size is 0, or n is greater than the length of the input string.
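The unsuitable cases can be checked up front, before any counting begins. The sketch below is one possible arrangement; the class NgramArgs, the method validate and the message strings are hypothetical, and rejecting negative sizes along with 0 is an assumption that goes slightly beyond the cases the task lists.

```java
/** Illustrative argument check; the class and method names are hypothetical. */
final class NgramArgs {
    /** Throws IllegalArgumentException for the unsuitable cases described in the task. */
    static void validate(int n, String inp) {
        // Test for null first so the later isEmpty() and length() calls cannot fail.
        if (inp == null) {
            throw new IllegalArgumentException("input string must not be null");
        }
        if (inp.isEmpty()) {
            throw new IllegalArgumentException("input string must not be empty");
        }
        // Assumption: negative sizes are rejected too, although the task only names 0.
        if (n <= 0) {
            throw new IllegalArgumentException("n-gram size must be positive");
        }
        if (n > inp.length()) {
            throw new IllegalArgumentException("n-gram size exceeds input length");
        }
    }
}
```

IllegalArgumentException is unchecked, so the constructor needs no throws clause in its signature, though documenting it with @throws is still helpful.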
Complete the code for the getAlphabetSize() method. The alphabet size is the number of distinct characters in the input string.
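Counting distinct characters can be illustrated with a set, which ignores repeated insertions. This is a standalone sketch with a hypothetical class name (AlphabetDemo); in the real class the count would be computed once in the constructor and stored in alphabetSize.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative helper; the class and method names are hypothetical. */
final class AlphabetDemo {
    /** Returns the number of distinct characters in text, case-sensitively. */
    static int alphabetSize(String text) {
        Set<Character> seen = new HashSet<>();
        for (int i = 0; i < text.length(); i++) {
            // Adding a character that is already present leaves the set unchanged.
            seen.add(text.charAt(i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(alphabetSize("the rain in Spain")); // prints 10
    }
}
```

The ten characters are t, h, e, the space, r, a, i, n, S and p; upper-case S counts separately from any lower-case s because case is preserved.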
Complete the code for the getNgramFrequency(String ngram) method. This method should return the frequency with which a particular n-gram appears in the text. If it does not appear at all, return 0.
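One detail worth noting is that Map.get returns null for an absent key, and unboxing null into an int throws a NullPointerException. A small sketch of a lookup that returns 0 instead, using a hypothetical helper class (LookupDemo) and a hand-built map:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative helper; the class and method names are hypothetical. */
final class LookupDemo {
    /** Returns the stored count for ngram, or 0 when it was never recorded. */
    static int frequencyOf(Map<String, Integer> counts, String ngram) {
        return counts.getOrDefault(ngram, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("in", 3);
        System.out.println(frequencyOf(counts, "in")); // prints 3
        System.out.println(frequencyOf(counts, "zz")); // prints 0
    }
}
```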
Complete the code for the getDistinctNgramCount() method. This should return the number of distinct n-grams which appear in the input text.
Complete the code for the getNgramCount() method. This should return the total number of n-grams which appear in the input text, i.e., every occurrence is counted, so duplicate n-grams are counted separately.
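The difference between the two counts can be seen on a small frequency map. In the sketch below (the class name CountDemo is hypothetical) the distinct count is the number of keys, while the total is the sum of the stored frequencies. Because the string wraps around, exactly one n-gram starts at each character, so the total also equals the length of the input string.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative helper; the class and method names are hypothetical. */
final class CountDemo {
    /** Number of different n-grams: one per key. */
    static int distinctCount(Map<String, Integer> counts) {
        return counts.size();
    }

    /** Number of n-gram occurrences: the sum of all frequencies. */
    static int totalCount(Map<String, Integer> counts) {
        int total = 0;
        for (int frequency : counts.values()) {
            total += frequency;
        }
        return total;
    }

    public static void main(String[] args) {
        // Wrapped 2-grams of "abab" are ab, ba, ab, ba.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("ab", 2);
        counts.put("ba", 2);
        System.out.println(distinctCount(counts)); // prints 2
        System.out.println(totalCount(counts));    // prints 4
    }
}
```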
Complete the code for the toString() method. This should return a string consisting of:
a header line, containing the size n of the n-grams that are being counted; and
one line for each distinct n-gram encountered. Each line should contain the characters appearing in the n-gram, then a space character, then the number of times that the n-gram appears in the input text. The lines should appear in alphabetic order of n-gram text.
For instance, if the string "abbc" were analysed, and frequencies of 3-grams were counted, then the output would be a string containing the following lines. Your output should match the spaces and line breaks exactly as shown in this example.

    3
    abb 1
    bbc 1
    bca 1
    cab 1
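A HashMap has no defined iteration order, so the n-grams need sorting before they are printed. The sketch below copies the counts into a TreeMap to obtain sorted keys. It rests on two assumptions the task does not spell out: that "alphabetic order" means the natural ordering of String (so the space and upper-case letters sort before lower-case ones, as in the frequency table earlier), and that lines are separated by "\n" with no trailing line break. The class name SummaryDemo and the method summarise are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative helper; the class and method names are hypothetical. */
final class SummaryDemo {
    /** Builds a header line holding n, then one "ngram count" line per key in sorted order. */
    static String summarise(int n, Map<String, Integer> counts) {
        // A TreeMap iterates its keys in String natural order.
        Map<String, Integer> sorted = new TreeMap<>(counts);
        StringBuilder out = new StringBuilder();
        out.append(n);
        for (Map.Entry<String, Integer> entry : sorted.entrySet()) {
            // Assumption: "\n" separates lines and nothing follows the last line.
            out.append('\n').append(entry.getKey()).append(' ').append(entry.getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("cab", 1);
        counts.put("abb", 1);
        counts.put("bca", 1);
        counts.put("bbc", 1);
        System.out.println(summarise(3, counts));
    }
}
```

Sorting at output time keeps the stored field a plain HashMap, which fits the restriction on changing the types of existing fields.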
Complete the code for the testSensibleToStringSize test in the ProjectTest class. The minimum number of distinct n-grams that could appear in an input string is equal to the alphabet size of the string (since if a character appears at all in the string, and if we consider the string to wrap around, then it follows there is at least one distinct n-gram starting with that character). Therefore, the minimum number of lines that should be contained in the string returned by the toString method, when the input string is of non-zero length and n is greater than zero, is 1 (for the header line) plus the size of the alphabet for the string.
Write code for a test which ensures the result of the toString method contains at least this number of lines.
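The comparison at the heart of such a test can be sketched without JUnit so that it runs on its own. The class LineCountCheck and its methods are hypothetical, and the summary string is hard-coded from the "abbc" example rather than produced by NgramAnalyser. In ProjectTest the same comparison would sit inside an assertTrue call.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative check; the class and method names are hypothetical. */
final class LineCountCheck {
    /** Counts the lines in a summary; split drops any trailing empty strings. */
    static int lineCount(String summary) {
        return summary.split("\n").length;
    }

    /** One header line plus one line per distinct character of the input. */
    static int minimumLines(String input) {
        Set<Character> alphabet = new HashSet<>();
        for (char c : input.toCharArray()) {
            alphabet.add(c);
        }
        return 1 + alphabet.size();
    }

    public static void main(String[] args) {
        String summary = "3\nabb 1\nbbc 1\nbca 1\ncab 1"; // expected output for "abbc", n = 3
        System.out.println(lineCount(summary) >= minimumLines("abbc")); // prints true
    }
}
```

For "abbc" the alphabet is {a, b, c}, so the bound is 4 lines, and the five-line summary satisfies it.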
Ensure you complete Javadoc comments for each method in both the NgramAnalyser and ProjectTest classes, including @param, @return and @throws fields as required.
You may add additional methods to the NgramAnalyser class if necessary, but should not add additional fields, nor change the type or name of the existing fields, nor change the signatures of the existing methods.
The testGetDistinctNgrams test in the ProjectTest class has been left incomplete. You are not required to add code to this method, but it is strongly suggested you try to think of tests you could apply to the results of your getDistinctNgrams method. For instance, for a given input string, can you think of some number which the size of the set returned by getDistinctNgrams should never be below? Is there some number it should never exceed? You are also encouraged to add other tests, if you can think of them (but ensure they compile correctly; do not submit code that does not compile).
Ngram analyser class:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Set;
    import java.util.HashSet;
    import java.util.Arrays;

    /**
     * Perform n-gram analysis of a string.
     *
     * Analyses the frequency with which distinct n-grams, of length n,
     * appear in an input string. For the purposes of all analyses of the input
     * string, the final n-1 n-grams appearing in the string should be
     * "filled out" to a length of n characters, by adding
     * a sequence of contiguous characters from the start of the string.
     * e.g. "abbc" includes "bca" and "cab" in its 3-grams
     *
     * @author (your name)
     * @version (a version number or a date)
     */
    public class NgramAnalyser
    {
        /** dictionary of all distinct n-grams and their frequencies */
        private HashMap<String,Integer> ngram;

        /** number of distinct characters in the input */
        private int alphabetSize;

        /** n-gram size for this object (new field) */
        private int ngramSize;

        /**
         * Analyse the frequency with which distinct n-grams, of length n,
         * appear in an input string.
         * n-grams at the end of the string wrap to the front
         * e.g. "abbbbc" includes "bca" and "cab" in its 3-grams
         * @param n size of n-grams to create
         * @param inp input string to be modelled
         */
        public NgramAnalyser(int n, String inp)
        {
            //TODO replace this line with your code
        }

        /**
         * Analyses the input text for n-grams of size 1.
         */
        public NgramAnalyser(String inp)
        {
            this(1,inp);
        }

        /**
         * @return int the size of the alphabet of a given input
         */
        public int getAlphabetSize()
        {
            //TODO replace this line with your code
            return -1;
        }

        /**
         * @return the total number of distinct n-grams appearing
         *         in the input text.
         */
        public int getDistinctNgramCount()
        {
            //TODO replace this line with your code
            return -1;
        }

        /**
         * @return Return a set containing all the distinct n-grams
         *         in the input string.
         */
        public Set<String> getDistinctNgrams()
        {
            //TODO replace this line with your code
            return null;
        }

        /**
         * @return the total number of n-grams appearing
         *         in the input text (not requiring them to be distinct)
         */
        public int getNgramCount()
        {
            //TODO replace this line with your code
            return -1;
        }

        /** Return the frequency with which a particular n-gram appears
         * in the text. If it does not appear at all, return 0.
         *
         * @param ngram The n-gram to get the frequency of
         * @return The frequency with which the n-gram appears.
         */
        public int getNgramFrequency(String ngram)
        {
            //TODO replace this line with your code
            return -1;
        }

        /**
         * Generate a summary of the ngrams for this object.
         * @return a string representation of the n-grams in the input text
         *         comprising the ngram size and then each ngram and its frequency
         *         where ngrams are presented in alphabetical order.
         */
        public String toString()
        {
            //TODO replace this line with your code
            return null;
        }
    }
Project test class:

    import static org.junit.Assert.*;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    /**
     * The test class ProjectTest for student test cases.
     * Add all new test cases to this task.
     *
     * @author (your name)
     * @version (a version number or a date)
     */
    public class ProjectTest
    {
        /**
         * Default constructor for test class ProjectTest
         */
        public ProjectTest()
        {
        }

        /**
         * Sets up the test fixture.
         *
         * Called before every test case method.
         */
        @Before
        public void setUp()
        {
        }

        /**
         * Tears down the test fixture.
         *
         * Called after every test case method.
         */
        @After
        public void tearDown()
        {
        }

        //TODO add new test cases from here include brief documentation

        @Test(timeout=1000)
        public void testSensibleToStringSize()
        {
            assertEquals(0,1); //TODO replace with test code
        }

        @Test(timeout=1000)
        public void testGetDistinctNgrams()
        {
            assertEquals(0,1); //TODO replace with test code
        }

        @Test(timeout=1000)
        public void testLaplaceExample()
        {
            assertEquals(0,1); //TODO replace with test code
        }

        @Test(timeout=1000)
        public void testSimpleExample()
        {
            assertEquals(0,1); //TODO replace with test code
        }

        @Test
        public void testTask3example()
        {
            MarkovModel model = new MarkovModel(2,"aabcabaacaac");
            ModelMatcher match = new ModelMatcher(model,"aabbcaac");
            assertEquals(0,1); //TODO replace with test code
        }
    }
