Strings are a stark data structure that allows random access to the linear text stored within, the
Question:
Strings are a stark data structure that allows random access to the linear text stored within, the universal and portable representation of all information. Under the hood, strings are character arrays, but the primitive nature of such arrays is hidden behind a better interface of public methods of the String class. Surprising algorithms have been developed over the years to quickly search and analyze both the string itself and the semantic information encoded in it. Even the seemingly simple task of searching for a pattern inside the given text can be optimized in countless ways, let alone various higher level tasks that are performed on these string instruments. Occasionally the string data structure can be augmented with additional data that speeds up operations that would be inefficient if given only the raw and unadorned text string itself. Even though this data in principle adds nothing new to the mix in that its structure is fully determined by the original string itself, having that information available can provide significant speedups as yet another form of space-time tradeoff. This lab showcases a simple but powerful suffix array method as a way to preprocess any text so that after the metaphorical cheque for the one-time payment of this preprocessing has cleared, all future searches of arbitrary patterns can be executed in time that grows only logarithmically with respect to the length of the text. This makes these pattern searches blazingly fast even when performed on the entire War and Peace! Computed from the given text string with n characters, its suffix array is an n-element array of integers that contains the position indices 0, ..., n-1 to the text. Since each position is stored in the suffix array exactly once, the suffix array is automatically some permutation of order n. The suffix array lists these n positions sorted in the lexicographic order (also called the "dictionary order") of the suffixes that start from each position. Note that the presentation on the linked Wikipedia page on suffix arrays uses one-based position indexing to the string, whereas here we naturally use Java's zero-based indexing to remain consistent to our principles. For example, the non-empty suffixes of the string "hello" are "hello", "ello", "llo", "lo" and "o", corresponding to positions ranging from 0 to 4 in the original string. Sorting these positions to the lexicographic order of the corresponding suffixes gives these positions in the order [1, 0, 2, 3, 4], which then becomes the suffix array of the original string. Since all n suffixes of text have a different length, the possibility of equality in their lexicographic comparison can never arise. This guarantees the uniqueness of the suffix array for the given text. For reasons of simplicity and convenience, suffix arrays are represented in this lab as instances of some subtype of List, despite the use of the standard technical term of "suffix array". I need to make a new class named P2J11, and there first the method public static List buildSuffixArray(String text) that builds and returns the suffix array for the given text. In this lab, this can be done with the naive algorithm, since the structure of our test case strings guarantees that the worst case scenario of this algorithm, a string whose all characters are the same, is never realized. The easiest way to implement this naive algorithm in Java should be to define yet another custom subtype of Comparator whose method compareTo lexicographically compares the two substrings that start from the positions that it receives as parameters. In this method, first define the local variable ArrayList result, and fill it up the brim with the integers from 0 to n-1. Then, just use the utility method Collections.sort to sort this result list with your custom Comparator. This discipline should allow your buildSuffixArray method to be reasonably fast even when building the suffix array for the entire War and Peace, as performed by the JUnit test class using the warandpeace.txt data file. Once the suffix array has been constructed for the given fixed text, it can be used to rapidly find all positions inside text where the given pattern occurs. These would be precisely the very same positions whose suffixes start with that pattern! Having already done the work of sorting these suffixes in lexicographic order in the previous preprocessing step allows us to use a slightly modified binary search performed to the suffix array to find the lexicographically first such suffix. Since all suffixes that start with pattern must necessarily follow each other as a bunch in the sorted array of such suffixes, looping through all these positions is a straightforward while-loop once the binary search has determined the first such position. To achieve all that, write the second method public static List find(String pattern, String text, List suffix) that creates and returns a list of positions of the original text that contain the given pattern, using the given suffix array to speed up this search. Note that this returned list of positions must be sorted in ascending order of the positions themselves, instead of the lexicographic order of the suffixes of text that start at these positions.
Practical Introduction To Data Structures And Algorithm Analysis Java Edition
ISBN: 9780136609117
1st Edition
Authors: Clifford A. Shaffer