Question: Overview Reads a plain text file Builds a Huffman tree from the character frequencies in the file Outputs the original length in bits, Huffman compression
Overview
- Reads a plain text file
- Builds a Huffman tree from the character frequencies in the file
- Outputs the original length in bits, Huffman compression length in bits, and the alphabet|frequency|code_length table.
Input
- The program accepts the input filename as a command-line argument: ./a.out input_filename.txt
- The input filename will contain less than 100,000 characters with ASCII values in the range [0,127] (ASCII Table Reference Link).
- Every single character in the input file will be included in the frequency counts and encoded message.
- To match the examples, be sure to use *nix-style line endings (LF not CRLF) for the input text files. The line endings can easily be changed in VS Code by clicking the CRLF/LF on the status bar on the bottom right of the editor.
- Two example input files are downloadable below under Submission Instructions.
Implementation Details
- Compute the frequency of each character in the file.
- Remember that each character is also an int with the int value equal to its ASCII value.
- Consider making an array to count each character's appearances:
- frequencies['a'] is the same as frequencies[97]
- If you want to know the frequency of a character, then frequencies[ch] could give you the frequency.
- Build a Huffman tree with each character that appears in the file as a leaf node in the tree.
- You will need to build a Min-Heap to implement the Huffman tree algorithm.
- An efficient array-based min-heap (O(lg n) insert and extract-min) is required.
- The min-heap should contain Huffman tree nodes with the frequency as the keys.
- The min-heap can be implemented as either a 0-based or 1-based array. The formulae are as follows:
- 0-based:
- parent = (child-1)/2
- leftChild = parent*2 + 1
- rightChild = parent*2 + 2
- 1-based (like the slides):
- parent = child/2
- leftChild = parent*2
- rightChild = parent*2 + 1
- 0-based:
- Construct a table containing the string binary encoding for each character.
- The encoding for each character is not guaranteed to be unique and will depend on the implementation; however, the compressed length of each file should be the same if the implementation is correct.
- Compute the binary encoding of the input file.
- Build all the data structures yourself, following the pseudocode from the slides (and potentially the ZyBook).
- Use good object-based organization, using classes in an appropriate way.
- You are encouraged to (and will likely) use recursion; however, be sure the recursion works and does not crash the program when run on large inputs from a stack overflow.
- Only include iostream, string, fstream, sstream, vector.
Output
Output the following information in the same format as the examples below:
- The input file's uncompressed length in bits (string_length * 8).
- The input file's compressed length in bits (string length of encoded string of 1s and 0s).
- The input file's alphabet, frequency, and code table in the following format: 'char'|frequency|binary_encoding_length
- The binary encoding length of a character is the encoding's number of bits. If the binary code is 0101, then the length is 4.
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
