Question: Overview Reads a plain text file Builds a Huffman tree from the character frequencies in the file Outputs the original length in bits, Huffman compression

Overview

  1. Reads a plain text file
  2. Builds a Huffman tree from the character frequencies in the file
  3. Outputs the original length in bits, Huffman compression length in bits, and the alphabet|frequency|code_length table.

Input

  • The program accepts the input filename as a command-line argument: ./a.out input_filename.txt
  • The input filename will contain less than 100,000 characters with ASCII values in the range [0,127] (ASCII Table Reference Link).
  • Every single character in the input file will be included in the frequency counts and encoded message.
  • To match the examples, be sure to use *nix-style line endings (LF not CRLF) for the input text files. The line endings can easily be changed in VS Code by clicking the CRLF/LF on the status bar on the bottom right of the editor.
  • Two example input files are downloadable below under Submission Instructions.

Implementation Details

  • Compute the frequency of each character in the file.
    • Remember that each character is also an int with the int value equal to its ASCII value.
    • Consider making an array to count each character's appearances:
      • frequencies['a'] is the same as frequencies[97]
      • If you want to know the frequency of a character, then frequencies[ch] could give you the frequency.
  • Build a Huffman tree with each character that appears in the file as a leaf node in the tree.
  • You will need to build a Min-Heap to implement the Huffman tree algorithm.
    • An efficient array-based min-heap (O(lg n) insert and extract-min) is required.
    • The min-heap should contain Huffman tree nodes with the frequency as the keys.
    • The min-heap can be implemented as either a 0-based or 1-based array. The formulae are as follows:
      • 0-based:
        • parent = (child-1)/2
        • leftChild = parent*2 + 1
        • rightChild = parent*2 + 2
      • 1-based (like the slides):
        • parent = child/2
        • leftChild = parent*2
        • rightChild = parent*2 + 1
  • Construct a table containing the string binary encoding for each character.
    • The encoding for each character is not guaranteed to be unique and will depend on the implementation; however, the compressed length of each file should be the same if the implementation is correct.
  • Compute the binary encoding of the input file.
  • Build all the data structures yourself, following the pseudocode from the slides (and potentially the ZyBook).
    • Use good object-based organization, using classes in an appropriate way.
    • You are encouraged to (and will likely) use recursion; however, be sure the recursion works and does not crash the program when run on large inputs from a stack overflow.
  • Only include iostream, string, fstream, sstream, vector.

Output

Output the following information in the same format as the examples below:

  1. The input file's uncompressed length in bits (string_length * 8).
  2. The input file's compressed length in bits (string length of encoded string of 1s and 0s).
  3. The input file's alphabet, frequency, and code table in the following format: 'char'|frequency|binary_encoding_length
    • The binary encoding length of a character is the encoding's number of bits. If the binary code is 0101, then the length is 4.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Mathematics Questions!