Lab #6

HUFFMAN CODING

Huffman coding is an algorithm devised by David A. Huffman of MIT in 1952 for compressing text data to make a file smaller (fewer bytes). This relatively simple algorithm is powerful enough that variations of it are still used today in computer networks, fax machines, modems, HDTV, and other areas. Normally text data is stored in a standard format of 8 bits per character, using an encoding called ASCII that maps every character to a binary integer value from 0-255. The idea of Huffman coding is to abandon the rigid 8-bits-per-character requirement and use variable-length binary encodings for different characters. The advantage of doing this is that very frequently occurring characters (say, 'e' in English text) can be given very short encodings (fewer bits). Although this means some characters may need more than 8 bits, the tradeoff is often worth it because those characters occur very infrequently. Refer to the Lecture 11/8/23 for an introduction to Huffman coding.

The steps involved in Huffman encoding are:

1. Count character frequencies: Make a quick pass through the file to be encoded, counting how many of each distinct character you find in the text.
2. Build encoding tree: Build a binary tree using a specific queue-based algorithm.
3. Build encoding map: Traverse the binary tree to record the binary encoding of each character in a map for future quick-access use.
4. Encode the file: Do a second pass through the file, looking up each byte (each character) in the map from Step 3 to get the encoding for each byte.
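The small sketches accompanying the steps below are written in C purely for illustration; the lab itself is in ARM assembly. The type and constant names (HuffNode, NCHARS, PSEUDO_EOF) are assumptions made for these sketches, not part of the assignment.

#include <stddef.h>

#define NCHARS     257          /* 256 possible byte values plus one pseudo-character for EOF */
#define PSEUDO_EOF 256          /* the "character" value used for the EOF marker              */

typedef struct HuffNode {
    int ch;                     /* character value, or -1 for an internal node                */
    unsigned count;             /* frequency; for internal nodes, the sum of the children     */
    struct HuffNode *left;      /* the 0 branch                                               */
    struct HuffNode *right;     /* the 1 branch                                               */
} HuffNode;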

STEP 1: COUNT CHARACTER FREQUENCIES

Suppose we have a file named example.txt whose contents are "ab ab cab". In the original file, this text occupies 10 bytes (80 bits) of data; the 10th is a special "end-of-file" (EOF) byte.

Byte    1         2         3         4         5         6         7         8         9         10
Char    'a'       'b'       ' '       'a'       'b'       ' '       'c'       'a'       'b'       EOF
ASCII   97        98        32        97        98        32        99        97        98        256
Binary  01100001  01100010  00100000  01100001  01100010  00100000  01100011  01100001  01100010  N/A

In Step 1 of Huffman's algorithm, a count of each character is computed. The counts are represented as a map: {' ':2, 'a':3, 'b':3, 'c':1, EOF:1}
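A minimal sketch of Step 1 in C, reusing the NCHARS and PSEUDO_EOF definitions above (count_frequencies is a hypothetical helper name):

#include <stdio.h>

/* Count how many times each byte value occurs in the input file. */
void count_frequencies(FILE *in, unsigned counts[NCHARS]) {
    for (int i = 0; i < NCHARS; i++)
        counts[i] = 0;
    int c;
    while ((c = fgetc(in)) != EOF)
        counts[c]++;
    counts[PSEUDO_EOF] = 1;     /* exactly one EOF marker per file */
}

For example.txt this produces the map shown above: {' ':2, 'a':3, 'b':3, 'c':1, EOF:1}.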

STEP 2: BUILD ENCODING TREE

Step 2 of Huffman's algorithm places our counts into binary tree nodes, with each node storing a character and a count of its occurrences. The nodes are then put into a sorted list, which keeps them in prioritized order with smaller counts having higher priority, so that characters with lower counts will come out of the list sooner. (The priority list is somewhat arbitrary in how it breaks ties, such as 'c' being before EOF and 'a' being before 'b'.)

head
+-----+   +-----+   +-----+   +-----+   +-----+
| 'c' |   | EOF |   | ' ' |   | 'a' |   | 'b' |
|  1  |   |  1  |   |  2  |   |  3  |   |  3  |
+-----+   +-----+   +-----+   +-----+   +-----+

Now the algorithm repeatedly removes the two nodes from the front of the list (the two with the smallest frequencies) and joins them into a new node whose frequency is their sum. The two removed nodes become children of the new node; the first removed becomes the left child, and the second the right. The new node is re-inserted into the list in sorted order. This process is repeated until the list contains only one binary tree node with all the others as its children. This will be the root of our finished Huffman tree, shown below. Notice that the nodes with low frequencies end up far down in the tree, and nodes with high frequencies end up near the root of the tree. This structure can be used to create an efficient encoding in the next step.

                    (10)
                  /       \
             (4)            (6)
            /    \         /    \
        ' ':2    (2)   'a':3   'b':3
                /    \
             'c':1  EOF:1
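A sketch of Step 2 in C, using a simple sorted array in place of the priority list (build_tree and insert_sorted are illustrative names; HuffNode is the type from the earlier sketch):

#include <stdlib.h>

/* Insert a node into the list, keeping it sorted by count (stable for ties). */
static void insert_sorted(HuffNode **list, int *n, HuffNode *node) {
    int i = *n;
    while (i > 0 && list[i - 1]->count > node->count) {
        list[i] = list[i - 1];
        i--;
    }
    list[i] = node;
    (*n)++;
}

HuffNode *build_tree(const unsigned counts[NCHARS]) {
    HuffNode *list[NCHARS];
    int n = 0;

    /* One leaf node for every character that actually occurs. */
    for (int ch = 0; ch < NCHARS; ch++) {
        if (counts[ch] == 0)
            continue;
        HuffNode *leaf = malloc(sizeof *leaf);
        leaf->ch = ch;
        leaf->count = counts[ch];
        leaf->left = leaf->right = NULL;
        insert_sorted(list, &n, leaf);
    }

    /* Join the two front nodes until only the root remains. */
    while (n > 1) {
        HuffNode *a = list[0], *b = list[1];    /* the two smallest counts */
        for (int i = 2; i < n; i++)
            list[i - 2] = list[i];
        n -= 2;

        HuffNode *parent = malloc(sizeof *parent);
        parent->ch = -1;                        /* internal node           */
        parent->count = a->count + b->count;
        parent->left = a;                       /* first removed  -> left  */
        parent->right = b;                      /* second removed -> right */
        insert_sorted(list, &n, parent);
    }
    return list[0];
}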

STEP 3: BUILD ENCODING MAP

Step 3 of Huffman's algorithm takes the tree from Step 2 and performs a traversal of it to harvest all the binary sequences that will be the encodings for each character. The binary sequence for each character is derived from the binary tree by treating each left branch as a bit value of 0 and each right branch as a bit value of 1. The code for each character can be determined by traversing the tree. To reach ' ' (space), we go left twice from the root, so the code for ' ' is 00. The code for 'c' is 010, the code for EOF is 011, the code for 'a' is 10, and the code for 'b' is 11. By traversing the tree, we can produce a map from characters to their binary representations. Though the binary representations look like integers, they consist of binary digits and can be of arbitrary length, so we will store them as strings. For this tree, the map is: {' ':"00", 'a':"10", 'b':"11", 'c':"010", EOF:"011"}
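A sketch of Step 3 in C: a recursive traversal that appends '0' when going left and '1' when going right, recording the accumulated bit string at each leaf (build_map is an illustrative name, and the 64-character limit on code length is an assumption of this sketch):

#include <string.h>

/* prefix is a scratch buffer holding the bits walked so far; codes[] is indexed
 * by character value and receives the finished bit string for each leaf. */
void build_map(const HuffNode *node, char *prefix, int depth, char codes[NCHARS][64]) {
    if (node->left == NULL && node->right == NULL) {   /* leaf: record its code */
        prefix[depth] = '\0';
        strcpy(codes[node->ch], prefix);
        return;
    }
    prefix[depth] = '0';                               /* left branch  = 0 */
    build_map(node->left, prefix, depth + 1, codes);
    prefix[depth] = '1';                               /* right branch = 1 */
    build_map(node->right, prefix, depth + 1, codes);
}

Called as build_map(root, prefix, 0, codes) with a 64-byte prefix buffer, this fills in exactly the map shown above for the example tree.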

STEP 4: ENCODING THE TEXT DATA

Using the encoding map from Step 3, we can encode the file's contents into a shorter binary representation. Using the preceding encoding map, the text "ab ab cab" would be encoded as 1011001011000101011011. The following table gives the char-to-binary mapping in more detail. The overall encoded contents of the file require 22 bits, or almost 3 bytes, compared to the original file of 10 bytes.

Char     'a'   'b'   ' '   'a'   'b'   ' '   'c'   'a'   'b'   EOF
Binary   10    11    00    10    11    00    010   10    11    011

Since the character encodings have different lengths, often the length of a Huffman-encoded file does not come out to an exact multiple of 8 bits. Files are stored as sequences of whole bytes, so in cases like this the remaining bits of the last byte are filled with 0s.
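A sketch of Step 4 in C: look up each character's code string and pack the bits into bytes, most-significant bit first, padding the final partial byte with 0s (encode_file is an illustrative name; the codes array comes from the Step 3 sketch):

#include <stdio.h>

void encode_file(FILE *in, FILE *out, char codes[NCHARS][64]) {
    unsigned char byte = 0;     /* bits accumulated so far */
    int nbits = 0;              /* how many of them        */
    int c;

    for (;;) {
        c = fgetc(in);
        /* Emit the code for this character; at end of input, emit the EOF code. */
        const char *code = codes[c == EOF ? PSEUDO_EOF : c];
        for (const char *p = code; *p; p++) {
            byte = (unsigned char)((byte << 1) | (*p - '0'));
            if (++nbits == 8) { /* a full byte: write it out */
                fputc(byte, out);
                byte = 0;
                nbits = 0;
            }
        }
        if (c == EOF)
            break;
    }
    if (nbits > 0)              /* pad the last partial byte with 0s */
        fputc(byte << (8 - nbits), out);
}

For example.txt this writes the 22 bits shown above plus 2 padding 0s, i.e. 3 bytes in total.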

It might worry you that the characters are stored without any delimiters between them, since their encodings can have different lengths and a character's bits can cross a byte boundary (as with the 'a' that starts at the end of the second byte). But this will not cause problems in decoding the file, because Huffman encodings by definition have a useful prefix property: no character's encoding can ever occur as the start of another character's encoding.

DECODING THE FILE

A compressed file would not be very useful unless we have the ability to decompress it again! You can use a Huffman tree to decode text that was previously encoded with its binary patterns. The decoding algorithm is to read each bit from the file, one at a time, and use this bit to traverse the Huffman tree. If the bit is a 0, you move left in the tree. If the bit is a 1, you move right. You do this until you hit a leaf node. Leaf nodes represent characters, so once you reach a leaf, you output that character. For example, suppose we are given the same encoding tree above, and we are asked to decode a file containing the following bits: 1110010001001010011.

Using the Huffman tree, we walk from the root until we find characters, then output them and go back to the root. In each line below, the bits consumed at that step are shown in brackets.

- We read a 1 (right), then a 1 (right). We reach 'b' and output b. Back to the root.  [11]10010001001010011
- We read a 1 (right), then a 0 (left). We reach 'a' and output a. Back to the root.  11[10]010001001010011
- We read a 0 (left), then a 1 (right), then a 0 (left). We reach 'c' and output c.  1110[010]001001010011
- We read a 0 (left), then a 0 (left). We reach ' ' and output a space.  1110010[00]1001010011
- We read a 1 (right), then a 0 (left). We reach 'a' and output a.  111001000[10]01010011
- We read a 0 (left), then a 1 (right), then a 0 (left). We reach 'c' and output c.  11100100010[010]10011
- We read a 1 (right), then a 0 (left). We reach 'a' and output a.  11100100010010[10]011
- We read a 0, then a 1, then a 1. This is our EOF encoding pattern, so we stop.  1110010001001010[011]

The overall decoded text is "bac aca".
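A decoding sketch in C, reusing the HuffNode and PSEUDO_EOF definitions from the earlier sketches. For illustration the bits come from a string of '0'/'1' characters like the one traced above (decode_bits is a hypothetical name; the lab's real decoder reads packed bits from memory instead):

#include <stdio.h>

void decode_bits(const HuffNode *root, const char *bits, FILE *out) {
    const HuffNode *node = root;
    for (const char *p = bits; *p; p++) {
        node = (*p == '1') ? node->right : node->left;     /* 1 = right, 0 = left     */
        if (node->left == NULL && node->right == NULL) {   /* reached a leaf          */
            if (node->ch == PSEUDO_EOF)
                return;                                    /* stop at the EOF pattern */
            fputc(node->ch, out);
            node = root;                                   /* back to the root        */
        }
    }
}

Calling decode_bits(root, "1110010001001010011", stdout) with the example tree prints "bac aca".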

YOUR ASSIGNMENT

Your assignment is to write ARM assembly code to:

1. Read in a given text file containing only letters, periods, and spaces. Perform a histogram analysis of the characters to determine the frequency of occurrences and generate the Huffman code as described.

2. Given an encoded transmitted message, in which the bits are compacted into several 4-byte unsigned integers, decode and print out the original text message.

3. Please provide a Makefile as part of your build.

Example:

For the example above, the secret message would fit into a single 4-byte memory location as:

11100100010010100110000000000000

Note the 13 trailing 0s at the end of the message. You can encode the end of each 4-byte memory location with a bunch of spaces if you want, or you can define an extra character in addition to the letters, period, and spaces to encode an end-of-memory character to mark the end of a 4-byte memory block. The design is up to you. There's no problem with printing out a few spaces between letters.

Your decode.s function should take in as input a linked list of 4-byte unsigned integers containing the secret message:

- decode(r0) - where r0 contains the address of the first node of the linked list

where

r0 -->node1-->node2-->node3-->NULL

node1: 10101101010101010101010001110010 --> node2: 10101101010101010101010001110010 --> node3: 10101101010101010101010001110010 --> NULL

So if a binary code does not fit at the end of a node, it continues into the next node, so that no bits are wasted.
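As one way to picture this, here is a small C sketch of a bit reader over such a linked list, walking each 32-bit word most-significant bit first and letting a code continue across node boundaries. The Word32Node and BitCursor types are assumptions made for this sketch; in the lab itself this logic would live in your ARM decode.s.

#include <stdint.h>
#include <stddef.h>

typedef struct Word32Node {
    uint32_t bits;                 /* 32 message bits, most significant bit first */
    struct Word32Node *next;
} Word32Node;

typedef struct {
    const Word32Node *node;        /* current node in the list                */
    int pos;                       /* next bit position: 0 = MSB ... 31 = LSB */
} BitCursor;

/* Return the next bit (0 or 1), or -1 when the list is exhausted. Nothing is
 * reset at a node boundary, so a code can span two nodes. */
int next_bit(BitCursor *cur) {
    if (cur->node == NULL)
        return -1;
    int bit = (int)((cur->node->bits >> (31 - cur->pos)) & 1u);
    if (++cur->pos == 32) {        /* this word is used up: advance to the next node */
        cur->node = cur->node->next;
        cur->pos = 0;
    }
    return bit;
}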
