COSC 2010 / 2100 Data Structures

Fall 2014

Homework Assignment #7 : Huffman Compression

"If 'compression is the first grace of style,' you have it."
-- Marianne Moore
Due: Tuesday, November 25th, Noon CST
Submit: Turn in your HuffmanRunner.java source file (and any other helper Java source files you devise) using the turnin command on morbius.mscs.mu.edu.
Work may be completed in teams of two.
Be certain to include both partner's names in your file. Try to avoid both partners submitting code -- we have to grade that twice, and only the lower grade will be kept.
You may submit multiple times, but only the last turnin will be kept. The automatic submission system will not accept work after the deadline.

Huffman Coding

Huffman Coding is a lossless encoding scheme that works well for compressing files where a small subset of the symbols occur more frequently.

For this assignment, we will build a Frequency Table that counts the occurrences of each letter of the alphabet in an input file. All letters will be treated as upper case. Any characters that are not an alphabet letter will be ignored.

The Frequency Table can be based upon the sorted array list implementation from previous chapters, combined with simplifications of the Word Frequency case study in Chapter 9 of the text.

Once the Frequency Table is complete, build a Huffman Tree using the frequencies for each letter. The algorithm for building a Huffman Tree can be expressed in this way:

while (there is more than one entry in the work queue)
    Remove the two lowest frequency items from the queue,
    Build a new tree node with the two items as children,
     and their combined frequency as the total,
    Sort the new tree node into the work queue.

The priority queue data structure used in this phase can most easily be constructed as an extension of one of the sorted list classes in the previous chapters, with a new method that allows the lowest item to be dequeued. Characters with the same frequency are initially sorted in alphabetical order.

Once the Huffman Tree is constructed, each individual letter's compression code can be looked up by representing the path from the root of the tree to that letter's leaf node, using '0' bits for left children, and '1' bits for right children.

Input

Input to the Huffman Compressor will consist of a filename given on the command line. Your main program should open this file for reading, calculate the Huffman Code Table, and print compressed file contents.

Example Run #1

Run with a simple input file.
Input File:
AABABC

Output:
Huffman Code Table:
00003 A: 1
00002 B: 00
00001 C: 01

Compressed File Contents:
A: 1
A: 1
B: 00
A: 1
B: 00
C: 01

Compressed 48 bits into 9 bits.
Compression rate 81%

In this example, the initial Frequency Table will be: [A:3, B:2, C:1], and those frequencies are noted in the leftmost column of the Huffman Code Table above.

When sorted into a priority queue by lowest frequency, the queue will thus be: [C:1, B:2, A:3].

The first step of building the Huffman Tree will dequeue C and B, combining them into a new tree node with total frequency 3. Our reference implementation places the first item dequeued into the right child of the new tree node. When sorted back into the priority queue, the new combined node with frequency 3 will come after the node for A with frequency 3 -- an item inserted with equal frequency will go to the back of the group of equal items: [A:3, Node(B,C):3].

The second step of building the Huffman Tree will dequeue A and the BC node, combining them into a new tree node with total frequency 6. When inserted back into the priority queue, this will become the sole item remaining, and construction of the Huffman Tree is complete.

Huffman Tree from Example Run #1

Taking the codes from the tree by reading left branches as '0' and right branches as '1', the final Huffman Code Table has A = 1, B = 00 and C = 01.

The compression figures are calculated by totaling the code length for each letter, and assuming 8 bits per letter in the original encoding.

Example Run #2

Run with a Hello World input file.
Input File:
Hello world!

Output:
Huffman Code Table:
00003 L: 01
00002 O: 000
00001 D: 111
00001 E: 110
00001 H: 101
00001 R: 100
00001 W: 001

Compressed File Contents:
H: 101
E: 110
L: 01
L: 01
O: 000
W: 001
O: 000
R: 100
L: 01
D: 111

Compressed 80 bits into 27 bits.
Compression rate 66%

The priority queue during Huffman Tree construction looks like this:

Step 1. [D:1, E:1, H:1, R:1, W:1, O:2, L:3], dequeue D:1 and E:1

Step 2. [H:1, R:1, W:1, O:2, ED:2, L:3], dequeue H:1 and R:1

Step 3. [W:1, O:2, ED:2, RH:2, L:3] , dequeue W:1 and O:2

Step 4. [ED:2, RH:2, L:3, OW:3], dequeue Node(E,D):2 and Node(R,H):2

Step 5. [L:3, OW:3, RHED:4], dequeue L:3 and Node(O,W):3

Step 6. [RHED:4, OWL:6], dequeue Node(Node(R,H):2, Node(E,D):2):4 and Node(Node(O,W):3, L):6

Step 7. [OWLRHED:10]

Huffman Tree from Example Run #3

Reference Implementation

For your convenience, a reference implementation of the Huffman Compressor has been provided on Morbius in the directory ~brylow/cosc2100/Projects/Huffman/. Several sample input text files, and a template HuffmanRunner.java file has also been provided in that directory.

To execute the reference implementation on a sample input file, login to Morbius and execute the following:

cd ~brylow/cosc2100/Projects/Huffman/
./huffman examplerun1.txt

[Revised 2014 Nov 19 10:50 DWB]