Run-length codes
Run length encoding stands out from other methods of compression. It does not try to reduce the average symbol size like Huffman coding or arithmetic coding, and it doesn't replace strings with dictionary references like Lemple-Ziv style coding. RLE replaces a string of repeated symbols with a single symbol and a count (run length) indicating the number of times the symbol is repeated.
Example:
The string: "aaaabbcdeeeeefghhhij" may be replaced with "a4b2c1d1e5f1g1h3i1j1".
The numbers are in bold to indicate that they are values, not symbols.
If you count the before and after string size, you'll notice that RLE didn't make this string any smaller. Part of the reason is because it now takes a symbol and a value to represent what used to take just single a single symbol.
One of the ways of preventing single symbols from expanding is to only apply coding to runs of more than one and use the run length to indicate the number of additional copies of the encoded symbol are required. Ideally that would give us a string like this: "a3b1cde4fgh2ij".
Now the string is shorter, but we have another problem. How do we know whether a symbol is being followed by another symbol or a run length? We could borrow a page out of LZSS and include an encoded/not encoded flag with each symbol. An other option is to follow each encoded symbol with a designated escape code that will indicate a run length follows.
The escape code solution is pretty easy to handle, except that you need to be able to make sure that the escape code doesn't occur in your uncoded input, or you need to be able to indicate if you're just handling a symbol that is the same as an escape code.
The escape code does not need to be a fixed value, instead an escape code will be inferred whenever a symbol repeats itself. The value that follows the repeated symbol will indicate the number of additional symbols that follow. Using this scheme, the example string is encoded like this: "aa2bb0cdee3fghh1ij".
The downsides to this method of encoding are that one more symbol is required for each run, and runs of length 2 take more space encoded than not encoded. However, based on a few brief tests on files containing C source, the double symbol escape sequence produced better results than no escape sequence at all.
Encoding Strings
Encoding using RLE is fairly simple:
Step 1. Set the previous symbol equal to an unmatchable value.
Step 2. Read the next symbol from the input stream.
Step 3. If the symbol is an EOF exit.
Step 4. Write out the current symbol.
Step 5. If the symbol is an does not match the previous symbol, set the previous symbol to the current symbol, and go to step 2.
Step 6. Read and count additional symbols until a non-matching symbol is found. This is the run length.
Step 7. Write out the run length.
Step 8. Write out the non-matching symbol.
Step 9. Set the previous symbol to the non-matching symbol, and go to step 2.
When actually implementing RLE, a little attention to detail is required in Step 6. The run length is stored in a finite number of bits (I used an unsigned char), so runs longer than the amount that can be counted need to be broken up into to smaller runs. When the maximum count is reached, just store the count value and start the process of looking for a run all over again. You also need to handle the case where a run is ended by an EOF. When a run is ended by an EOF, write out the run length and exit.
That's all there is to it.
Decoding Strings
Decoding is even easier than encoding. Not only are there less steps, but there are no caveats. To decode an encoded stream:
Step 1. Set the previous symbol equal to an unmatchable value.
Step 2. Read the next symbol from the input stream.
Step 3. If the symbol is an EOF exit.
Step 4. Write out the current symbol.
Step 5. If the symbol is an does not match the previous symbol, set the previous symbol to the current symbol, and go to step 2.
Step 6. Read the run length.
Step 7. Write out the current symbol for the amount indicated by the run length.
Step 8. Go to step 1.
|