Lempel-Ziv Algorithms. LZ77 (Sliding Window). Variants: LZSS (Lempel-Ziv- Storer-Szymanski); Applications: gzip, Squeeze, LHA, PKZIP, ZOO. LZ78 ( Dictionary. version of LZ77, called LZSS, and one improved version of LZ78, called LZW. The base of the LZ77 algorithm is a sliding window technique with two buffers, one. CULZSS algorithm proposed in  parallelizes the LZSS algorithm at two levels. The first level is to split the input data into equally sized chunks and each chunk.
|Published (Last):||20 June 2008|
|PDF File Size:||16.80 Mb|
|ePub File Size:||3.72 Mb|
|Price:||Free* [*Free Regsitration Required]|
As with everything else that I’ve written, my LZSS implementation left and still leaves room for improvement. This web page attempts to discuss my implementation and the updates that have sporadically taken place since my original LZSS release. I also toyed with an in-memory implementation of the LZSS algorithm that preforms all encode and decode operations on arrays rather than files.
This implementation might be useful to those developing on systems that do not include a file system. Information on downloading the source code for all of my LZSS implementations may be found here.
LZSS is a dictionary encoding technique. Unlike Huffman coding, which attempts to reduce the average amount of bits required to represent a symbol, LZSS attempts to replace a string of symbols with a reference to a dictionary location for the same string. It is intended that the dictionary reference should be shorter than the string it replaces.
The LZSS algorithm and it’s predecessor LZ77 attempt to compress series of strings by converting the strings into a dictionary offset and string length.
The larger Nthe longer it takes to search the whole dictionary for a match and the more bits will be required to store the offset into the dictionary. Typically dictionaries contain an amount of symbols that can be represented by a whole power of 2. A symbol dictionary would require 9 bits to represent all possible offsets. If you need to use 9 bits, you might as well have a symbol dictionary so that you can have more entries.
Additional new symbols cause an equal number of the oldest symbols to slide out. As stated above, encoded strings are represented as an offset and a length.
Since the dictionary size is fixed at N the largest offset may be N – 1and the longest string matching a series of characters in the dictionary may be N characters. For large Nit’s highly unlikely that you’ll ever read an N symbol string that matches the contents of dictionary, so the maximum allowed match length is often limited. In the examples I have seen, N is typically or and the maximum length allowed for matching strings is typically between 10 and 20 characters.
In their original LZ77 algorithm, Lempel and Ziv proposed that all strings be encoded as a length and offset, even strings with no match. Storer and Szymanski observed that individual unmatched symbols or matched strings of one or two symbols take up more space to encode than they do to leave uncoded. For example, encoding a string from a dictionary of symbols, and allowing for matches of up to 15 symbols, will require 16 bits to store an offset and a length.
A single character is only 8 bits. Using this method, encoding a single character doubles the required storage. Using the above example it now takes 9 bits for each uncoded symbol and 17 bits for each encoded string. Storer and Szymanski also observed that if you’re not encoding strings of length 0 to Mthen M may be used as an offset to the length of the match. Based on the discussion above, encoding input requires the following steps: Initialize the dictionary to a known value.
Read an uncoded string that is the length of the maximum allowable match. Search for the longest matching string in the dictionary.
If a match is found greater than or equal to the minimum allowable match length:. Shift a copy of the symbols written to the encoded output from the unencoded string to the dictionary. Read a number of symbols from the uncoded input equal to the number of symbols written in Step 4. Repeat from Step 3, until all the entire input has been encoded. The encoding process requires that a the dictionary is searched for matches to the string to be encoding.
Decoding an offset and length combination only requires going to a dictionary offset and copying the specified number of symbols. No searching is required. Decoding input requires the following steps: If the flag indicates an encoded string:.
Shift a copy of the symbols written to the decoded output into the dictionary. Repeat from Step 2, until all the entire input has been decoded. Further discussion of LZSS with links to other documentation and libraries may be found at http: All algoritgm my versions use codes that are integral bytes, and each of my newer versions are derived from my earlier versions so there is a lot of common design.
I chose to implement my dictionary as a character cyclic buffer sliding window. I chose characters, because others lzss achieved zlss results with dictionaries of this size, and I can encode offsets of 0 to in 12 bits.
If I encode the offset and length of a string in a 16 bit word, that leaves me with 4 bits for the length. Keeping the algoritym of a 16 bit encoded string in ,zss, and the above requirement for a 12 bit offset, I’m left with 4 bits to encode the string length. With 4 bits, I can encode lengths of 0 through However, as Storer and Szymanski observed, it only takes 1 byte to write out characters that match dictionary strings 0 or 1 character long, and 2 bytes to write out characters that match dictionary strings 2 characters long.
Savings of one or more bytes per string doesn’t occur unless the string is 3 characters or longer.
By not encoding lss that offer less than one byte of savings, the minimum encoded length is three characters. Since the minimum encoded alvorithm is three, I am able to add three to the 4 bit value stored in the length field, resulting in encoded string lengths of 3 to So 18 is the maximum string length encoded by this implementation. I’ve been calling my implementation a modified Aogorithm implementation.
It’s modified because of the way my version 0. Similarly writing an encoded string would require 17 alvorithm, 1 bit for the flag and 16 bits for the code. While I was algorighm the algorithm, I came across some implementations that stored encoded flags in groups of eight followed by algoriyhm characters or encoded strings that they represent.
I don’t know who to credit with the discovery of this technique, but it allowed my version 0. This technique also makes the EOF clear. By processing only bytes, there are no spare bits, os the EOF of the encoded dats is actually the EOF of the encoded file. Don’t worry about it if I lost you on the EOF discussion, just relax, because there’s no need to handle it specially. After writing my version 0. As stated above, one of my goals was to provide an LZSS implementation that is easy to read and easy to debug.
So, to keep things simple, my first implementation used a brute force sequential search to match the string to be encoded to strings in the dictionary. Through experimentation and reading, I’ve discovered that the methods used for string matching significantly impact encoding time. Other’s have successfully improved the time required to find matching strings using the following techniques:. I have already experimented with some of these techniques and plan to experiment with others as time allows.
c – understanding this LZSS based decompression algorithm – Stack Overflow
The rest of this section documents some of what I have tried so far. The first search I implemented was a sequential search. I simply start at the beginning of the sliding window dictionary and do a character by character comparisons with the string being encoded. If the first characters match, I check the characters that follow. If I don’t have a match, I move to the next character in the sliding window. The source code implementing a sequential search is contained in the version 0.
The logic is pretty simple, but may require a number of comparisons.
The addition of code implementing the KMP algorithm is a relatively new one version 0. By the time I got around to including it Wikipedia had a reasonable description as well as pseudocode that I could reference.
LZSS (LZ77) Discussion and Implementation
However KMP attempts to use some information about the string we’re attempting to find a match for and the comparisons already made in order skip some comparisons that must fail. The initial pass, will start by comparing string to dictionary and fail when it compares string to dictionary. The brute force sequential search will resume by comparing string to dictionary but we know this will fail because string matches dictionary and string and string are not the same.
KMP is smart enough to skip ahead and resume by comparing string to dictionarywhich agorithm to be a match in this example. The KMP algorithm requires that the string being searched aogorithm be preprocessed. The preprocessing generates a look-up table used to determine how far back from the failed comparison, the algorithm must go to resume comparisons.
The lzsss table is know as a “partial match table” or a “failure function”.
LZSS Compression Functions
The partial match table for our example string is depicted below:. When the example above failed to match string to dictionaryposition 5 of the partial match table was used to determine that search should fallback 2 to dictionary.
The source code implementing the KMP algorithm is contained in the file kmp. The majority of the code follows the outline of the pseudocode provided by Wikipedia.
The sequential search algorithm moves through the dictionary one character at a time, checking for matches to the first character in the string to be encoded. Assuming equal distribution of characters, there’s a 1 in alphabet size chance of characters matching. One approach to ensuring that the first character of the string being encoded matches the dictionary entry it is being compared to is to keep a list of all dictionary entries that start with a character for each character in the alphabet.
For example, there would be one list of all the entries that start with ‘A’, and another of all the entries that start with ‘B’. In the case of a character alphabet, lists must be maintained. Since the dictionary is a sliding window of the last characters encoded by the algorithm, the lists of strings starting with a given character must be updated as old characters are removed from the dictionary and new characters are added to the dictionary. Linked lists are a natural choice for implementing such dynamic lists.
The additional memory overhead required to implement a collection of linked lists is one pointer to the head of each of the lists, plus one pointer to the next character in the list for each character in the dictionary. The source code implementing a linked list search is contained in the version 0.
In my implementation, all pointers list head and next are int indices into the sliding window dictionary. After implementing string matches with linked listsit seemed like a wasn’t much effort to try matching using hash tables. In actuality, the linked lists approach, lzsa a solution involving hash tables where the hash key is the first character of the string and collisions are stored in a linked list.
If that’s all I wanted, I’d be done.