Optimal Parsing for dictionary text compression

Abstract : Dictionary-based compression algorithms include a parsing strategy that transforms the input text into a sequence of dictionary phrases. For a given text, this process is usually not unique and, for compression purposes, it makes sense to find, among the possible parsings, one that minimizes the final compression ratio. This is the parsing problem. An optimal parsing is a parsing strategy or a parsing algorithm that solves the parsing problem taking into account all the constraints of a compression algorithm or of a class of homogeneous compression algorithms. Such constraints are, for instance, the dictionary itself, i.e. the dynamic set of available phrases, and how much a phrase weighs on the compressed text, i.e. the length of the codeword that represents the phrase, also called the cost of encoding a dictionary pointer. In more than 30 years of history of dictionary-based text compression, plenty of algorithms, variants and extensions have appeared, and this approach to text compression has become one of the most appreciated and widely used in almost all storage and communication processes; yet only a few optimal parsing algorithms have been presented. Many compression algorithms still lack optimality of their parsing or, at least, a proof of optimality. This happens because there is no general model of the parsing problem that includes all dictionary-based algorithms, and because the existing optimal parsings work under too restrictive hypotheses. This work focuses on the parsing problem and presents both a general model for dictionary-based text compression, called Dictionary-Symbolwise theory, and a general parsing algorithm that is proved to be optimal under some realistic hypotheses.
This algorithm is called Dictionary-Symbolwise Flexible Parsing and it covers almost all dictionary-based text compression algorithms, together with the large class of their variants in which the text is decomposed into a sequence of symbols and dictionary phrases. In this work we further consider the case of a free mixture of a dictionary compressor and a symbolwise compressor; our Dictionary-Symbolwise Flexible Parsing also covers this case. We thus have an optimal parsing algorithm for dictionary-symbolwise compression where the dictionary is prefix-closed and the cost of encoding a dictionary pointer is variable. The symbolwise compressor can be any classical one that works in linear time, as many common variable-length encoders do. Our algorithm works under the assumption that a special graph, described later in the work, is well defined. Even if this condition is not satisfied, the same method can be used to obtain almost optimal parses. In detail, when the dictionary is LZ78-like, we show how to implement our algorithm in linear time. When the dictionary is LZ77-like, our algorithm can be implemented in O(n log n) time. Both have O(n) space complexity. Although the main aim of this work is theoretical, some experimental results are presented to highlight practical effects of parsing optimality on compression performance, and more detailed experiments are reported in a dedicated appendix.
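To make the parsing problem concrete, the sketch below treats a parse as a path in a graph whose nodes are text positions and whose edges are either dictionary phrases or single symbols, each weighted by its encoding cost. This is a common textbook formulation and not the Dictionary-Symbolwise Flexible Parsing algorithm itself: it assumes a small static, prefix-closed dictionary (the thesis deals with dynamic ones), and the function names `optimal_parse` and `greedy_parse`, the example dictionary, and the bit costs are all hypothetical choices made for illustration.

```python
from math import inf


def optimal_parse(text, dictionary, phrase_cost, symbol_cost):
    """Minimum-cost parse of `text` via a shortest path over text positions.

    Assumption: `dictionary` is a static, prefix-closed set of strings.
    Returns (total_cost, [(kind, substring), ...]) with kind "dict" or "sym".
    """
    n = len(text)
    max_len = max((len(p) for p in dictionary), default=0)
    best = [inf] * (n + 1)   # best[j] = cheapest cost to encode text[:j]
    back = [None] * (n + 1)  # back[j] = (start position, kind) of last edge
    best[0] = 0
    for i in range(n):
        # Symbolwise edge i -> i + 1 is always available.
        cost = best[i] + symbol_cost(text[i])
        if cost < best[i + 1]:
            best[i + 1], back[i + 1] = cost, (i, "sym")
        # Dictionary edges i -> j, one per phrase text[i:j].
        for j in range(i + 1, min(n, i + max_len) + 1):
            phrase = text[i:j]
            if phrase not in dictionary:
                break  # valid only because the dictionary is prefix-closed
            cost = best[i] + phrase_cost(phrase)
            if cost < best[j]:
                best[j], back[j] = cost, (i, "dict")
    # Walk the back pointers to recover the parse.
    parse, j = [], n
    while j > 0:
        i, kind = back[j]
        parse.append((kind, text[i:j]))
        j = i
    parse.reverse()
    return best[n], parse


def greedy_parse(text, dictionary, phrase_cost, symbol_cost):
    """Baseline: always take the longest dictionary phrase, else one symbol."""
    total, parse, i, n = 0, [], 0, len(text)
    while i < n:
        j = i
        while j < n and text[i:j + 1] in dictionary:
            j += 1
        if j > i:
            total += phrase_cost(text[i:j])
            parse.append(("dict", text[i:j]))
            i = j
        else:
            total += symbol_cost(text[i])
            parse.append(("sym", text[i]))
            i += 1
    return total, parse


if __name__ == "__main__":
    # Hypothetical prefix-closed dictionary and bit costs, for illustration only.
    example_dictionary = {"a", "ab", "b", "bc", "bcd", "c", "d"}
    pointer_bits = lambda phrase: 12  # assumed fixed pointer cost
    literal_bits = lambda symbol: 9   # assumed literal cost (8 bits + flag)
    print(greedy_parse("abcd", example_dictionary, pointer_bits, literal_bits))
    print(optimal_parse("abcd", example_dictionary, pointer_bits, literal_bits))
```

With these assumed costs, the greedy parse of "abcd" is ab, c, d at 36 bits, while the shortest path encodes a as a symbol and bcd as a phrase for 21 bits. This simple version examines every dictionary edge at each position; the linear and O(n log n) bounds stated in the abstract refer to the thesis's own algorithm, not to this sketch.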
Document type :
Theses

Cited literature [37 references]

https://tel.archives-ouvertes.fr/tel-00804215
Contributor : Abes Star
Submitted on : Monday, March 25, 2013 - 11:02:10 AM
Last modification on : Wednesday, April 11, 2018 - 12:12:02 PM
Long-term archiving on : Wednesday, June 26, 2013 - 4:01:26 AM

File

TH2012PEST1091_complete.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-00804215, version 1

Citation

Alessio Langiu. Optimal Parsing for dictionary text compression. Other [cs.OH]. Université Paris-Est, 2012. English. ⟨NNT : 2012PEST1091⟩. ⟨tel-00804215⟩
