Home -> Online Articles -> Data Compression Techniques and Formats

11-01-2007

Data Compression Techniques and Formats
Compression Theory Older Than the Computer Era

For as long as we have had computers data compression has been an important issue. Essentially there are two reasons for data compression: compressed data require less disk space and communication of compressed data is faster. Already in the early days of computers file size was an issue, which required an efficient use of limited disk space. In modern times disk space seems unlimited, but for many types of data the file size can be huge. Communicating and using these files requires efficient storage and thus data compression.

By Robin Wevers


Claude Shannon created a foundation for modern information theory. In 1948 Shannon published his ideas in ‘A Mathematical Theory of Communication’. Copyrights: Lucent Technologies.
History
In 1948 Claude E. Shannon (1916-2001) presented his paper ‘A Mathematical Theory of Communication’. In this paper he formulated his theory of data compression. Shannon established that there is a fundamental limit to lossless data compression, which is called the entropy rate. The value of the entropy rate depends on the nature of the information source. It is impossible to compress data with a compression rate that is higher than the entropy rate without losing information.

Shannon also developed the theory of lossy data compression. In lossy data compression, the decompressed data does not have to be exactly the same as the original data. The theory about data compression sets fundamental limits on the performance of all data compression algorithms.

Mathematics and Computing
It requires an efficient algorithm to achieve a high compression ratio. In mathematics and computing, an algorithm is a set of well-defined instructions for accomplishing some task which, given an initial state, will terminate in a defined end-state. The quality of the algorithm depends on the instructions. Algorithms that are not defined properly can lead to incorrect results. An obvious example of an algorithm is a computer program. A computer program is a set of instructions in a specific order, designed for a specific task. In literature a number of types of algorithms are mentioned. Most of these algorithms are intended for problem-solving: a greedy algorithm for example at each stage chooses what looks best at that moment without considering all parameters. It progresses in a fashion making one greedy choice after another iteratively reducing each given problem into a smaller one. An example would be when you are in the mountains and want to climb the highest mountain: from your position in the valley you look around for the highest mountain you can see and start climbing. Only when you are near the top you notice a higher mountain that was earlier hidden from your view.  A greedy algorithm not always finds the best solution. This is the main difference with a dynamic programming algorithm, which is exhaustive and guaranteed to find the optimum solution. Another well-known algorithm is the brute force algorithm: a problem-solving technique, which consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the requirements. Brute-force search is simple to implement, and will always find a solution if it exists. Many other types are mentioned in literature. Examples can be found on http://en.wikipedia.org/wiki/Algorithm.

Algorithms for Data Compression
Data compression obviously requires an algorithm to compress data in such a way that the data can later be recovered by a decompression algorithm. This requires a different type of algorithm, for example the Huffman code, a lossless encoding scheme developed in 1952 by David Huffman for compressing text. The Huffman code uses variable length codes for encoding symbols (such as text characters). The variable-length code is based on the estimated probability of occurrence for each possible value of the source symbol. In this algorithm the code for ‘e’ will be short, whereas the code for ‘q’ will be long, thus ensuring a compact encoding of text. Run-length encoding (RLE) is another simple form of data compression in which identical sequences of data are stored as a single data value and count, rather than as the original data. For example a black and white image where all values are either 0 (black) or 1 (white) you may have a string that looks like “1111111111111100000000000000011111111111”. This can be stored more efficiently by: 14W(hite)-15B(lack)-11W(hite). This is most useful on data that contains many long sequences: for example simple graphic images such as icons and line drawings. An interesting mathematical approach to compression is presented by the wavelet theory. Wavelets are functions that satisfy certain mathematical rules and are used in representing data or other functions. In the early 1800's Joseph Fourier discovered that he could superpose sines and cosines to represent other functions, leading to the well-known Fourier transforms. Wavelet algorithms process data at different scales or resolutions. Wavelets are well-suited for approximating data with sharp discontinuities. Wavelet analysis allows you to choose the best wavelets adapted to your data and ‘sparsely’ represent the data. This ‘sparse coding’ makes wavelets an excellent tool in the field of data compression.

Lossless and Lossy
The most important characteristic of a data compression technique is whether the algorithm is lossless or lossy. Lossless compression is a compression technique that does not lose any data in the compression process. Lossless compression is convenient for transferring files across the Internet, as smaller files are transferred faster. Lossless compression is also useful for storing files because they take up less space. The advantage of lossless compression clearly is that the compressed file will decompress to an exact duplicate of the original file. The disadvantage is that the compression ratio is usually not very high and can theoretically even result in bigger files. To get a higher compression ratio requires a lossy compression. Because of the data loss, only certain applications are fit for lossy compression, like some graphics, audio and video. Lossy compression reduces the quality of the file, but depending on the application, the loss may be acceptable or even unnoticeable.

GeoTIFF
GeoTIFF represents an effort by over 160 different companies and organizations in remote sensing, GIS, cartographic, and surveying to establish a TIFF based interchange format for georeferenced raster imagery. TIFF is the most popular data format in the world today for the storage, transfer and display of raster images, such as clipart, logotypes and scanned documents. Today, TIFF is also being used for storage of map information. The TIFF imagery file format can be used to store and transfer digital satellite imagery, aerial photos, elevation models, scanned maps or the results of geographic analyses. Over the past several years many users of such images have requested geographic data suppliers to provide imagery in TIFF format. TIFF is a public domain format, which is capable of supporting compression, tiling, and can be extended to include geographic metadata. GeoTIFF is the implementation of the geographic metadata. GeoTIFF concerns TIFF files that have geographic information embedded in the TIFF-file. Using this geographic information the image can be positioned in the correct location. GeoTIFF is a metadata format, which provides both the image data and the associated geographic information. The TIFF file structure allows both the metadata and the image data to be encoded into the same file.

The GeoTIFF is an open, public domain format. It is freely available to software companies; product managers can for each product decide to include GeoTIFF capabilities. As a result many systems today can read GeoTIFF files into the correct geographic position: many Geographic Information Systems, Image Processors and Computer Aided Design applications.

GeoPDF
GeoPDF is a proprietary extension to the Adobe PDF file format, from TerraGo Technologies. GeoPDF is a mapping format that uses the Adobe document standard to let you share georegistered maps. The geo-extension adds a coordinate transformation matrix and other metadata to allow transformation of PDF coordinates to a projected Cartesian coordinate system. GeoPDFs often include other advanced PDF features such as layers and object data which can add significant GIS functionality to the file, particularly when used with the TerraGo Technologies plugin to Adobe Reader. The TerraGo Technologies GeoPDF utilizes a method for embedding cartographic data within a PDF. This is based on the ability to transform coordinates in PDF space to coordinates in the Cartesian coordinate system which represents a map projection, and from those coordinates to the latitude and longitude of the spheroid described by a named datum and projection. The reverse transformations are also possible. The method for accomplishing these transformations is generalized such that a single PDF may support multiple map frames within a single PDF. The identification of map frames and the actual transformations are accomplished by a plugin, which reads the data and performs the required mathematics.


According to the theory about data compression there are fundamental limits on the maximum compression rate that can be achieved without loss of information.
MrSID
MrSID (pronounced Mister Sid) is an acronym that stands for Multiresolution Seamless Image Database. MrSID is an image compressor, viewer, and file format for extremely large raster graphics images. MrSID, created by LizardTech, works by putting together hundreds of small image tiles into one large seamless image that can be compressed and decompressed with little or no degradation. A problem that was common with large raster images was the long amount of time it took to open these files. The creators of MrSID overcame that problem by giving the user the ability to decompress just the portion of the image they wanted to view. The user can move from one part of the image to another quickly without having to wait for the entire image to decompress. Another problem that is associated with large raster images was the amount of storage space they require. MrSID uses a patented ‘wavelet’ technology first developed at the National Research Laboratories at Los Alamos to achieve a compression ratio of 20:1 for greyscale images and 50:1 for full colour images. A satellite photo from space that once required the storage space of forty CD-ROMs, can now be compressed and stored on one CD-ROM. The wavelet technology used by MrSID relies on advanced mathematical algorithms both to compress an image and build the viewer. MrSID technology uses lossless wavelet compression to create an initial image. Then the encoder divides the image into zoom levels, subbands and bitplanes. After the initial encoding, the image creator can apply zero or more optimizations. While 2:1 compression ratios may be achieved losslessly, higher compression rates are lossy much like JPEG-compressed data. Since the image remains geometrically accurate after compression, it can be geo-referenced before compression or overlaid with other referencing data. MrSID files are a binary MIME type, that is well suited to be sent over the Internet. Only the requested zoom and area of the image is sent to the browser. MrSID was developed primarily for Geographic Information Systems, but is also used in photography, mapping, document management, medical imaging, and games. LizardTech offers a software package called GeoExpress to read and write MrSID files. They also provide a free web browser plug-in. Most major GIS software packages can read MrSID files including those from ESRI, Bentley Systems, MapInfo and Autodesk.


An example of a wavelet.
Winzip, JPEG, MP3
A well-known compression methodis Winzip. which is widely used for compressing data files or program files. Clearly Winzip is a lossless algorithm; when these files are decompressed, all bytes must be present to ensure their integrity. If bytes are missing from a program, it won't run. If bytes are missing from a data file, it will be incomplete and unreliable. Joint Photographic Experts Group (JPEG) was designed for compressing either full-color or gray-scale images. It works well on photographs and similar material, but not so well on lettering or line drawings. JPEG is a lossy compression. Converting a GIF-file to JPEG will usually reduce its size, but it will also reduce the quality to some extent. JPEG is designed to exploit known limitations of the human eye, notably the fact that small colour changes are perceived less accurately than small changes in brightness. Thus, JPEG is suitable for compressing images that will be looked at by humans. JPEG is not suitable for images that will be analysed automatically: the small errors introduced by JPEG may cause problems, even if they are invisible to the eye. A difference between JPEG and GIF is that JPEG stores 24 bits per pixel (16 million colours): GIF, the other widely used image format on the net, can only store 8 bits per pixel (256 or fewer colours). Full-colour monitors are getting cheaper all the time and JPEG images look better than GIFs on such hardware. It seems likely that GIF will become obsolete over the next couple of years. JPEG handles only still images, but there is a related standard called Motion Pictures Experts Group (MPEG) for motion pictures. MPEG was developed for dealing with digital video; since most video also contains audio, MP3 was developed as an audio extension of that work. A standard WAV-file that is converted to a MP3 file will usually be much smaller, since MP3 employs a lossy, high-compression algorithm that eliminates much of the data. MP3-compression uses the fact that the human ear cannot hear soft sounds in the presence of loud sounds having a similar frequency. The compression makes the resulting file much smaller so that several dozen MP3 files can fit, for example, on a single compact disk. The sound quality of the MP3 file will be slightly lower than the original WAV-file, but only an experienced ear will notice the difference.

Conclusion
The question now is of course which compression technique and format to use for geographic information. Vital for answering this question is whether or not a loss of information is acceptable and the importance of achieving a high compression rate. For example program files and data that will be used for detailed analyses allow no loss of information, whereas files that will be used solely for publications or presentations this loss is not important. When a loss of information is acceptable JPEG can achieve a relatively high rate for map information and can still be adequate for publications. For many other applications where information loss cannot be tolerated GeoTIFF and MrSID are good options, where MrSID will usually give the best compression ratio.

Robin Wevers (r.r.wevers@freeler.nl) is a freelance writer of geo-ict-articles. More information can be found at www.remotesensing.org/geotiff/geotiff.htmland http://en.wikipedia.org/wiki/Compressionand www.faqs.org/faqs/compression-faq/part2/index.html The paper by Claude Shannon can be found at:
http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf.

Back to Homepage

Back to Online Articles