File Compression and Archival

09 Jul 2023 by Felix Kinaro

Overview

File compression reduces the size of a file or a group of files for faster transmission across a network, more efficient use of disk space on a server, or long-term storage. Web servers also use compression to send smaller payloads to clients than they would with uncompressed files; this usually has to be enabled in the server's configuration files.

Types of File Compression

i. Lossless file compression: The extracted data is an exact replica of the source data. This is the kind of compression performed by shell utilities like gzip, bzip2, xz, and zip.
ii. Lossy file compression: Some data is discarded to achieve a higher compression ratio, so the resulting file is not an exact match of the original. This is mostly found in application-specific formats, such as pictures saved as JPEG.
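
The lossless property can be checked directly. The following is a minimal sketch, assuming gzip, seq, and cmp are available; the file names sample.txt and original.txt are made up for the demonstration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# sample.txt is a hypothetical test file holding the numbers 1 to 1000.
seq 1 1000 > "$workdir/sample.txt"
cp "$workdir/sample.txt" "$workdir/original.txt"

# Compress (sample.txt becomes sample.txt.gz), then decompress again.
gzip "$workdir/sample.txt"
gunzip "$workdir/sample.txt.gz"

# cmp -s is silent and succeeds only if both files are byte-for-byte equal.
if cmp -s "$workdir/original.txt" "$workdir/sample.txt"; then
    result="identical"
else
    result="different"
fi
echo "Round trip result: $result"

rm -rf "$workdir"
```

A lossy format offers no such guarantee, which is why it is reserved for data such as images where small differences are acceptable.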

Common Linux File Compression and Archiving Utilities

Compression algorithms are mathematical techniques used to compress files. They work by identifying repetitive patterns and encoding them more efficiently.
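
To see how much repetition matters, one can compress a highly repetitive file and a random one of the same size. This is only an illustrative sketch; the file names are hypothetical and the exact byte counts will vary between systems.

```shell
workdir=$(mktemp -d)

# 1 MiB of zero bytes (one pattern repeated) and 1 MiB of random bytes.
head -c 1048576 /dev/zero    > "$workdir/zeros.bin"
head -c 1048576 /dev/urandom > "$workdir/random.bin"

# -c writes the compressed data to stdout, leaving the input files in place.
gzip -c "$workdir/zeros.bin"  > "$workdir/zeros.bin.gz"
gzip -c "$workdir/random.bin" > "$workdir/random.bin.gz"

zeros_gz=$(wc -c < "$workdir/zeros.bin.gz")
random_gz=$(wc -c < "$workdir/random.bin.gz")

echo "zeros:  1048576 bytes -> $zeros_gz bytes"
echo "random: 1048576 bytes -> $random_gz bytes"

rm -rf "$workdir"
```

The file of zeros should shrink to a tiny fraction of its size, while the random file should stay at roughly 1 MiB because it has no patterns to exploit.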

tar:

tar is not a compression algorithm but a utility commonly used with compression algorithms.
It creates an archive of multiple files and directories but doesn't compress them by default. It is used in conjunction with compression algorithms like gzip, bzip2, or xz to create compressed tar archives (e.g., ".tar.gz", ".tar.bz2", ".tar.xz").
This combination allows for bundling files together and compressing them simultaneously.
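
As a rough sketch of that workflow, the commands below create, list, and extract a gzip-compressed tar archive. The directory name project and its contents are invented for the example.

```shell
workdir=$(mktemp -d)

# Build a small hypothetical directory tree to archive.
mkdir -p "$workdir/project/docs"
echo "readme" > "$workdir/project/README"
echo "notes"  > "$workdir/project/docs/notes.txt"

# c = create, z = filter through gzip, f = name of the archive file.
# -C switches to $workdir first so the archive stores relative paths.
tar -czf "$workdir/project.tar.gz" -C "$workdir" project

# t = list the archive contents without extracting anything.
listing=$(tar -tzf "$workdir/project.tar.gz")
echo "$listing"

# x = extract, here into a separate directory.
mkdir "$workdir/restore"
tar -xzf "$workdir/project.tar.gz" -C "$workdir/restore"
restored=$(cat "$workdir/restore/project/docs/notes.txt")
echo "Restored content: $restored"

rm -rf "$workdir"
```

With GNU tar, replacing z with j selects bzip2 and J selects xz, giving ".tar.bz2" and ".tar.xz" archives respectively.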

gzip:

gzip (GNU zip) is one of the most commonly used compression utilities in Linux.
It uses the DEFLATE algorithm, which combines the LZ77 algorithm and Huffman coding.
gzip provides relatively fast compression and decompression speeds.
It is widely supported and compresses each file individually, producing a file with the ".gz" extension. It does not bundle multiple files into one archive by itself, which is why it is usually paired with tar.
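
gzip accepts a compression level from -1 (fastest) to -9 (smallest output). The sketch below compares the two extremes on a hypothetical text file; the actual sizes depend on the input and the gzip version.

```shell
workdir=$(mktemp -d)

# numbers.txt is a made-up, fairly compressible text file.
seq 1 200000 > "$workdir/numbers.txt"

# -c writes to stdout so the same input can be compressed twice.
gzip -1 -c "$workdir/numbers.txt" > "$workdir/fast.gz"
gzip -9 -c "$workdir/numbers.txt" > "$workdir/best.gz"

original_size=$(wc -c < "$workdir/numbers.txt")
fast_size=$(wc -c < "$workdir/fast.gz")
best_size=$(wc -c < "$workdir/best.gz")

echo "original: $original_size bytes"
echo "gzip -1:  $fast_size bytes"
echo "gzip -9:  $best_size bytes"

rm -rf "$workdir"
```

When no level is given, gzip uses -6, which is a compromise between speed and size.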

bzip2:

bzip2 is another popular compression utility used in Linux.
It employs the Burrows-Wheeler block sorting algorithm and a variant of the Move-to-Front algorithm, followed by Huffman coding.
bzip2 typically provides better compression ratios than gzip, but at the cost of slower compression and decompression speeds.
Files compressed with bzip2 have the ".bz2" extension.

xz:

xz is a compression utility that uses the LZMA2 algorithm.
It offers excellent compression ratios, typically better than both gzip and bzip2.
However, xz is generally the slowest of the three at compressing. Its decompression is slower than gzip but usually faster than bzip2.
Files compressed with xz have the ".xz" extension.
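
A quick way to compare the three tools is to run each one on the same input. This is a sketch under the assumption that at least gzip is installed; bzip2 and xz are skipped if missing, and numbers.txt is a hypothetical test file. Results on real data will differ.

```shell
workdir=$(mktemp -d)
seq 1 200000 > "$workdir/numbers.txt"

original_size=$(wc -c < "$workdir/numbers.txt")
echo "original: $original_size bytes"

tested=0
for tool in gzip bzip2 xz; do
    # Skip tools that are not installed on this system.
    if ! command -v "$tool" > /dev/null 2>&1; then
        echo "$tool: not installed, skipping"
        continue
    fi
    # All three tools accept -c to write the compressed data to stdout.
    "$tool" -c "$workdir/numbers.txt" > "$workdir/numbers.$tool"
    size=$(wc -c < "$workdir/numbers.$tool")
    echo "$tool: $size bytes"
    tested=$((tested + 1))
done

rm -rf "$workdir"
```

Prefixing each compression command with time gives a rough idea of the speed trade-off as well.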

Generating and Compressing a Sample File Using Gzip

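Open a terminal and type the following. We will use dd to generate a 1 GiB file containing random data in the current directory.
Generate a file using dd:

dd if=/dev/urandom of=large-file bs=1M count=1024 iflag=fullblock

A block size of 1M with count=1024 is used because a single very large read from /dev/urandom can return fewer bytes than requested; iflag=fullblock makes dd keep reading until each block is full.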

Compress the file:
In the simplest form, we use

gzip large-file

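This will compress large-file and replace it with large-file.gz in the same directory. Use ls -lh to view the size of the resulting archive and compare it with the initial size. Because random data contains almost no repeated patterns, expect the size to stay about the same here; files such as text or logs shrink far more.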
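
To keep the original file next to the archive and read the ratio directly, something like the following can be used. It is a sketch that assumes GNU gzip 1.6 or later (for -k); sample.txt is a made-up, compressible stand-in for the large file.

```shell
workdir=$(mktemp -d)
seq 1 100000 > "$workdir/sample.txt"

# -k keeps sample.txt instead of replacing it with sample.txt.gz.
gzip -k "$workdir/sample.txt"

# Both files are now present and can be compared by size.
ls -lh "$workdir"

# -l lists the compressed size, uncompressed size, and ratio of an archive.
gzip -l "$workdir/sample.txt.gz"

kept_original=no
[ -f "$workdir/sample.txt" ] && kept_original=yes
made_archive=no
[ -f "$workdir/sample.txt.gz" ] && made_archive=yes

rm -rf "$workdir"
```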

Extracting Compressed Files

To extract our compressed file, we can use gzip -d large-file.gz or gunzip large-file.gz. Either command recreates the original file without the .gz extension and removes the compressed file.
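
Before extracting, an archive can be verified or read without touching the file on disk. The sketch below uses a small hypothetical file, note.txt, to illustrate.

```shell
workdir=$(mktemp -d)
echo "hello archive" > "$workdir/note.txt"
gzip "$workdir/note.txt"

# -t tests the integrity of the archive without writing any output.
if gzip -t "$workdir/note.txt.gz"; then
    status="ok"
else
    status="corrupt"
fi
echo "Integrity check: $status"

# -d with -c decompresses to stdout, so the .gz file stays in place.
content=$(gzip -dc "$workdir/note.txt.gz")
echo "Content: $content"

archive_kept=no
[ -f "$workdir/note.txt.gz" ] && archive_kept=yes

rm -rf "$workdir"
```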

This is a brief overview. The Linux man pages describe in detail how to use each utility to create and extract archives, as well as how to tune compression levels for more efficient space utilization.
