# - Acharya T.

ISBN 0-471-48422-9

**Download**(direct link)

**:**

**7**> 8 9 10 11 12 13 .. 100 >> Next

E(A) <L< E(A) + e

The left-hand inequality must be satisfied for any uniquely decodable code for the block of n source symbols.

The Noiseless Source Coding Theorem states that any source can be losslessly encoded with a code whose average number of bits per source symbol is arbitrarily close to, but not less than, the source entropy E in bits by coding infinitely long extensions of the source. Hence, the noiseless source coding theorem provides us the intuitive (statistical) yardstick to measure the information emerging from a source.

1.3.2.1 Example: We consider a discrete memoryless source with alphabet A\ = {a,/3,7,5} and the associated probabilities are p(a) = 0.65, p(/3) =

0.20, p(7) = 0.10, p(6) = 0.05 respectively. The entropy of this source is E — —(0.651og2 0.65 + 0.20log2 0.20 + 0.10log2 0.10 + 0.05log2 0.05), which is approximately 1.42 bits/symbol. As a result, a data sequence of length 2000 symbols can be represented using approximately 2820 bits.

Knowing something about the structure of the data sequence often helps to reduce the entropy estimation of the source. Let us consider that the numeric data sequence generated by a source of alphabet A2 = {0,1,2,3} is D — 0112333333333222333 3, as an example. The probability of appearance of the symbols in alphabet A2 are p(0) = 0.05, p(l) = 0.10, p(2) = 0.20, and p(3) = 0.65 respectively. Hence the estimated entropy of the sequence D is E = 1.42 bits per symbol. If we assume that correlation exists between two consecutive samples in this data sequence, we can reduce this correlation by simply subtracting a sample by its previous sample to generate the residual values r* = Si — Sj_i for each sample Sj. Based on this assumption of the model, the sequence of residuals of the original data sequence is 5 = 0101100000000 -1 00100 0, consisting of three symbols in a modified alphabet A2 = {—1,1,0}. The probability of occurrence of the symbols in the new alphabet A are P(—1) = 0.05, p(l) = 0.2, and p(0) = 0.75 respectively as computed by the number of occurrence in the residual sequence. The estimated entropy of the transformed sequence

8 INTRODUCTION TO DATA COMPRESSION

is E = -(0.05 log2 0.05 + 0.2 log2 0.2 + 0.75 log2 0.75) = 0.992 (i.e., 0.992 bits/symbol).

The above is a simple example to demonstrate that the data sequence can be represented with fewer numbers of bits if encoded with a suitable entropy encoding technique and hence resulting in data compression.

1.3.3 Unique Decipherability

Digital representation of data in binary code form allows us to store it in computer memories and to transmit it through communication networks. In terms of length of the binary codes, they can be fixed-length as shown in column A of Table 1.1 with alphabet {a,/?, 7, <5}, as an example, where all the symbols have been coded using the same number of bits. The binary codes could also be variable-length codes as shown in columns B or C of Table 1.1 in which the symbols have different code lengths.

Table 1.1 Examples of Variable-Length Codes

Symbol A B c

Q 00 0 0

P 01 10 1

7 10 110 00

5 11 111 01

Consider the string S = aa'jaPaS. The binary construction of the string S using variable-length codes A, B, and C is as follows:

CA(S) = 00001000010011

CB(S) = 001100100111

Cc{S) = 000001001.

Given the binary code CA(S) = 00001000010011, it is easy to recognize or uniquely decode the string S = aa'yaj3aS because we can divide the binary string into nonoverlapping blocks of 2 bits each and we know that two consecutive bits form a symbol as shown in column A. Hence the first two bits “00” form the binary code for the symbol a, the next two bits “00” is similarly mapped to the symbol a, the following two bits “10” can be mapped to symbol 7, and so on. We can also uniquely decipher or decode the binary code Cb{S) — 001100100111 because the first bit (0) represents the symbol a; similarly the next bit (0) also represents the symbol a according to the code in column B. The following three consecutive bits “110” uniquely represent the symbol 7. Following this procedure, we can uniquely reconstruct the string

S = qq7q/3q<5 without any ambiguity.

CLASSIFICATION OF COMPRESSION ALGORITHMS

9

But deciphering the binary code Cc(S) = 000001001 is ambiguous because it has many possibilities—077/37/?, a'yaS'y/3, or aaaaa/37/3 to name a few. Hence the code Cc(S) = 000001001 is not uniquely decipherable using the code in column C in Table 1.1.

It is obvious that the fixed-length codes are always uniquely decipherable. But not all the variable-length codes are uniquely decipherable. The uniquely decipherable codes maintain a particular property called the prefix property. According to the prefix property, no codeword in the code-set forms the prefix of another distinct codeword [5]. A codeword C = c^cic?- • -c^-i of length k is said to be the prefix of another codeword D = d0di ■ ■ -dm^\ of length m if Cj = di for all i = 0,1, • • •, k — 1 and k<m.

**7**> 8 9 10 11 12 13 .. 100 >> Next