refactorman

# Exploring the Discrete Cosine Transform

A discrete cosine transform (DCT) expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies.

The DCT is used extensively in lossy audio and image compression. It can be found in MP3 audio compression, JPEG image compression, and MPEG-1/2 video compression. Let’s take a closer look the fundamentals of how a DCT is practically used.

## TL;DR

For JPEG and MPEG video, the formal definition of an 8×8 DCT:

Where:

Implemented in JavaScript as:

And the corresponding inverse DCT: The DCT-III, also known as the inverse DCT

Where:

Also implemented in JavaScript as:

These JavaScript implementations are naïve implementations, i.e. that are quite computationally expensive and unoptimized. They are, however, relatively easy to reason about.

For reference, there are also several high performance variations and approximations, that reduce the number of mathematical operations required, at the cost of readability. For example:

## So, what does it do?

I’m going to start by looking at a simpler one-dimensional example.

We will begin with a signal, a linear ramp, like this:

The input signal has been sampled 8 times, each sample is independent from it’s neighboring samples; i.e. it represents the value of the signal at a specific point in space or time. A DCT transforms these discrete data points into a sum of cosine functions, each oscillating at a different frequencies and at different magnitudes.

Specifically, the DCT will transform this of 8 samples into 8 coefficients. Each coefficient will multiply a specific cosine frequency – altering that magnitude of that function and its corresponding impact on the reconstructed signal.

Let’s look at the formal definition of the forward and inverse one-dimensional DCT that we will be using.

First, we can focus just on the forward DCT transformation. We can translate the forward equation into the following JavaScript:

This function will take an array of samples and return an array of equal length DCT coefficients. Now, we can use this function to transform our input signal. Something like:

The resulting array will be 8 elements long and will look like this:

## Reconstructing the signal

Now that we have our set of coefficients, how do we transform it back into the original signal? For that, we use the inverse DCT.

Again, we can express this in JavaScript as:

Let use it to reconstruct our signal:

Again, this function returns the same number of samples as our coefficients. And aside from some small floating-point rounding errors, the reconstructed signal is identical to the original signal.

## Okay, but why?

Up to now, you may have noticed that each transformation has been of equivalent length; e.g., n samples become n coefficients and vice versa. So, how is this actually useful for compression?

Look again at the coefficients of our compressed signal:

Notice, that the first two coefficients have a relatively large magnitude, while the rest are fairly close to zero. This is because our source signal was a simple ramp: it’s value increased by 8 units at each sample.

As such, most of the energy of the signal can be expressed in the lower frequencies, while the higher frequencies have less of an overall impact on the desired signal. The DCT exploits this property; this is referred to as energy compaction.

If our initial signal was comprised of white noise, i.e. static, there would be less – if any – energy compaction. However many real-world samples, whether aural or visual, the signals tend to be somewhat ordered, and better suited for this type of energy compaction.

We use can quantization to squash our coefficients, which are currently real numbers, into a smaller range of integers. As a simplistic implementation, we can divide each coefficient by 50 the choice of 50 is a completely arbitrary selection on my part, it can be any number really for our purposes and truncate the result.

After quantization, we now only have two coefficients that have values, and a long run of values of zero. This set can be run-length encoded much smaller than the original set of samples.

This is fundamentally how the DCT is used for audio and visual compression.

## Lossy Compression

If you have a keen eye then you may have noticed something interesting during the quantization step in the last section. We truncated our real values into integers, i.e. we threw away some data.

While that made the data more compressible, what effect does that have on our reconstructed signal? Let’s find out!

First, we need to de-quantize our coefficients:

Notice they are not the same as the coefficients we calculated before, due to the truncation. Now, let’s run the inverse DCT and see what signal we get back:

At first glance, the reconstructed signal appears similar. However, on closer inspection, you can see they are actually different. That is because we threw away some of the smaller, high-frequency coefficients that were subtly adjusting the reconstructed signal. Without those frequencies, the new signal drifts away from the original.

However, compare them together on the same chart:

Not quite the same, but close – and that’s the idea. We can use the DCT, to transform our set into a more compressible set data that can be reconstructed into a similar signal, but not identical. We have, however, lost some of the original data in the process; that is what is meant by “lossy” compression.

By adjusting the quantization value (or using a quantization matrix), one can control the balance between compressibility and signal fidelity of the transformation.

## Next up…

Let’s explore deeper into the DCT:

• Visualization: Behold the waveforms of the DCT and how the signal is regenerated.
• Take it to the Next Dimension: Look at some 2D versions of the DCT.
• Indiana Jones: Explore some visual implications/artifacts of compression.