Since the transition from conventional to digital photography, most consumers have come to use the term JPEG as a synonym for digital images. It is common to hear someone refer to a digital image as a JPEG file, and most digital cameras on the market today are capable of storing photographs in this file format. As common as these files are in everyone's life, very few users actually know what goes on inside a JPEG file, and still fewer understand the many pitfalls associated with choosing it to store their digital photographs. This essay describes the inner workings of this commonly used file format, highlighting its many merits and its many flaws. Like any tool that is available to us, there are correct ways to use the JPEG file format and there are times when it is a poor choice. Once you understand how this format works, hopefully the choices will be obvious and you can more wisely take advantage of this truly remarkable technique.
The Joint Photographic Experts Group (JPEG) format for image storage represents a series of techniques aimed at reducing the redundancy present in most image data. Redundancy is defined as "the use of words or data that could be omitted without loss of meaning or function; repetition or superfluity of information" by the Oxford American Dictionary. Digital images show redundancy in data in several ways.
The brightness value or digital count of two neighboring pixels is more often than not similar, if not the same, in magnitude. That being the case, you would more often than not be correct if you were to guess that the next pixel's brightness count value in a row of picture elements was the same as the previous pixel's value. If that is indeed the case, then it can be said that we don't need to use all of the data in an image to represent the information in the photograph. This particular type of redundancy is known as inter-pixel redundancy.
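Inter-pixel redundancy can be made concrete with a small sketch. The row of digital counts below is hypothetical, and storing neighbor-to-neighbor differences is used only to illustrate the redundancy; it is not a step of the JPEG process described in this essay.

```python
def neighbor_differences(row):
    """Keep the first count, then store each pixel as its difference from the previous one."""
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]


def reconstruct(diffs):
    """Invert neighbor_differences by accumulating the differences."""
    row = [diffs[0]]
    for d in diffs[1:]:
        row.append(row[-1] + d)
    return row


# Hypothetical row of 8-bit digital counts: a smooth region followed by an edge.
row = [120, 120, 121, 121, 121, 122, 122, 180, 181, 181]
diffs = neighbor_differences(row)
print(diffs)  # [120, 0, 1, 0, 0, 1, 0, 58, 1, 0]
```

Most of the differences are 0 or 1, so guessing "same as the previous pixel" is usually right or nearly right; only the edge at the eighth pixel produces a large value.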
We represent photographs as digital images by using numbers between 0 (black) and 255 (white). Why this choice for a scale? These are the values that can be stored in a single byte of memory. So by design, we use 8 bits of data (8 bits = 1 byte) to represent every pixel's brightness in an image. This, however, is not the most efficient means of representing this information. Claude Shannon proposed many of the elements that make up modern information theory, the most relevant to this problem being that it is more efficient to use short words to represent more commonly occurring events and reserve longer words for those events that occur less frequently. Shannon's "events" in the case of digital imagery are pixels with particular digital counts. Following Shannon, we should use fewer bits to represent digital count values that occur more frequently in an image and greater numbers of bits to represent those digital counts that occur less often. This efficient means of representing digital counts is known as "coding" and reduces what is known as coding redundancy.
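How much coding redundancy an image contains can be estimated from its histogram. The sketch below computes the Shannon entropy of a list of digital counts, which is the lower bound on the average number of bits per pixel when each pixel is coded independently. The pixel values are hypothetical and the function name is chosen only for this illustration.

```python
import math
from collections import Counter


def entropy_bits_per_pixel(counts):
    """Shannon entropy of the digital-count histogram, in bits per pixel."""
    total = len(counts)
    histogram = Counter(counts)
    return -sum((n / total) * math.log2(n / total) for n in histogram.values())


# Hypothetical image with only four distinct digital counts, unevenly used.
pixels = [10] * 8 + [20] * 4 + [30] * 2 + [40] * 2
print(entropy_bits_per_pixel(pixels))  # 1.75
```

Stored at a fixed 8 bits per pixel, this data uses far more bits than the 1.75 bits per pixel that a well-chosen variable-length code would need.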
Lastly, the human visual system, as remarkable as it is, is easy to fool. If you reduce the number of digital count values that are used to represent the scale from black to white in a digital image, the human visual system is slow to pick up on the differences. For example, if the number of digital counts is reduced from 256 to 64 using straightforward grey-level quantization, very few observers will notice the difference (see figure below). This obviously will depend on the image content; however, the weaknesses of the human visual system can be exploited to reduce the third type of data redundancy in images known as psycho-visual redundancy.
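The grey-level quantization in this example can be sketched as follows. Representing each bin by its midpoint is an assumption made for the illustration; other choices (such as the bin's lowest value) are equally straightforward.

```python
def quantize_grey_levels(count, levels=64, max_count=255):
    """Map a digital count onto `levels` evenly spaced grey levels."""
    step = (max_count + 1) // levels           # 256 // 64 = 4 counts per level
    return (count // step) * step + step // 2  # represent each bin by its midpoint


quantized = [quantize_grey_levels(c) for c in range(256)]
print(len(set(quantized)))                                 # 64
print(max(abs(q - c) for c, q in enumerate(quantized)))    # 2
```

Every digital count moves by at most 2 out of 255, which is why so few observers notice the change.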
[Figure: panels (a) and (b), comparing the image before and after grey-level quantization.]
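So the committee that designed what is today referred to as JPEG set out to minimize each of these redundancies as much as possible and in a computationally efficient manner. This is a process known as image compression.
The JPEG image compression technique consists of 5 functional stages. They are (1) color space transformation, (2) subsampling of the chrominance channels, (3) the discrete cosine transform, (4) quantization of the transform coefficients, and (5) Huffman coding.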
The human visual system relies more on spatial content and acuity than it does on color for interpretation. For this reason, a color photograph, represented by red, green, and blue images, is transformed to a different color space that attempts to isolate these two components of image content, namely the YCC (also written YCbCr), or luminance/chrominance-blue/chrominance-red, color space. This color space transformation is performed on a pixel-by-pixel basis with the digital counts being converted according to the following rules

$$
\begin{aligned}
Y   &= 0.299\,R + 0.587\,G + 0.114\,B \\
C_b &= -0.1687\,R - 0.3313\,G + 0.5\,B + 2^{\text{Bit Depth}-1} \\
C_r &= 0.5\,R - 0.4187\,G - 0.0813\,B + 2^{\text{Bit Depth}-1}
\end{aligned}
$$
where the R, G, and B terms represent the red, green, and blue digital count for a particular pixel and the Bit Depth is the number of bits used to store each pixel's brightness value (typically 8 for most consumer cameras). An example transformation is shown below.
[Figure: an RGB image and its Y, Cb, and Cr component images.]
The luminance image carries the majority of the spatial information of the original image and is indeed just a weighted average of the original red, green, and blue digital count values for each pixel. The two chrominance images show very little spatial detail. This is fortuitous for the goal of compression.
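The pixel-by-pixel transformation can be sketched as below, assuming the coefficients commonly used with JPEG files (the JFIF convention). The function name is chosen for this illustration, and results are left as floating-point values; a real encoder would also round and clamp them to the valid range.

```python
def rgb_to_ycbcr(r, g, b, bit_depth=8):
    """Convert one pixel's R, G, B digital counts to Y, Cb, Cr (assumed JFIF coefficients)."""
    offset = 2 ** (bit_depth - 1)  # 128 for 8-bit data; centers the chrominance values
    y = 0.299 * r + 0.587 * g + 0.114 * b                # weighted average of R, G, B
    cb = -0.1687 * r - 0.3313 * g + 0.5 * b + offset     # blue-difference chrominance
    cr = 0.5 * r - 0.4187 * g - 0.0813 * b + offset      # red-difference chrominance
    return y, cb, cr


# A neutral grey pixel carries no color: both chrominance values sit at the offset.
print(rgb_to_ycbcr(128, 128, 128))
```

The luminance weights sum to 1, so a grey pixel keeps its brightness in Y, while each set of chrominance weights sums to 0, so grey pixels map to the mid-scale value in Cb and Cr.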
Before proceeding, the JPEG process subsamples each chrominance image to half the number of rows and columns. Since there is little spatial detail in these channels, the subsampling does not discard much meaningful data. This results in one quarter of the number of pixels that were in the original representations. The human visual system is, however, easily fooled, and the true color image formed by inverting this subsampling and color space transformation process is virtually indistinguishable from the original unless viewed at very high magnification.
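A minimal sketch of the subsampling step follows. Averaging each 2x2 neighborhood and restoring the size by pixel replication are assumptions made for this illustration (an encoder may instead simply drop samples, and a decoder may interpolate); the chrominance values are hypothetical and the dimensions are assumed to be even.

```python
def subsample_2x2(channel):
    """Halve the rows and columns by averaging each 2x2 neighborhood."""
    rows, cols = len(channel), len(channel[0])
    return [
        [
            (channel[r][c] + channel[r][c + 1]
             + channel[r + 1][c] + channel[r + 1][c + 1]) / 4
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def upsample_2x2(channel):
    """Restore the original size by repeating every sample twice in each direction."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


# Hypothetical 4x4 chrominance channel with little spatial detail.
cb = [
    [100, 102, 110, 110],
    [100, 102, 110, 110],
    [90, 90, 80, 84],
    [90, 90, 80, 84],
]
small = subsample_2x2(cb)
print(small)  # [[101.0, 110.0], [90.0, 82.0]]
```

The 16 original samples become 4, one quarter of the data, and the restored channel differs from the original by only a count or two.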
[Figure with three panels: (a) the Y, Cb, and Cr component images, (b), and (c).]
As the first two phases of the JPEG process attempt to take advantage of the weaknesses in the human visual system and reduce psycho-visual redundancy, the next phase attempts to exploit the inter-pixel redundancy present in most image data. If an image is broken up into small subsections or blocks, the likelihood that the pixels in these blocks will have similar digital count levels is high for the majority of the blocks throughout the image. Blocks that include high contrast image features such as edges will obviously not exhibit this behavior.
As can be seen in the previous figure, almost half of the blocks shown contain skin-toned pixels with very little high-frequency information. The advantageous result is that the frequency-domain representation of any block that exhibits grey-level constancy in the luminance or chrominance will consist of relatively few non-zero or significant values. The frequency-domain transformation chosen by the JPEG members is the discrete cosine transform (DCT). This was chosen over the more traditional Fourier transform since it produces real-valued rather than complex-valued transform coefficients, which are more easily stored in a compact fashion in memory. The DCT coefficients are computed for a one-dimensional function as follows (written here in the common orthonormal form).

$$
F(u) = C(u)\sqrt{\frac{2}{M}}\sum_{x=0}^{M-1} f(x)\cos\left[\frac{(2x+1)\,u\,\pi}{2M}\right],
\qquad
C(u) = \begin{cases} 1/\sqrt{2}, & u = 0 \\ 1, & u > 0 \end{cases}
$$

where M is the number of points in the function f(x). The DCT is performed on two-dimensional data sets as a series of consecutive one-dimensional transformations on the rows and subsequently the columns of the two-dimensional array. The inverse transformation to take the frequency-domain data back to the original spatial image data space is given by

$$
f(x) = \sqrt{\frac{2}{M}}\sum_{u=0}^{M-1} C(u)\,F(u)\cos\left[\frac{(2x+1)\,u\,\pi}{2M}\right]
$$
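Both one-dimensional transforms can be sketched in a few lines, assuming the orthonormal normalization of the DCT (other scalings are in use and differ only by constant factors). The function names are chosen for this illustration.

```python
import math


def dct_1d(f):
    """Forward DCT of a sequence f of M points (orthonormal scaling assumed)."""
    M = len(f)
    coefficients = []
    for u in range(M):
        scale = math.sqrt(1 / M) if u == 0 else math.sqrt(2 / M)
        total = sum(f[x] * math.cos((2 * x + 1) * u * math.pi / (2 * M)) for x in range(M))
        coefficients.append(scale * total)
    return coefficients


def idct_1d(F):
    """Inverse DCT: rebuild the M spatial values from the M coefficients."""
    M = len(F)
    values = []
    for x in range(M):
        total = 0.0
        for u in range(M):
            scale = math.sqrt(1 / M) if u == 0 else math.sqrt(2 / M)
            total += scale * F[u] * math.cos((2 * x + 1) * u * math.pi / (2 * M))
        values.append(total)
    return values


# A perfectly constant row of pixels: all of its content lands in the first coefficient.
constant_row = [100] * 8
F = dct_1d(constant_row)
print([round(c, 3) for c in F])
```

For the constant row, only the lowest-frequency coefficient is non-zero, and applying `idct_1d` to the coefficients recovers the original values to within floating-point error.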
So for each 8x8 block of pixels in the original luminance and chrominance images, the DCT is computed. For the majority of blocks in the image, only a small number of the 64 DCT coefficients in the 8x8 block will be significant in magnitude. The following figure illustrates the results of a DCT transformation on two blocks of varying degrees of grey-level constancy.
To this point, the entire JPEG process is completely reversible except for the losses due to subsampling of the two chrominance channels.
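The effect of grey-level constancy on the number of significant coefficients can be sketched by transforming two hypothetical 8x8 blocks, one flat and one containing a vertical edge. The orthonormal DCT scaling and the threshold of 1 for "significant" are assumptions made for this illustration.

```python
import math


def dct_1d(f):
    """Forward one-dimensional DCT (orthonormal scaling assumed)."""
    M = len(f)
    out = []
    for u in range(M):
        scale = math.sqrt(1 / M) if u == 0 else math.sqrt(2 / M)
        out.append(scale * sum(f[x] * math.cos((2 * x + 1) * u * math.pi / (2 * M))
                               for x in range(M)))
    return out


def dct_2d(block):
    """Two-dimensional DCT: transform every row, then every column of the result."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major order


def count_significant(coefficients, threshold=1.0):
    """Count coefficients whose magnitude exceeds the threshold."""
    return sum(1 for row in coefficients for c in row if abs(c) > threshold)


flat_block = [[100] * 8 for _ in range(8)]              # constant grey level
edge_block = [[50] * 4 + [200] * 4 for _ in range(8)]   # sharp vertical edge

print(count_significant(dct_2d(flat_block)))  # 1
print(count_significant(dct_2d(edge_block)))  # 5
```

The flat block needs a single coefficient, while the edge spreads its energy across several frequencies, all of them in the first row because the block does not vary from row to row.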
The next processing step in the chain of computations that make up JPEG image compression is the quantization of the DCT coefficients in each of the 8x8 blocks. It is at this step that the process is able to achieve the most compression; however, it is at the expense of image quality. The entire process becomes what is referred to in the image compression community as "lossy". The process is still reversible, but it can no longer exactly reproduce the original image data.
As we have already seen, the DCT coefficients tend to get smaller in magnitude as one moves away from the lowest frequency component (always located in the upper left-hand corner of the 8x8 block). Quantization divides each DCT coefficient by a prescribed factor that is unique to its position in the block and whose strength depends on the quality factor specified by the user. The JPEG committee prescribes one set of quantization factors for the luminance channel and another for both chrominance channels. These factors are used to divide, in a coefficient-by-coefficient manner, the DCT coefficients in each 8x8 block. Each scaled coefficient is then rounded to an integer value. The quantization factors are given in the following illustration.
The quantized DCT coefficients are computed by applying the quantization factors, represented as Q, to the DCT coefficients as

$$
F_Q(u,v) = \mathrm{round}\left(\frac{F(u,v)}{\alpha\,Q(u,v)}\right)
$$

The factor, $\alpha$, given in this equation is known as the scaling factor and is derived from a quality factor specified by the user. The quality factor is specified on a scale between 0 and 100 where a factor of 100 represents the best image quality (the least quantization). The relationship between the user-specified quality factor and the scaling factor is given by
and is illustrated in the following plot
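A minimal sketch of the quantization step is given below. It assumes the quality-to-scaling mapping popularized by the Independent JPEG Group's libjpeg software (this essay's own relationship may differ) and the example luminance table from Annex K of the JPEG standard. The function names, the symbol `alpha`, and the simplification of not rounding the scaled table to integers are choices made for this illustration.

```python
import math

# Example luminance quantization table (Annex K of the JPEG standard).
LUMINANCE_Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def scaling_factor(quality):
    """Assumed IJG-style mapping from quality (clamped to 1..100) to a table multiplier."""
    quality = max(1, min(100, quality))
    if quality < 50:
        return 50.0 / quality
    return 2.0 - quality / 50.0


def quantize(dct_block, quality, table=LUMINANCE_Q):
    """Divide each coefficient by its scaled factor and round to the nearest integer."""
    alpha = scaling_factor(quality)
    return [
        [int(math.floor(c / max(1.0, alpha * q) + 0.5)) for c, q in zip(c_row, q_row)]
        for c_row, q_row in zip(dct_block, table)
    ]


def dequantize(q_block, quality, table=LUMINANCE_Q):
    """Multiply back by the same factors; the rounding loss is not recovered."""
    alpha = scaling_factor(quality)
    return [
        [c * max(1.0, alpha * q) for c, q in zip(c_row, q_row)]
        for c_row, q_row in zip(q_block, table)
    ]


# Hypothetical DCT block: a large lowest-frequency term and one small detail term.
block = [[0.0] * 8 for _ in range(8)]
block[0][0] = 800.0
block[0][1] = 7.0
print(quantize(block, 50)[0][:2])                          # [50, 1]
print(dequantize(quantize(block, 50), 50)[0][:2])          # [800.0, 11.0]
```

The small coefficient returns as 11 rather than 7, which is the loss this step introduces. Under the assumed mapping, lowering the quality enlarges every divisor, so more coefficients round to zero.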
Once the DCT coefficients have been scaled, quantized, and converted to integer values, the data is ready for coding and storage. As noted at the beginning of this essay, Shannon showed that it is most efficient to store a message by using the shortest codewords for the most frequently occurring symbols and longer codewords for less probable symbols. At this stage, we have conditioned the DCT coefficients in such a way that they are ready for coding redundancy reduction.
The final step in the JPEG process is to use Huffman coding to represent the conditioned DCT coefficients in as efficient a manner as possible. As it is outside of the scope of this essay to give a complete description of Huffman coding, the reader is referred to the many web sites and textbooks that describe this topic in great detail. David Huffman gives the original description in his 1952 paper in the Proceedings of the I.R.E.
Now that the reader has been exposed to the inner workings of JPEG, some of the caveats of this widely used image compression technique must be mentioned.
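For readers who want a feel for the idea without the full treatment, the following is a minimal sketch of building a Huffman code from symbol frequencies. It codes hypothetical quantized coefficient values directly, which is a simplification; an actual JPEG encoder applies Huffman coding to a more elaborate representation of the coefficients.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return a dict mapping each symbol to its binary codeword (as a string)."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tie-breaker, partial code table for that subtree).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)  # the two least frequent subtrees
        n2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in codes1.items()}
        merged.update({s: "1" + w for s, w in codes2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical quantized coefficients: zeros dominate after quantization.
coefficients = [0] * 8 + [1] * 4 + [-1] * 2 + [5] * 2
code = huffman_code(coefficients)
total_bits = sum(len(code[c]) for c in coefficients)
print(total_bits)  # 28
```

The most frequent value, 0, receives a 1-bit codeword and the rarest values receive 3-bit codewords, so the 16 coefficients occupy 28 bits rather than the 128 bits needed at a fixed 8 bits each.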
[Figure: panels (a), (b), and (c).]