What is Inside a JPEG File

Carl Salvaggio, Rochester Institute of Technology

During the recent transition from conventional to digital photography, most consumers have come to use the term JPEG ubiquitously as a reference to digital images. It is common to hear someone refer to a digital image as a JPEG file. In general, most digital cameras on the market today are capable of storing photographs in this file format. As common as these files are in everyone's life, very few users actually know what goes on inside a JPEG file, and still fewer understand the many pitfalls associated with choosing this format to store their digital photographs. This essay will describe the inner workings of this file format that we so commonly use, highlighting its many merits and its many flaws. Like any tool that is available to us, there are correct ways to use the JPEG file format and there are times when it is a poor choice. Once you understand how this format works, hopefully the choices will be obvious and you can more wisely take advantage of this truly remarkable technique.

The Joint Photographic Experts Group (JPEG) format for image storage represents a series of techniques aimed at reducing the redundancy present in most image data. Redundancy is defined as "the use of words or data that could be omitted without loss of meaning or function; repetition or superfluity of information" by the Oxford American Dictionary. Digital images show redundancy in data in several ways.

The brightness value or digital count of two neighboring pixels is more often than not similar, if not the same, in magnitude. That being the case, you would more often than not be correct if you were to guess that the next pixel's brightness count value in a row of picture elements was the same as the previous pixel's value. If that is indeed the case, then it can be said that we don't need to use all of the data in an image to represent the information in the photograph. This particular type of redundancy is known as inter-pixel redundancy.
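The guess described above can be sketched in a few lines. This is only an illustration under stated assumptions: the row of digital counts is made up, and the function names are hypothetical. It predicts each pixel from its left neighbor and keeps only the prediction error, which is small wherever the neighbors are similar.

```python
# A made-up row of 8-bit digital counts (hypothetical data, not from the essay).
row = [118, 119, 119, 120, 120, 120, 121, 123, 180, 181, 181, 182]


def previous_pixel_residuals(pixels):
    """Keep the first pixel, then store each pixel minus its left neighbor."""
    residuals = [pixels[0]]
    for previous, current in zip(pixels, pixels[1:]):
        residuals.append(current - previous)
    return residuals


def reconstruct(residuals):
    """Undo previous_pixel_residuals by accumulating the differences."""
    pixels = [residuals[0]]
    for delta in residuals[1:]:
        pixels.append(pixels[-1] + delta)
    return pixels


residuals = previous_pixel_residuals(row)
print(residuals)               # mostly values near zero, one jump at the edge
print(reconstruct(residuals))  # identical to the original row
```

Most of the residuals are 0 or 1, so far fewer distinct values are needed to describe the row; the single large residual marks the edge where the guess fails.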

The manner in which we represent photographs as digital images is by using numbers between 0 (black) and 255 (white). Why this choice for a scale? These are the values that can be stored in a single byte of memory. So by design, we use 8 bits of data (8 bits = 1 byte) to represent every pixel's brightness in an image. This, however, is not the most efficient means of representing this information. Claude Shannon proposed many of the elements that make up modern information theory, the most relevant to this problem being that it is more efficient to use short words to represent more commonly occurring events and reserve longer words for those events that occur less frequently. Shannon's "events" in the case of digital imagery are pixels with particular digital counts. Following Shannon, we should use fewer bits to represent digital count values that occur more frequently in an image and greater numbers of bits to represent those digital counts that occur less often. This efficient means of representing digital counts is known as "coding" and reduces what is known as coding redundancy.
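How much coding redundancy a fixed 8-bit representation carries can be estimated with Shannon's entropy, the average number of bits per pixel that an ideal code would need. The sketch below uses a made-up list of digital counts, and the function name is hypothetical.

```python
import math
from collections import Counter


def entropy_bits_per_pixel(pixels):
    """Shannon entropy of the digital counts, in bits per pixel."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical image data: a few digital counts dominate.
pixels = [10] * 8 + [11] * 4 + [12] * 2 + [200] * 2

print(entropy_bits_per_pixel(pixels))            # far below the 8 bits actually stored
print(entropy_bits_per_pixel(list(range(256))))  # every count equally likely
```

When every digital count is equally likely, the entropy is the full 8 bits and nothing can be saved; the more skewed the histogram, the larger the gap between 8 bits and what an efficient code requires.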

Lastly, the human visual system, as remarkable as it is, is easy to fool. If you reduce the number of digital count values that are used to represent the scale from black to white in a digital image, the human visual system is slow to pick up on the differences. For example, if the number of digital counts is reduced from 256 to 64 using straightforward grey-level quantization, very few observers will notice the difference (see figure below). This obviously will depend on the image content; however, the weaknesses of the human visual system can be exploited to reduce the third type of data redundancy in images known as psycho-visual redundancy.

Figure 1a     Figure 1b
(a)   (b)
Figure 1. Psycho-visual redundancy - The difference between these digital images is often indistinguishable to the human visual system. The image represented in (a) uses 256 unique digital counts to represent the scale from black to white while the image in (b) uses only 64 unique values to represent this same scale.
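A straightforward grey-level quantization from 256 to 64 levels might look like the sketch below. The choice to represent each group of four counts by a value near the middle of the group is an assumption; the essay does not say how the image in Figure 1(b) was produced.

```python
def quantize_grey_levels(count, levels=64, bit_depth=8):
    """Map a digital count onto one of `levels` evenly spaced values."""
    step = (2 ** bit_depth) // levels        # 4 counts share one output value
    return (count // step) * step + step // 2


quantized = [quantize_grey_levels(c) for c in range(256)]
print(len(set(quantized)))                               # number of distinct output values
print(max(abs(q - c) for c, q in enumerate(quantized)))  # largest change to any count
```

No digital count moves by more than two levels, which is why the change is so hard to see in most photographs.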

So the committee that designed what is today referred to as JPEG set out to minimize each of these redundancies as much as possible and in a computationally efficient manner. This is a process known as image compression.

The JPEG image compression technique consists of 5 functional stages. They are

  1. an RGB to YCC color space conversion,
  2. a spatial subsampling of the chrominance channels in YCC space,
  3. the transformation of a blocked representation of the YCC spatial image data to a frequency domain representation using the discrete cosine transform,
  4. a quantization of the blocked frequency domain data according to a user-defined quality factor, and finally
  5. the coding of the frequency domain data, for storage, using Huffman coding.

The human visual system relies more on spatial content and acuity than it does on color for interpretation. For this reason, a color photograph, represented by a red, green, and blue image, is transformed to a different color space that attempts to isolate these two components of image content, namely the YCC or luminance/chrominance-blue/chrominance-red color space. This color space transformation is performed on a pixel-by-pixel basis with the digital counts being converted according to the following rules

Figure 2
Figure 2.

where the R, G, and B terms represent the red, green, and blue digital count for a particular pixel and the Bit Depth is the number of bits used to store each pixel's brightness value (typically 8 for most consumer cameras). An example transformation is shown below.

Figure 3a   Figure 3b   Figure 3c   Figure 3d
RGB   Y   Cb   Cr
Figure 3. RGB to YCC Conversion - The original RGB image and the computed luminance (Y), chrominance-blue (Cb), and chrominance-red (Cr) images.

The luminance image carries the majority of the spatial information of the original image and is indeed just a weighted average of the original red, green, and blue digital count values for each pixel. The two chrominance images show very little spatial detail. This is fortuitous for the goal of compression.
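The per-pixel conversion can be sketched as follows. The coefficients are those of the widely used JFIF convention and are assumed here to correspond to the rules given for the conversion above; the function names are hypothetical.

```python
def rgb_to_ycc(r, g, b, bit_depth=8):
    """Convert one pixel to luminance and two chrominance values (JFIF weights assumed)."""
    offset = 2 ** (bit_depth - 1)            # 128 for 8-bit data
    y = 0.299 * r + 0.587 * g + 0.114 * b    # weighted average of R, G, and B
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + offset
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + offset
    return y, cb, cr


def ycc_to_rgb(y, cb, cr, bit_depth=8):
    """Inverse of rgb_to_ycc, up to small rounding in the published constants."""
    offset = 2 ** (bit_depth - 1)
    r = y + 1.402 * (cr - offset)
    g = y - 0.344136 * (cb - offset) - 0.714136 * (cr - offset)
    b = y + 1.772 * (cb - offset)
    return r, g, b


print(rgb_to_ycc(128, 128, 128))             # a grey pixel has neutral chrominance
print(ycc_to_rgb(*rgb_to_ycc(200, 120, 40)))
```

A real encoder would also round and clamp the results to the valid range of digital counts; that step is left out to keep the arithmetic visible.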

Before proceeding, the JPEG process subsamples each of the chrominance images to half the number of rows and half the number of columns. Since there is little spatial detail in these channels, the subsampling does not discard much meaningful data. This results in one quarter of the number of pixels that were in the original chrominance representations. The human visual system is, however, easily fooled, and the resulting true color image that is formed by inverting this subsampling/color space transformation process is virtually indistinguishable from the original unless viewed at very high magnifications.
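The subsampling step can be sketched on a small chrominance channel. Two assumptions are made here that the essay does not specify: each 2x2 neighborhood is replaced by its average, and the inverse simply replicates each value into a 2x2 neighborhood. The channel values are made up.

```python
def subsample_2x2(channel):
    """Halve the rows and columns by averaging each 2x2 neighborhood (even sizes assumed)."""
    rows, cols = len(channel), len(channel[0])
    return [
        [
            (channel[r][c] + channel[r][c + 1]
             + channel[r + 1][c] + channel[r + 1][c + 1]) / 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def upsample_2x2(channel):
    """Restore the original size by repeating every value twice in each direction."""
    restored = []
    for row in channel:
        wide = [value for value in row for _ in range(2)]
        restored.append(wide)
        restored.append(list(wide))
    return restored


# Hypothetical 4x4 chrominance-blue channel.
cb = [
    [100, 102, 110, 110],
    [100, 102, 110, 110],
    [90, 90, 120, 124],
    [90, 90, 120, 124],
]
small = subsample_2x2(cb)
print(small)                 # 4 values stored instead of 16
print(upsample_2x2(small))   # close to, but not exactly, the original
```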

Figure 4a   Figure 4b   Figure 4c   Figure 4d
Y   Cb   Cr   (a)
  Figure 4e
  Figure 4f
Figure 4. Inverse subsampling/color space transform - The result of the (a) inverse subsampling/color space transformation is virtually indistinguishable from the original image shown in the previous figure. The effects of subsampling can be seen in the magnified image subsections shown for the (b) original image and the resulting (c) inverse transformed image.
Figure 5
Figure 5. Image blocks - A small section of the image previously shown that has been segmented into blocks that are 8x8 pixels in size.

As the first two phases of the JPEG process attempt to take advantage of the weaknesses in the human visual system and reduce psycho-visual redundancy, the next phase attempts to exploit the inter-pixel redundancy present in most image data. If an image is broken up into small subsections or blocks, the likelihood that the pixels in these blocks will have similar digital count levels is high for the majority of the blocks throughout the image. Blocks that include high contrast image features such as edges will obviously not exhibit this behavior.
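Breaking an image into blocks is simple to sketch. The image below is synthetic (one smooth block next to one block containing a vertical edge), and the function names are hypothetical; the range of digital counts within each block is used as a rough indicator of how well grey-level constancy holds.

```python
def split_into_blocks(image, size=8):
    """Cut an image (a list of rows) into size x size blocks, row by row."""
    rows, cols = len(image), len(image[0])
    blocks = []
    for top in range(0, rows, size):
        for left in range(0, cols, size):
            blocks.append([row[left:left + size] for row in image[top:top + size]])
    return blocks


def block_range(block):
    """Difference between the largest and smallest digital count in a block."""
    values = [value for row in block for value in row]
    return max(values) - min(values)


# Synthetic 8x16 image: a smooth left half and a right half containing an edge.
image = [
    [100 + (r + c) // 4 if c < 8 else (30 if c < 12 else 220) for c in range(16)]
    for r in range(8)
]
blocks = split_into_blocks(image)
print([block_range(block) for block in blocks])
```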

As can be seen in the previous figure, almost half of the blocks shown contain skin-toned pixels with very little high frequency information. The advantage of this is that the frequency-domain representation of the data in any block that exhibits grey-level constancy in the luminance or chrominance will consist of relatively few non-zero or significant values. The frequency domain transformation chosen by the JPEG committee is the discrete cosine transform (DCT). This was chosen over the more traditional Fourier transform since it produces real-valued rather than complex-valued transform coefficients, which are more easily stored in a compact fashion in memory. The DCT coefficients are computed for a one-dimensional function as follows.

Figure 6a
Figure 6b
Figure 6.

where M is the number of points in the function f(x). The DCT is performed on two-dimensional data sets as a series of consecutive one-dimensional transformations, first on the rows and subsequently on the columns of the two-dimensional array. The inverse transformation to take the frequency domain data back to the original spatial image data space is given by

Figure 7a
Figure 7b
Figure 7.
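The forward and inverse transforms can be sketched directly from their definitions. This sketch assumes the common orthonormal form of the DCT (a factor of sqrt(1/M) on the lowest frequency term and sqrt(2/M) on the others); the normalization used in the equations above may differ. The function names are hypothetical, and the two-dimensional versions apply the one-dimensional transform to the rows and then to the columns.

```python
import math


def _alpha(u, M):
    """Normalization factor for the orthonormal DCT (an assumed convention)."""
    return math.sqrt(1 / M) if u == 0 else math.sqrt(2 / M)


def dct_1d(f):
    M = len(f)
    return [
        _alpha(u, M) * sum(f[x] * math.cos((2 * x + 1) * u * math.pi / (2 * M))
                           for x in range(M))
        for u in range(M)
    ]


def idct_1d(F):
    M = len(F)
    return [
        sum(_alpha(u, M) * F[u] * math.cos((2 * x + 1) * u * math.pi / (2 * M))
            for u in range(M))
        for x in range(M)
    ]


def _transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dct_2d(block):
    rows_done = [dct_1d(row) for row in block]                  # transform the rows
    return _transpose([dct_1d(col) for col in _transpose(rows_done)])  # then the columns


def idct_2d(coefficients):
    cols_done = _transpose([idct_1d(col) for col in _transpose(coefficients)])
    return [idct_1d(row) for row in cols_done]


flat_block = [[100] * 8 for _ in range(8)]   # perfect grey-level constancy
coefficients = dct_2d(flat_block)
print(round(coefficients[0][0], 6))          # all of the energy is in this one term
```

For a perfectly constant block only the lowest frequency coefficient is non-zero, which is the extreme case of the behavior described above.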

So for each 8x8 block of pixels in the original luminance and chrominance images, the DCT is computed. For the majority of blocks in the image, only a small number of the 64 DCT coefficients in the 8x8 block will be significant in magnitude. The following figure illustrates the results of a DCT transformation on two blocks of varying degrees of grey-level constancy.

Figure 8a Figure 8b Figure 8c
Figure 8d
Figure 8e Figure 8f Figure 8g
Figure 8h
Figure 8. DCT transform - The discrete cosine transform coefficients represent the power of each frequency present in the sub-image blocks shown in the images to the left. The images in the center are a magnified version of the sub-image block shown. The images to the right represent a scaled visualization of the discrete cosine transform coefficients shown in the tables below each set of images. The data shown in (a) represent a smooth area in the original image while those shown in (b) represent a higher frequency region.


The DCT coefficients are computed for each 8x8 block of pixels in the image. To this point, the entire JPEG process is completely reversible except for the losses due to subsampling of the two chrominance channels.

The next processing step in the chain of computations that make up JPEG image compression is the quantization of the DCT coefficients in each of the 8x8 blocks. It is at this step that the process is able to achieve the most compression; however, it is at the expense of image quality. The entire process becomes what is referred to in the image compression community as "lossy". The process can still be inverted to produce an image; however, it can no longer exactly reproduce the original image data.

As we have already seen, the DCT coefficients generally get smaller in magnitude as one moves away from the lowest frequency component (always located in the upper left hand corner of the 8x8 block). Quantization scales each DCT coefficient by a prescribed factor, unique to its position in the block, whose strength depends on the quality factor specified by the user. The JPEG committee prescribes one set of quantization factors for the luminance channel and another for both chrominance channels. These factors are used to divide, coefficient by coefficient, the DCT coefficients in each 8x8 block. Each scaled coefficient is then rounded and converted to an integer value. The quantization factors are given in the following illustration.

Figure 9
Figure 9. DCT coefficient quantization factors - The discrete cosine transform coefficients quantization factors are given (a) for use with the luminance channel and (b) for use with both the chrominance-blue and chrominance-red channels.

The quantized DCT coefficients are computed by applying the quantization factors, represented as Q, to the DCT coefficients as

Figure 10
Figure 10.

The factor, ScaleFactor, given in this equation is known as the scaling factor and is derived from a quality factor specified by the user. The quality factor is specified on a scale between 0 and 100 where a factor of 100 represents the best image quality (the least quantization). The relationship between the user-specified quality factor and the scaling factor is given by

Figure 11a

and is illustrated in the following plot

Figure 11b
Figure 11. DCT quantization table scale factors - The DCT quantization scale factors are given as a function of the user specified quality factor.
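The quantization step can be sketched as follows, under several loudly stated assumptions. The table is the example luminance table from Annex K of the JPEG standard, assumed to match the luminance factors described above. The mapping from quality factor to scale factor follows the convention of the Independent JPEG Group's libjpeg (50/quality below 50, and 2 - quality/50 otherwise), which may not be the relationship plotted above. The function names are hypothetical, and the scale factor is folded into the table before dividing.

```python
# Example luminance quantization table from Annex K of the JPEG standard.
LUMINANCE_Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def scale_factor(quality):
    """Assumed libjpeg-style mapping from a quality factor to a scale factor."""
    quality = min(max(quality, 1), 100)      # clamp; 0 would divide by zero
    return 50 / quality if quality < 50 else 2 - quality / 50


def scaled_table(quality, table=LUMINANCE_Q):
    """Scale every quantization factor, keeping each within 1..255."""
    s = scale_factor(quality)
    return [[min(max(int(q * s + 0.5), 1), 255) for q in row] for row in table]


def quantize(coefficients, table):
    """Divide coefficient by coefficient and round to integers (the lossy step)."""
    return [[round(c / q) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coefficients, table)]


def dequantize(levels, table):
    """Multiply back; the rounding error cannot be recovered."""
    return [[v * q for v, q in zip(v_row, q_row)]
            for v_row, q_row in zip(levels, table)]


# Hypothetical DCT coefficients: a large low frequency term and two small ones.
coefficients = [[0.0] * 8 for _ in range(8)]
coefficients[0][0], coefficients[0][1], coefficients[7][7] = 800.0, -30.0, 5.0

table = scaled_table(50)
levels = quantize(coefficients, table)
restored = dequantize(levels, table)
print(levels[0][:2], levels[7][7])
print(restored[0][:2], restored[7][7])
```

The small high frequency coefficient is rounded to zero and is lost for good, while the others come back only approximately; lowering the quality factor enlarges the table entries and sends more coefficients to zero.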

Once the DCT coefficients have been scaled, quantized, and converted to integer values, the data is ready for coding and storage. As noted at the beginning of this essay, Shannon showed that it is most efficient to store a message by using the shortest codewords for the most frequently occurring symbols and longer codewords for less probable symbols. At this stage, we have conditioned the DCT coefficients in such a way that they are ready for coding redundancy reduction.

The final step in the JPEG process is to use Huffman coding to represent the conditioned DCT coefficients in as efficient a manner as possible. As it is outside of the scope of this essay to give a complete description of Huffman coding, the reader is referred to the many web sites and textbooks that describe this topic in great detail. David Huffman gives the original description in his 1952 paper in the Proceedings of the I.R.E.
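For readers who want a feel for the idea before consulting those references, here is a minimal sketch of Huffman code construction. It is not the JPEG entropy coder: baseline JPEG applies Huffman codes to symbols derived from the quantized coefficients rather than to the raw values used here. The input data and function name are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code that gives frequent symbols the shortest codewords."""
    counts = Counter(symbols)
    if len(counts) == 1:                      # degenerate case: one distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: codeword so far}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)     # the two least frequent groups
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical quantized coefficients: zero dominates, as it does after quantization.
values = [0] * 8 + [1] * 4 + [-1] * 2 + [5] * 2
code = huffman_code(values)
total_bits = sum(len(code[v]) for v in values)
print(code)
print(total_bits, "bits instead of", 8 * len(values))
```

The most common value receives a one-bit codeword and the rarest values receive three bits each, which is exactly the trade described by Shannon.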

Now that the reader has been exposed to the inner workings of JPEG, some of the caveats of this widely used image compression technique must be mentioned.

  1. This technique is best applied to photographs of natural scenes.
  2. This technique works best when the assumption of grey-level constancy is valid in 8x8 image blocks. The technique will prove less effective, providing less compression, when this assumption is violated. Images with a lot of noise, for example those taken at high ISO speed settings in low-light situations with a digital camera, will not compress as well as those that receive plenty of exposure and contain less noise.
  3. The use of JPEG for the storage of line art is not recommended. There are a lot of areas in images of line art that are constant grey level, and this is good, but the main content of interest in this type of image is the black lines on a white background, which the user expects to be crisp and sharp. The use of JPEG will result in unacceptable artifacts in blocks where there is high contrast.
     Figure 12a Figure 12b
     Figure 12. JPEG effects on line art - The image on the left is an original digital image of the serif on a lowercase "a". The image on the right shows the artifacts that are seen when this image is stored as a JPEG file with a user-specified quality factor of 20.
  4. Never use JPEG as an intermediate storage format. JPEG should only be used to store the final image that results from your processing steps. Each time an image is saved as a JPEG the quantization step is applied again, so the losses accumulate with every save. If you take a picture, remove the red-eye, sharpen the edges, and then want to display it on the web, JPEG should be used only for the final storage of the processed image for publishing.
  5. One final note. As convenient as it is to store hundreds of images on the storage media in your digital camera, you might want to reconsider using JPEG as the default storage format (see the figure below). If you are going to be processing the images after you download them from your camera, you may want to consider a lossless format such as TIFF or RAW for use in your camera.
Figure 13a   Figure 13b   Figure 13c
(a)   (b)   (c)
Figure 13. Effects of JPEG - The effects of using the JPEG file format are shown. The image shown in (a) is the original image. Image (b) shows the results of JPEG compression using a "medium" quality setting on a digital camera. Image (c) shows the results of JPEG compression using a "low" quality setting on a digital camera. While higher quality settings will not result in such objectionable images, the photographer should be cognizant that these 8x8 blocking artifacts will always be present to some degree in any image saved as a JPEG file from a digital camera or saved as a processed product from an image processing program.