Tutorial: File Digest
A file digest is a short digital summary of a file. The digest is used to ensure that a known file is being evaluated.
Digest Process
When evaluating evidence, an analyst needs some method for verifying that they have the right data. This ensures that they are analyzing the correct file and not some file variant that may have been altered.
There are different ways to summarize a file. The most common methods are based on meta properties and cryptographic checksums.
Meta properties consist of information about a file. These are typically the type of file (e.g., PNG or JPEG), file size, and picture dimensions. If a picture does not have the correct dimensions or has a different file size, then the analyst can immediately identify that the file is different from the expectation.
It is usually a good idea to also track the number of color channels. A picture with one (1) color channel is monochrome. Three (3) color channels are usually translated as RGB, but may actually be a JPEG's YUV data streams. Four (4) color channels are usually RGB with a transparency (alpha) channel -- even if the transparency is unused -- but it can also be a JPEG encoded with CMYK or similar color transformation information.
While the file's name is typically recorded as a meta property, the name is less important than other information. This is because software, services, and examiners may rename files. In addition, changes may be saved to the same file name, altering the data without changing the name. File names are usually not unique and are frequently altered.
It is important to remember that none of the meta properties are unique. Two very different pictures can have the same dimensions, file sizes, etc. Moreover, changes can be made inside the file, such as altering a comment or modifying a timestamp, without altering the dimensions or file size. Meta properties cannot identify tampering.
Meta properties provide a simple way to summarize the picture. For example, if you identify the image as a JPEG that is 300x500 but your coworker says it should be a PNG that is 350x600, then you know you have the wrong file.
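The comparison described above can be sketched in Python. This is a minimal illustration rather than a complete parser: it handles only PNG files, and the function names png_meta and meta_mismatches are hypothetical. It reads the file type, file size, dimensions, and channel count from the PNG header, then lists the expected properties that the file does not match.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Number of channels implied by the color type field of a PNG header.
PNG_CHANNELS = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}


def png_meta(data: bytes) -> dict:
    """Summarize the meta properties of a PNG file given its raw bytes."""
    # A PNG starts with an 8-byte signature followed by the IHDR chunk.
    if len(data) < 33 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR data: width (4 bytes), height (4), bit depth (1), color type (1).
    width, height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "type": "PNG",
        "size": len(data),
        "width": width,
        "height": height,
        "channels": PNG_CHANNELS.get(color_type),
    }


def meta_mismatches(actual: dict, expected: dict) -> list:
    """Return the names of expected properties that the file does not match."""
    return [key for key, value in expected.items() if actual.get(key) != value]
```

An empty list from meta_mismatches only means the file is consistent with the expectation. It does not show that the file is unaltered.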
Timestamps
Files are associated with timestamps that denote when the file was created or modified. These file timestamps are independent of the picture's metadata timestamp. They indicate the latest possible time that the file could have been modified.
Web servers typically report timestamps using the HTTP 'Last-Modified' field. This field provides the time associated with the file on the server. However, dynamic web pages may report the current time or the time when the content was cached.
Timestamps are typically unreliable for determining when a picture was created. The simple act of copying a file or viewing a file may update timestamps. And downloading a file to your computer may set the timestamp to the download time and not the Last-Modified time. In addition, computer clocks may be inaccurate (usually by seconds or minutes) and it is trivial for users to intentionally backdate a file's timestamp.
However, timestamps can be very useful because (assuming they are not intentionally altered) they indicate the latest possible time that the file could have been modified. Internal metadata timestamps that are dated after the file appeared on the computer system may be indications of tampering. And if a file appears with different filenames and different timestamps, then the oldest timestamp may be related to the original filename.
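The timestamp comparison can be sketched as follows. The helper names are hypothetical, and the five-minute tolerance for clock error is an arbitrary assumption, not a recommended value. All datetimes passed to the comparison are assumed to be timezone-aware.

```python
import os
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def file_mtime(path: str) -> datetime:
    """Return the file system modification time as an aware UTC datetime."""
    return datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)


def last_modified(header_value: str) -> datetime:
    """Parse the value of an HTTP Last-Modified header."""
    return parsedate_to_datetime(header_value)


def metadata_postdates_file(metadata_time: datetime, file_time: datetime,
                            tolerance: timedelta = timedelta(minutes=5)) -> bool:
    """True if the internal metadata time is later than the file time.

    The tolerance (an assumed value) allows for small clock errors.
    """
    return metadata_time > file_time + tolerance
```

A True result is only a prompt for closer inspection. A mis-set camera clock could produce the same result without any tampering.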
Cryptographic Checksums
In contrast to meta properties, cryptographic checksums (also called hashes or digests) act like digital fingerprints. It is extremely unlikely for two different files to have the same cryptographic checksum values. The most common cryptographic checksum algorithms are:
- MD5: The Message-Digest Algorithm 5 (MD5) generates a 128-bit digest of the file. The hash is typically written as 32 hexadecimal characters (0-9 and a-f).
- SHA1: The Secure Hash Algorithm Version 1 (written SHA1 or SHA-1) is similar to MD5, but it generates a 160-bit hash value. Compared to MD5, SHA1's longer hash size and alternate computation method lower the likelihood of a hash collision, where two different files generate the same hash value.
- SHA256: The Secure Hash Algorithm Version 2 (SHA2) was designed to replace SHA1 due to a theoretical mathematical weakness. Unlike SHA1, SHA2 defines a family of functions that vary by bit size: 224, 256, 384, or 512 bits. Each function is identified by its bit length. For example, SHA256 is the 256-bit SHA2 hash function. SHA2 has since been joined by SHA3, a newer family built on a different internal design that offers the same range of hash sizes.
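As an illustration, the three algorithms listed above are all available in Python's standard hashlib module. The sketch below computes them in a single pass over a file read in chunks. The function names and the 64 KB chunk size are assumptions chosen for the example.

```python
import hashlib


def file_chunks(path: str, chunk_size: int = 65536):
    """Yield the contents of a file in fixed-size chunks."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def digest_chunks(chunks, algorithms=("md5", "sha1", "sha256")) -> dict:
    """Feed every chunk to each hash and return the hex digests by name."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    for chunk in chunks:
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Calling digest_chunks(file_chunks(path)) returns hexadecimal strings of 32, 40, and 64 characters for MD5, SHA1, and SHA256 respectively. Reading in chunks keeps memory use constant for large files.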
With cryptographic checksums, a single file will always generate the same hash value. Any minor change to the file will cause a significantly different result. Even if the files have the same size and appearance, a single byte change will alter the digest. Moreover, the cryptographic complexity means that it is virtually impossible for someone to fiddle with the bytes in order to match the original checksum.
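A small experiment shows both properties. The sketch below uses made-up sample bytes in place of a real picture and a hypothetical helper, flip_one_bit, to change a single bit without changing the length.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA1 digest of the data as a hexadecimal string."""
    return hashlib.sha1(data).hexdigest()


def flip_one_bit(data: bytes, index: int) -> bytes:
    """Return a copy of the data with the lowest bit of one byte flipped."""
    changed = bytearray(data)
    changed[index] ^= 0x01
    return bytes(changed)


# Placeholder content standing in for the bytes of a picture file.
original = b"example image bytes" * 100
altered = flip_one_bit(original, 42)

# Same input, same digest: hashing is deterministic.
same_again = sha1_hex(original) == sha1_hex(original)
# Same size, one bit changed: the digests no longer match.
still_matches = sha1_hex(original) == sha1_hex(altered)
```

Here same_again is True and still_matches is False, even though the two inputs have identical lengths and differ by a single bit.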
These digests can be used to detect tampering. By verifying a file's hash value, an analyst can confirm that they are evaluating the correct file. If the hashes differ, then it is either the wrong file or the evidence has been altered. A different hash value means that at least one byte was changed, but it does not identify what was changed, who changed it, or when the change occurred.
Although hash collisions are technically possible, it is extremely unlikely for two image files to contain similar pictures, have valid file formats, and generate the same cryptographic checksum values. When using multiple digests to confirm a digital picture (e.g., using both MD5 and SHA1, or SHA1 and file size to identify a valid JPEG), it becomes effectively impossible to have a hash collision.
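Checking several recorded values at once can be sketched as below. The record layout (a dictionary with optional "size", "md5", and "sha1" keys) and the function name failed_checks are assumptions made for this example, not a standard format.

```python
import hashlib


def failed_checks(data: bytes, record: dict) -> list:
    """Compare a file's bytes against a recorded set of values.

    The record may hold any of the keys "size", "md5", and "sha1".
    Returns the names of the checks that failed; an empty list means
    every recorded value matched.
    """
    actual = {
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }
    failures = []
    for key, expected in record.items():
        # Hex digests are compared case-insensitively.
        if isinstance(expected, str):
            expected = expected.lower()
        if actual[key] != expected:
            failures.append(key)
    return failures
```

A file is accepted only when the returned list is empty, so a mismatch in any one recorded value is enough to reject it.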
In general, MD5 and SHA1 are commonly used for file checksums. SHA1 is more robust than MD5, but MD5 is typically complex enough for hashing pictures. The SHA2 family of functions is better suited to security-sensitive applications, such as digital signatures for encrypted data streams. While less common, SHA256 has been used as a checksum for authenticating sensitive evidence files.
SHA1 and MD5 are considered 'old' algorithms, but they are still widely used in the areas of image analysis and related forensics. (They have been around for decades and are not going away anytime soon.) Other algorithms, such as SHA256, SHA384, SHA512, etc. are used in niche fields, but are not consistently used across multiple fields. (For example, one lab may use SHA256, while another may prefer SHA512. But they probably both support SHA1 and MD5.)
Other Digests
There are many other types of checksums. Some, like CRC-16 and CRC-32, are used for quickly checking consistency. However, these cyclic redundancy check (CRC) hashes are not unique and have frequent collisions. Detecting the same CRC-32 value on two files is not an indication that the files are the same. Different CRC values, on the other hand, do denote a difference in the files.
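This one-sided test can be sketched with the CRC-32 function in Python's standard zlib module. The function names and the "different" and "inconclusive" labels are choices made for this example.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """Return the CRC-32 of the data as eight hexadecimal characters."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


def compare_by_crc(first: bytes, second: bytes) -> str:
    """Classify two byte strings using only their CRC-32 values."""
    if crc32_hex(first) != crc32_hex(second):
        # Different CRC values prove that the contents differ.
        return "different"
    # Matching CRC values prove nothing; a stronger check is needed.
    return "inconclusive"
```

When the result is "inconclusive", a cryptographic checksum or a byte-for-byte comparison is needed to decide whether the files are the same.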
Even among cryptographic hash functions, there is a wide variety of algorithms and hash sizes. For example, MD4 is a much weaker alternative to MD5, and most of the SHA family of algorithms, such as SHA2's SHA-384 and SHA3's SHA3-512, are uncommon outside of strong cryptographic systems.
FotoForensics Digest Information
For digital computer evidence, the most commonly recorded information consists of the picture's type, dimensions, file size, and either the MD5 or SHA1 checksum values. (Within FotoForensics, each file's ID consists of the SHA1 digest and file's size.)
The digests provided by FotoForensics include:
This information is enough for an analyst to verify that they are examining the correct file. It can also be used to ensure that a file was not altered by the upload or storage process.