Tutorial: File Digest

A file digest is a short digital summary of a file. The digest is used to ensure that a known file is being evaluated.

Digest Process
Meta Properties
Timestamps
Cryptographic Checksums
Other Digests

Digest Process

When evaluating evidence, an analyst needs some method for verifying that they have the right data. This ensures that they are analyzing the correct file and not some file variant that may have been altered.

There are different ways to summarize a file. The most common methods are based on meta properties and cryptographic checksums.

Meta Properties

Meta properties consist of information about a file. These are typically the type of file (e.g., PNG or JPEG), file size, and picture dimensions. If a picture does not have the correct dimensions or has a different file size, then the analyst can immediately identify that the file is different from the expectation.

It is usually a good idea to also track the number of color channels. A picture with one (1) color channel is monochrome. Three (3) color channels are usually translated as RGB, but may actually be a JPEG's YUV data streams. Four (4) color channels are usually RGB with a transparency (alpha) channel -- even if the transparency is unused -- but it can also be a JPEG encoded with CMYK or similar color transformation information.

While the file's name is typically recorded as a meta property, the name is less important than other information. This is because software, services, and examiners may rename files. In addition, changes may be saved to the same file name, altering the data without changing the name. File names are usually not unique and are frequently altered.

It is important to remember that none of the meta properties are unique. Two very different pictures can have the same dimensions, file sizes, etc. Moreover, changes can be made inside the file, such as altering a comment or modifying a timestamp, without altering the dimensions or file size. Meta properties cannot identify tampering.

Meta properties provide a simple way to summarize the picture. For example, if you identify the image as a JPEG that is 300x500 but your coworker says it should be a PNG that is 350x600, then you know you have the wrong file.
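As a rough illustration of comparing meta properties against an expectation, the following Python sketch reads the type, file size, dimensions, and channel count from a PNG header. The function names (png_meta, matches_expectation) and the sample bytes are hypothetical and exist only for this example; the channel counts follow the color types defined by the PNG specification.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# PNG color type -> number of channels (0 gray, 2 RGB, 3 palette,
# 4 gray+alpha, 6 RGB+alpha).
PNG_CHANNELS = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}


def png_meta(data: bytes) -> dict:
    """Return basic meta properties read from a PNG file's IHDR chunk."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR begins with: width (4 bytes), height (4), bit depth (1), color type (1)
    width, height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "type": "PNG",
        "file_size": len(data),
        "dimensions": (width, height),
        "channels": PNG_CHANNELS.get(color_type),
    }


def matches_expectation(meta: dict, expected: dict) -> bool:
    """True only if every expected property has the expected value."""
    return all(meta.get(key) == value for key, value in expected.items())


# Hypothetical sample: just the first 26 bytes of a 350x600 RGB PNG.
sample = PNG_SIGNATURE + struct.pack(">I4sIIBB", 13, b"IHDR", 350, 600, 8, 2)
meta = png_meta(sample)
print(meta)
print(matches_expectation(meta, {"type": "PNG", "dimensions": (350, 600)}))
```

A mismatch proves that the file is not the expected one. A match proves very little, since many different files share the same meta properties.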

Timestamps

Files are associated with timestamps that denote when the file was created or modified. These file timestamps are independent of the picture's metadata timestamp. They indicate the latest possible time that the file could have been modified.

Web servers typically report timestamps using the HTTP 'Last-Modified' field. This field provides the time associated with the file on the server. However, dynamic web pages may report the current time or the time when the content was cached.
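For reference, a Last-Modified value can be turned into a timestamp with Python's standard library. This is a minimal sketch; the header value shown is a made-up example, not one taken from a real server, and parse_last_modified is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_last_modified(header_value: str) -> datetime:
    """Parse an HTTP date such as the value of a Last-Modified header."""
    return parsedate_to_datetime(header_value)


# Hypothetical header value, used only for illustration.
reported = parse_last_modified("Wed, 21 Oct 2015 07:28:00 GMT")
print(reported.isoformat())

# The file cannot have been modified on the server after this time,
# assuming the server reported the file's time and not a cache time.
print(reported <= datetime.now(timezone.utc))
```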

Timestamps are typically unreliable for determining when a picture was created. The simple act of copying a file or viewing a file may update timestamps. And downloading a file to your computer may set the timestamp to the download time and not the Last-Modified time. In addition, computer clocks may be inaccurate (usually by seconds or minutes) and it is trivial for users to intentionally backdate a file's timestamp.
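To show how little effort backdating takes, this Python sketch creates a temporary file and sets its access and modification times to an arbitrary date in the past. The file and the chosen date are assumptions made for the demonstration.

```python
import os
import tempfile

# Create a throwaway file; its timestamps start out as "now".
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"example data")
    path = handle.name

original_mtime = os.stat(path).st_mtime

# Backdate the access and modification times to 2000-01-01 00:00:00 UTC.
backdated = 946684800
os.utime(path, (backdated, backdated))
new_mtime = os.stat(path).st_mtime

print("before:", original_mtime)
print("after: ", new_mtime)
os.remove(path)
```

Nothing in the file's contents changes, which is why a file timestamp alone cannot establish when a picture was created.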

However, timestamps can be very useful because (assuming they are not intentionally altered) they indicate the latest possible time that the file could have been modified. Internal metadata timestamps that are dated after the file appeared on the computer system may be indications of tampering. And if a file appears with different filenames and different timestamps, then the oldest timestamp may be related to the original filename.

Cryptographic Checksums

In contrast to meta properties, cryptographic checksums (also called hashes or digests) act like digital fingerprints. It is extremely unlikely for two different files to have the same cryptographic checksum values. The most common cryptographic checksum algorithms are:

MD5: The Message-Digest Algorithm 5 (MD5) generates a 128-bit digest of the file. The hash is typically written as 32 hexadecimal characters.
SHA1: The Secure Hash Algorithm Version 1 (written SHA1 or SHA-1) is similar to MD5, but it generates a 160-bit hash value. Compared to MD5, SHA1's longer hash size and alternate computation method lower the likelihood of a hash collision, where two different files generate the same hash value.
SHA256: The Secure Hash Algorithm Version 2 (SHA2) was designed to replace SHA1 due to a theoretical mathematical weakness. Unlike SHA1, SHA2 defines a family of functions that vary by bit size: 224, 256, 384, or 512 bits. Each function is identified by the bit length. For example, SHA256 is the 256-bit SHA2 hash function. Along with SHA2 is SHA3, a newer family that offers the same hash sizes but uses a different internal design.

With cryptographic checksums, a single file will always generate the same hash value. Any minor change to the file will cause a significantly different result. Even if the files have the same size and appearance, a single byte change will alter the digest. Moreover, the cryptographic complexity means that it is virtually impossible for someone to fiddle with the bytes in order to match the original checksum.
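The effect of a single-byte change can be seen with Python's hashlib module. This sketch hashes two made-up byte strings that have the same length and differ only in their last byte; the digests helper is a name chosen for this example.

```python
import hashlib

ALGORITHMS = ("md5", "sha1", "sha256")


def digests(data: bytes) -> dict:
    """Return the hex digest of data for each algorithm in ALGORITHMS."""
    return {name: hashlib.new(name, data).hexdigest() for name in ALGORITHMS}


original = b"The quick brown fox"
altered = b"The quick brown foy"  # same length, only the last byte differs

before = digests(original)
after = digests(altered)
for name in ALGORITHMS:
    # Each hex character encodes 4 bits of the digest.
    print(name, len(before[name]) * 4, "bits")
    print("  original:", before[name])
    print("  altered: ", after[name])
```

Hashing the same bytes again always returns the same values, while the altered input produces digests that bear no obvious resemblance to the originals.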

Representation: Base16 and Base32

When we think of numbers, we typically think of 10 digits: 0 through 9. In mathematics, that format is called base10.

Cryptographic hashes generate long sequences of binary data. For computers, it is easiest to represent these using 16 characters (0 through 9 and 'A' through 'F'). This is called base16, hexadecimal, or hex; these names are equivalent. A single byte of data can be represented as two base16 characters. It is very easy for a computer to convert between base10 and base16; they are just different ways of representing the same numbers.

In some cases, systems prefer to use base32 instead of base16. The character string for base32 is 20% shorter than the string produced by base16, yet it represents the same numeric value. Base32 uses the characters 'a' through 'z' and '2' through '7'. The digits '0' and '1' are skipped since they can be confused with the letters 'o' and 'l'.

SHA1 generates 20 bytes (160 bits) of data for the hash value. This binary value can be represented using the base10, base16, or base32 formats:

Base32: govpyuzfdw7ncnv7ph5q55mxy6byxxwv
Base16: 33aafc53251dbed136bf79fb0ef597c7838bded5
Base10: 294,971,636,584,593,162,376,392,554,553,563,789,178,727,816,917

If a human needs to manually compare two hashes, then it is easier to compare the base16 or base32 strings than the longer base10 number.
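The three representations can be reproduced from one another. As a small sketch using Python's standard library, the following converts the base16 string above into its raw bytes, then into base32 and base10:

```python
import base64

hex_digest = "33aafc53251dbed136bf79fb0ef597c7838bded5"

raw = bytes.fromhex(hex_digest)                       # 20 bytes (160 bits)
as_base32 = base64.b32encode(raw).decode("ascii").lower()
as_base10 = int(hex_digest, 16)

print("Base32:", as_base32, f"({len(as_base32)} characters)")
print("Base16:", hex_digest, f"({len(hex_digest)} characters)")
print("Base10:", f"{as_base10:,}", f"({len(str(as_base10))} digits)")
```

The base32 string has 32 characters against 40 for base16, which is the 20% reduction described earlier.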

Today, most systems use base16 to represent hashes. However, some newer systems are moving to base32 because it produces shorter strings.

These digests can be used to detect tampering. By verifying a file's hash value, an analyst can confirm that they are evaluating the correct file. If the hashes differ, then it is either the wrong file or the evidence has been altered. A different hash value means that at least one byte was changed, but it does not identify what was changed, who changed it, or when the change occurred.
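One way to perform this verification is to hash the file in chunks and compare the result to the recorded value. The sketch below is a hedged example, not a prescribed procedure: file_sha1 and is_expected_file are hypothetical helper names, and the temporary file stands in for a piece of evidence.

```python
import hashlib
import os
import tempfile


def file_sha1(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_expected_file(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA1 to a recorded value, ignoring letter case."""
    return file_sha1(path) == expected_hex.strip().lower()


# Demonstration with a throwaway file that contains the bytes "abc".
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"abc")
    demo_path = handle.name

matches = is_expected_file(demo_path, "A9993E364706816ABA3E25717850C26C9CD0D89D")
mismatch = is_expected_file(demo_path, "0" * 40)
os.remove(demo_path)
print(matches, mismatch)
```

A False result only says that the bytes differ from the recorded file; it says nothing about what changed.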

Hash Collisions

While almost perfect, cryptographic checksums are vulnerable to hash collisions. This happens when two different files have the same checksum value.

Each checksum is a small digest of the entire file. For example, the MD5 checksum is only 16 bytes long (128 bits). There are 256 times as many possible 17-byte files as there are MD5 values, so if you generate every combination of 17-byte long files, at least one MD5 checksum must be shared by 256 or more files. Fortunately, files that are not random data do not cover every possible bit combination. This sparse coverage means that two valid pictures with the same MD5 are likely the same picture.
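The counting argument is impractical to run at MD5's full size, but it can be illustrated at a toy scale. The sketch below assumes a deliberately tiny 8-bit digest (the first byte of the MD5 digest, a stand-in chosen for this example) and hashes every possible 2-byte input, which mirrors the ratio of 256 inputs per digest value.

```python
import hashlib
from collections import Counter


def tiny_hash(data: bytes) -> int:
    """Toy 8-bit digest: the first byte of the MD5 digest (illustration only)."""
    return hashlib.md5(data).digest()[0]


# Every possible 2-byte "file": 65,536 inputs but only 256 possible toy digests.
counts = Counter(tiny_hash(n.to_bytes(2, "big")) for n in range(2 ** 16))
most_common_digest, shared_by = counts.most_common(1)[0]

print("distinct digests:", len(counts))
print("largest group of inputs sharing one digest:", shared_by)
```

Because 65,536 inputs are spread across at most 256 values, the largest group can never contain fewer than 256 inputs.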

Methods have been developed to intentionally generate two files with the same MD5 hash values. The MD5 attack requires changing large blocks of data that contain random-looking values; the attack changes the file's contents and may alter the file's size. For SHA1, the attack requires trying an estimated 2⁶¹ (2,305,843,009,213,693,952) variations of the file; a determined attacker is unlikely to find a way to replicate a known hash value on a tampered file, even if they have a few years. SHA256 needs even more variations to intentionally generate a specific hash value.

In 2017, the first intentional SHA1 collision with a valid file format was identified (16 years after SHA1 became a standard). This research required 6,500 years' worth of computations, which the researchers divided among 6,500 computers (so it took them a year). So far, there have not been any collisions identified with SHA256, but that's only a matter of time. To date, there have been no detected hash collisions that span these algorithms; even if collision files have the same SHA1 checksums, they have different MD5 values.

Although hash collisions are technically possible, it is extremely unlikely for two image files to contain similar pictures, have valid file formats, and generate the same cryptographic checksum values. When using multiple digests to confirm a digital picture (e.g., using both MD5 and SHA1, or SHA1 and file size to identify a valid JPEG), it becomes effectively impossible to have a hash collision.
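Combining digests is straightforward in practice. As a minimal sketch, the hypothetical fingerprint function below bundles the file size with the MD5 and SHA1 values, and two files are treated as the same only when every component matches.

```python
import hashlib


def fingerprint(data: bytes) -> tuple:
    """Combine several independent digests: file size, MD5, and SHA1."""
    return (
        len(data),
        hashlib.md5(data).hexdigest(),
        hashlib.sha1(data).hexdigest(),
    )


def same_file(a: bytes, b: bytes) -> bool:
    """Treat two files as identical only if every digest agrees."""
    return fingerprint(a) == fingerprint(b)


# Made-up byte strings standing in for file contents.
evidence = b"example picture bytes"
copy = bytes(evidence)
variant = b"example picture bytez"

print(same_file(evidence, copy))     # identical contents
print(same_file(evidence, variant))  # one byte differs
```

An attacker would need a single alteration that collides under every digest at once, which is far harder than defeating any one of them.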

In general, MD5 and SHA1 are commonly used for file checksums. SHA1 is more robust than MD5, but MD5 is typically complex enough for hashing pictures. The SHA2 family of functions is better suited to security-sensitive applications, such as digital signatures for encrypted data streams. While less common, SHA256 has been used as a checksum for authenticating sensitive evidence files.

SHA1 and MD5 are considered 'old' algorithms, but they are still widely used in the areas of image analysis and related forensics. (They have been around for decades and are not going away anytime soon.) Other algorithms, such as SHA256, SHA384, SHA512, etc. are used in niche fields, but are not consistently used across multiple fields. (For example, one lab may use SHA256, while another may prefer SHA512. But they probably both support SHA1 and MD5.)

Other Digests

There are many other types of checksums. Some, like CRC-16 and CRC-32, are used for quickly checking consistency. However, these cyclic redundancy check (CRC) hashes are not unique and have frequent collisions. Detecting the same CRC-32 value on two files is not an indication that the files are the same. However, different CRC values do denote a difference in the files.
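How easily CRC values collide can be shown with a brute-force search. This sketch assumes the 16-bit CRC-CCITT variant provided by Python's binascii.crc_hqx as the CRC-16 example, and find_crc16_collision is a hypothetical helper; with only 65,536 possible values, a collision must appear within 65,537 distinct inputs.

```python
import binascii
import zlib


def find_crc16_collision():
    """Search 4-byte inputs until two of them share a 16-bit CRC."""
    seen = {}
    for n in range(2 ** 16 + 1):
        data = n.to_bytes(4, "big")
        crc = binascii.crc_hqx(data, 0)
        if crc in seen:
            return seen[crc], data, crc
        seen[crc] = data
    raise AssertionError("unreachable: more inputs than possible CRC values")


first, second, shared_crc = find_crc16_collision()
print(first.hex(), second.hex(), f"share CRC-16 {shared_crc:04x}")

# The reverse direction is reliable: different CRC values mean different data.
print(zlib.crc32(first) != zlib.crc32(second))
```

The search finishes almost instantly, in contrast to the years of computation needed for a SHA1 collision.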

Even among cryptographic hash functions, there is a wide variety of algorithms and hash sizes. For example, MD4 is a much weaker alternative to MD5, and most of the SHA family of algorithms, such as SHA2's SHA-384 and SHA3's SHA3-512, are uncommon outside of strong cryptographic systems.

FotoForensics Digest Information

For digital computer evidence, the most commonly recorded information consists of the picture's type, dimensions, file size, and either the MD5 or SHA1 checksum values. (Within FotoForensics, each file's ID consists of the SHA1 digest and file's size.)

The digests provided by FotoForensics include:

Property	Purpose
Filename	Due to privacy concerns, users only see the filenames that they uploaded.
Timestamp	The earliest timestamp set by the web server or uploading computer. Each timestamp is associated with a specific filename.
Type of Image	E.g., JPEG or PNG
Dimensions	Size of the picture in pixels.
Color Channels	Number of color channels. 1 is monochromatic or grayscale. 3 is typically YUV (JPEG) or RGB (non-JPEG). 4 usually indicates a transparency (alpha) channel.
Unique Colors	Number of unique colors in the picture. Photos typically have over 100,000 unique colors. Low-contrast photos may have 20,000 unique colors, but even a small picture can have thousands of unique colors. Some file formats, like GIF, are limited to a maximum of 256 colors. Generally, computer drawings have very few colors.
File Size	Number of total bytes in the file.
MD5	128-bit checksum
SHA1	160-bit checksum
SHA256	256-bit checksum
First Analyzed	The first time the picture was uploaded to this service for analysis. Multiple people may have uploaded the same file. For evidence handling, this specifies that the file is not newer than this time stamp.

This information is enough for an analyst to verify that they are examining the correct file. It can also be used to ensure that a file was not altered by the upload or storage process.