Pictures often circulate on the Internet. One person will see a picture online and cross-post it to another forum, where someone else will take a copy and distribute it through other online services.
Each time someone submits a picture to another online service, information may be lost and the image may be modified. Additional changes to the picture, such as scaling, cropping, or color adjustments, further alter the image. All of these changes impact the overall quality of the picture. Metadata and image artifacts may identify a low quality picture that has been altered by multiple online services. However, multiple resaves and edits may make it impossible to determine if a person was pasted into a picture.
Forensic investigators may have a low quality picture, or an image without any context. Online image search services permit analysts to find variations of the same pictures. The search results may identify high quality versions, distribution patterns, and circumstances that can provide context to the picture.
Most picture search engines take a text description as input and return a series of pictures that match the text description. If the context around the picture is known, then a wide range of online text-to-image search services are available. Unfortunately, an analyst may spend hours doing keyword searches and manually reviewing results, and may still not find the desired image.
The alternative to text-to-image search is to use an image-to-image (reverse image) search engine. This approach uses algorithms to identify similar pictures.
There are different approaches for finding visually similar images. The most direct approach is to use a cryptographic hash function, such as MD5 or SHA1. Two files with the same hash value are the exact same picture, byte for byte. However, a single byte change will result in a significantly different hash value. Even when two pictures look the same, different bytes yield different hashes, so a cryptographic hash treats them as different pictures.
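The byte-level sensitivity of cryptographic hashes can be illustrated with a short sketch. The byte strings below are stand-ins rather than real image files, and the helper name file_digests is only an illustration; it uses Python's standard hashlib module.

```python
import hashlib

def file_digests(data: bytes) -> dict:
    """Return the MD5 and SHA-1 hex digests of a byte string."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }

# Stand-in bytes for illustration; a real check would read the picture file.
original = b"\xff\xd8\xff\xe0" + b"example image payload" * 100

# Flip a single bit in the last byte to simulate a tiny change to the file.
changed = bytearray(original)
changed[-1] ^= 0x01
modified = bytes(changed)

digest_a = file_digests(original)
digest_b = file_digests(modified)
same_copy = file_digests(bytes(original))  # an exact copy of the same bytes

print("original:", digest_a["sha1"])
print("modified:", digest_b["sha1"])
print("exact copy matches:", digest_a == same_copy)
```

An exact copy produces the same digests, while the one-bit change produces unrelated digests. This is why cryptographic hashes can only confirm identical files and cannot find resaved or resized variants.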
Digital picture analysis typically relies on perceptual hashes for reverse image searches. A perceptual hash is an algorithm that yields similar hash values for visually similar pictures. There is a wide variety of perceptual hash functions. Different algorithms may focus on colors, edges, corners, 'blobs', or frequency patterns. In general, these types of algorithms can match similar pictures, even if there are significant size, quality, and coloring differences. They may also match pictures with minor differences in content, cropping, and rotation.
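As a rough illustration of the idea, the sketch below implements an "average hash", one of the simplest perceptual hashes. It is not the algorithm used by any particular search engine, and it is far less robust than production algorithms. To stay self-contained it works on a 2D list of grayscale values and uses synthetic gradient pictures; the function names are hypothetical.

```python
def average_hash(pixels, hash_size=8):
    """Compute a simple average hash from a 2D list of grayscale values (0-255).

    The picture is reduced to hash_size x hash_size cells by averaging blocks.
    Each cell becomes one bit: 1 if it is brighter than the mean cell, else 0.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(hash_size):
        y0 = row * height // hash_size
        y1 = max((row + 1) * height // hash_size, y0 + 1)
        for col in range(hash_size):
            x0 = col * width // hash_size
            x1 = max((col + 1) * width // hash_size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits

def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hashes (lower means more similar)."""
    return bin(hash_a ^ hash_b).count("1")

def horizontal_gradient(width, height):
    """Synthetic test picture: dark on the left, bright on the right."""
    return [[(x * 255) // (width - 1) for x in range(width)] for _ in range(height)]

def vertical_gradient(width, height):
    """Synthetic test picture: dark at the top, bright at the bottom."""
    return [[(y * 255) // (height - 1)] * width for y in range(height)]

original = horizontal_gradient(64, 64)
smaller = horizontal_gradient(32, 32)                          # scaled variant
brighter = [[min(255, v + 20) for v in row] for row in original]  # recolored variant
different = vertical_gradient(64, 64)                          # unrelated picture

h_original = average_hash(original)
print("scaled:   ", hamming_distance(h_original, average_hash(smaller)))
print("brighter: ", hamming_distance(h_original, average_hash(brighter)))
print("different:", hamming_distance(h_original, average_hash(different)))
```

The scaled and brightened variants stay close to the original hash, while the unrelated picture differs in many bits. A search service would compare hashes by distance rather than by exact equality.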
While there are many different perceptual hash algorithms, only a few perceptual hash search engines are publicly available. These services allow users to upload pictures as the search criteria. Rather than using words to find pictures, they permit the use of a picture to find similar pictures. The results include web pages that host variations of the picture. A few examples of perceptual search engines include:
TinEye: This search engine is exceptional at finding partial matches. Although it may not know immediately about new pictures, it will usually identify widely circulated pictures, as well as images from news outlets.
Google Image Search: This search engine has indexed most of the pictures found by Google, including late-breaking pictures that are only a few hours old. It is exceptional at identifying textual content related to pictures. However, Google Image Search is not strong at identifying significant variations from cropping, splicing, or editing.
Bing: Microsoft's Bing includes a search engine that matches based on similar shapes. However, it does not have a large corpus of indexed images and it usually does not find variations of the same picture. If the picture has a large rectangular region in the middle, then it will usually find other pictures with large rectangular regions in the middle. This is useful for finding visually similar images without finding variations of the exact same picture.
RootAbout: Hacker Factor (the same company that provides FotoForensics) has teamed up with the Internet Archive (archive.org) in order to provide RootAbout: a search-by-image capability for the millions of items collected at the Internet Archive. The Internet Archive is a non-profit library, hosting millions of free books, movies, software, music, and more.
Karma Decay: This specialized search engine matches against all pictures that have appeared on the Reddit social network. This is useful for identifying topics such as memes and controversial current events.
Each of these search engines serves a different purpose. For example, Bing is good for finding a variety of pictures with generally similar shapes. TinEye is exceptional at finding variations of the same picture. Google may identify what is in the picture, and Karma Decay helps determine what social media is saying about the picture.
Pictures passed from blog to blog and across online forums are typically resaved. Each resave reduces the quality of the image. The best analysis results will come from the highest quality picture. If the original picture is not available, then a near-original (one or two saves from the original) is likely a good option.
Assuming that the image is available online, how can you find an original (or near-original) picture? The answer typically requires finding the highest quality picture.
Perceptual search services permit analysts to view results by size or similarity. Unfortunately, none of these search engines sort results by quality. When searching for visually similar images, there are several attributes that can help identify higher quality pictures. Some attributes are easily identifiable from the search results, while others may require additional analysis. These heuristics are not always true, but they are typically good enough to identify a higher quality image:
Dimensions. The largest picture is usually the highest quality image. This isn't always true since small pictures may be scaled larger, but most pictures are scaled smaller (and not larger) for the web. Significantly smaller pictures are unlikely to be at a higher quality.
File size. If two pictures have the same dimensions, then the one with the largest file size is likely a higher quality picture. This heuristic works because lower quality JPEG settings result in smaller files. However, PNG and BMP files are almost always larger than the corresponding JPEGs, so file size should only be compared between files of the same format.
Cropping. Look at the edges of the picture. It is easy to remove content from the edges of an image, but very difficult to add in content. A picture that has more content along the edges is usually a higher quality picture.
Padding. Look for borders around the image. It might be a thick frame, a single-pixel box, or a subtle drop-shadow. Pictures without borders are typically a higher quality than a similar picture with borders. The act of saving the picture (after adding borders) will lower the picture's quality.
Blur and noise. Lower quality JPEG images typically appear blurry. Attempts to compensate for a low quality blur usually include sharpening the picture, resulting in pixelated noise over the image. A picture that appears blurrier or noisier is likely at a lower quality. (While original pictures from digital cameras do contain sensor noise, this natural artifact is typically not noticeable to the human eye.)
Attribution. Look for logos, watermarks, and copyright text. These are usually found in the corners or along edges of the pictures. Pictures without attributions are usually a higher quality than a similar picture with attributions. Adding an attribution to a picture obscures content, and saving as a JPEG after annotating lowers the quality.
Metadata. Files with metadata are likely at a higher quality than files without metadata. At minimum, the metadata may identify how the picture was handled.
Source. Online services such as Facebook, Imgur, and Twitter do not create pictures; they only redistribute pictures. The redistributed file is typically modified and may be resaved at a low quality (as is the case with Facebook and Twitter). News outlets typically use pictures from professional photo sites, such as Reuters, the Associated Press, or Getty Images. In general, identify whether the hosting site is an authoritative source, personal web site, or social network.
Age. Online pictures are redistributed over time, and shared pictures are usually modified (cropped, resized, recolored, resaved, etc.). A copy that was posted long ago is effectively frozen in time, while a recently posted copy has had more time to be passed around and modified before it was posted. As a result, older pictures are usually higher quality images.
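Several of the heuristics above can be combined into a rough ordering of search results. The sketch below is only one possible weighting, not a rule used by any search engine. The Candidate fields, the example URLs, and the scoring are all assumptions made for illustration. File size is only used to break ties between files of the same format; the order between different formats is arbitrary here.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    url: str                       # hypothetical search result URL
    width: int
    height: int
    file_size: int                 # bytes
    fmt: str                       # e.g. "jpeg" or "png"
    has_metadata: bool = False
    has_border: bool = False
    has_attribution: bool = False  # logo, watermark, or copyright text

def quality_key(c: Candidate):
    """Sort key: larger dimensions first, then fewer signs of handling."""
    area = c.width * c.height
    clean = (not c.has_border) + (not c.has_attribution) + c.has_metadata
    # fmt comes before file_size so that sizes are only compared
    # between candidates of the same format.
    return (area, clean, c.fmt, c.file_size)

def rank_candidates(candidates):
    """Return candidates ordered from most to least likely high quality."""
    return sorted(candidates, key=quality_key, reverse=True)

# Made-up results for illustration only.
results = [
    Candidate("https://example.com/a.jpg", 800, 600, 90_000, "jpeg", has_attribution=True),
    Candidate("https://example.com/b.jpg", 1600, 1200, 410_000, "jpeg", has_metadata=True),
    Candidate("https://example.com/c.jpg", 1600, 1200, 250_000, "jpeg"),
    Candidate("https://example.com/d.jpg", 1600, 1200, 300_000, "jpeg", has_border=True),
]

ranked = rank_candidates(results)
for c in ranked:
    print(c.url, c.width, "x", c.height, c.file_size)
```

The ordering is only a starting point for manual review. Heuristics such as cropping, blur, source, and age still require looking at the picture and the hosting page.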
When searching for textual content related to a picture, it is valuable to identify the picture's quality. A higher quality picture is usually associated with more authoritative text.
In general, try to identify a time period when the picture first appeared and was discussed. With viral pictures, there may be multiple clusters of discussion. These clusters appear each time a different online group discovers the picture (if it is new to them, they will discuss it as a cluster, even if it is an old picture). For example, Karma Decay may identify multiple threads at Reddit that discuss a picture. Each thread may denote a different time period where someone discovered the image.
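If the posting dates of the discussion threads are known, the clusters can be separated by looking for long gaps between posts. The sketch below is a minimal example of that idea. The 30-day gap threshold and the sample dates are assumptions chosen for illustration, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

def cluster_by_gap(timestamps, max_gap=timedelta(days=30)):
    """Group timestamps into clusters separated by more than max_gap."""
    clusters = []
    for ts in sorted(timestamps):
        if clusters and ts - clusters[-1][-1] <= max_gap:
            clusters[-1].append(ts)   # close enough to the previous post
        else:
            clusters.append([ts])     # long silence: start a new cluster
    return clusters

# Made-up thread dates for illustration only.
thread_dates = [
    datetime(2015, 7, 22),
    datetime(2014, 3, 1),
    datetime(2014, 3, 10),
    datetime(2016, 1, 5),
    datetime(2014, 3, 3),
    datetime(2015, 7, 20),
]

clusters = cluster_by_gap(thread_dates)
for cluster in clusters:
    print(cluster[0].date(), "to", cluster[-1].date(), "-", len(cluster), "thread(s)")
```

The first date in the earliest cluster is a candidate for when the picture first appeared, although earlier copies may exist that were never indexed.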
Attributing context to an image varies based on the perceptual search engine. Google Image Search will attempt to associate common text to similar pictures; Google may immediately identify the content or context associated with the picture. Karma Decay will identify discussion threads at Reddit that typically provide context. In contrast, TinEye only identifies web sites that host the picture. You may need to visit multiple web sites in order to identify the context.
Similar pictures may also identify distribution patterns. For example, if variations of a picture are widely found on social networking sites, then it is likely a widely discussed topic. If variants are only found on Thai web sites, then the person who generated the variant that you are evaluating may be able to read Thai or may have ties to Thailand.
When identifying textual context, be wary of hoaxes and conspiracy theories. An established hoax or conspiracy theory typically results in contradicting textual descriptions. One description supports the concept, a different description debunks the issue, and a third may provide the initial story. In these cases, the amount of text and the age of the text are typically independent of the ground truth. (There may be more articles around a hoax, but that does not mean it is real.) Do not assume that the initial story, or the most repeated explanation, is accurate. With hoaxes and conspiracies, look for cited sources and identifiable experts; unspecified sources and anonymous experts who are only identified by online handles are unlikely to be authoritative sources.
Similar image searches are valuable for identifying context, variants, distribution, and information related to a picture. However, search results may not always yield authoritative information. In particular:
Not every picture is distributed online or indexed by search engines. You may not find the picture you are looking for.
Viral pictures that are distributed through social media (e.g., Facebook or Twitter) may result in hundreds of variants -- pictures that look similar but have different MD5/SHA1 hash values. These variants result in search noise that may obscure the initial source. Viral pictures may not have an identifiable origin.
The camera-original source may not be online. You may not be able to identify the initial source for a picture, a high quality variant, or even an authoritative source.
For composite pictures, the individual components may not be identifiable.
Perceptual searches return similar pictures. Similar does not mean identical. A recreation of a photo should look similar to the original. People with similar physical qualities will likely look similar, and pictures of people in similar poses will look similar. Most search engines can sort results by the degree of similarity. Do not be surprised if only a few pictures look similar to you before diverging into visually dissimilar pictures. If the algorithm focuses on color, then do not be surprised if pictures have similar colors but completely different content.
Similar image search is only one evaluation approach. The interpretation of the results may be inconclusive. It is important to validate findings with other analysis techniques and algorithms.
Using Similar Picture Search
At FotoForensics, each analysis page has a search button.
Clicking on the search button will display a list of available search engines. Selecting a search engine's name will upload the picture to that perceptual search service. Results from external services (TinEye, Google, etc.) are shown in a separate web browser window.
For TinEye, Google, and other image search services that are outside of FotoForensics, the picture is uploaded from your web browser to the online service. FotoForensics does not operate as a middleman for the request and offers you no anonymity. If the picture's contents are sensitive in nature or not legal for you to distribute, then do not click on the links to the similar search services. Clicking on the link is distribution, and it is directly attributable to your web browser.