Tutorial: Similar Picture Search
Pictures often circulate around the Internet. One person will see a picture online and cross-post it to another forum, where someone else will take a copy and distribute it through other online services.
Each time someone submits a picture to another online service, information may be lost and the image may be modified. Additional changes to the picture, such as scaling, cropping, or color adjustments, further alter the image. All of these changes impact the overall quality of the picture. Metadata and image artifacts may identify a low quality picture that has been altered by multiple online services. However, multiple resaves and edits may make it impossible to determine if a person was pasted into a picture.
Forensic investigators may have a low quality picture, or an image without any context. Online image search services permit analysts to find variations of the same pictures. The search results may identify high quality versions, distribution patterns, and circumstances that can provide context to the picture.
About Perceptual Searches
Most picture search engines take a text description as input and return a series of pictures that match the text description. If the context around the picture is known, then a wide range of online text-to-image search services are available. Unfortunately, an analyst may spend hours doing keyword searches and manually reviewing results, and may still not find the desired image.
The alternative to text-to-image search is to use an image-to-image (reverse image) search engine. This approach uses algorithms to identify similar pictures.
There are different approaches for finding visually similar images. The most direct approach is to use a cryptographic hash function, such as MD5 or SHA1. The same hash value identifies the exact same picture. However, a single byte change will result in significantly different hash values. Even though pictures may visually look the same, different bytes yield different hashes and are therefore different pictures.
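This sensitivity is easy to demonstrate. The following sketch uses Python's standard `hashlib` on two synthetic byte strings that differ by a single byte (the data stands in for an image file; it is not real image content):

```python
import hashlib

# Two byte strings standing in for the "same" picture, differing by one byte
# (e.g., a single metadata byte changed during a resave). Data is synthetic.
original = b"\x89PNG fake image data for illustration"
altered = b"\x89PNG fake image data for illustratioN"  # only the last byte differs

h1 = hashlib.md5(original).hexdigest()
h2 = hashlib.md5(altered).hexdigest()

print(h1)
print(h2)
print(h1 == h2)  # False: one changed byte yields a completely different hash
```

Because the two hashes share no meaningful relationship, a cryptographic hash can only confirm an exact byte-for-byte match; it cannot measure visual similarity.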
Digital picture analysis typically relies on perceptual hashes for reverse image searches. A perceptual hash is an algorithm that will yield similar hash values for visually-similar pictures. There are a wide variety of perceptual hash functions. Different algorithms may focus on colors, edges, corners, 'blobs', or frequency patterns. In general, these types of algorithms can match similar pictures, even if there are significant size, quality, and coloring differences. They may also match pictures with minor differences in content, cropping, and rotation.
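To make this concrete, here is a minimal sketch of one simple perceptual hash, the average hash (aHash), applied to tiny synthetic 4x4 grayscale grids. Real implementations first resize a full image down to such a grid; the grids and pixel values below are invented for illustration:

```python
# Minimal average-hash (aHash) sketch on tiny 4x4 grayscale "images".
# Real implementations resize a full image to a small grid first; the
# pixel values here are synthetic data for illustration.

def average_hash(pixels):
    """Return a bit string: 1 where a pixel is above the mean, else 0."""
    mean = sum(pixels) / len(pixels)
    return "".join("1" if p > mean else "0" for p in pixels)

def hamming_distance(a, b):
    """Count differing bits; a small distance suggests similar pictures."""
    return sum(x != y for x, y in zip(a, b))

image = [200, 210, 30, 40, 190, 220, 25, 35, 205, 215, 20, 45, 195, 225, 15, 50]
# The same scene after a slight brightness change (every pixel +10).
brighter = [p + 10 for p in image]
# An unrelated image.
other = [10, 240, 15, 230, 250, 5, 245, 20, 30, 220, 25, 210, 240, 10, 235, 15]

h1, h2, h3 = average_hash(image), average_hash(brighter), average_hash(other)
print(hamming_distance(h1, h2))  # 0: a brightness shift does not change the hash
print(hamming_distance(h1, h3))  # 8: half the bits differ for unrelated content
```

Unlike a cryptographic hash, the brightness-adjusted variant produces the exact same hash, while the unrelated image sits far away in Hamming distance. Production algorithms (pHash, dHash, wavelet hashes) are more robust, but the principle is the same.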
While there are many different perceptual hash algorithms, only a few perceptual hash search engines are publicly available. These services allow users to upload pictures as the search criteria. Rather than using words to find pictures, they permit the use of a picture to find similar pictures. The results include web pages that host variations of the picture. A few examples of perceptual search engines include:
- TinEye: This search engine is exceptional at finding partial matches. Although it may not know immediately about new pictures, it will usually identify widely circulated pictures, as well as images from news outlets.
- Google Image Search: This search engine has indexed most of the pictures found by Google, including late-breaking pictures that are only a few hours old. It is exceptional at identifying textual content related to pictures. However, Google Image Search is not strong at identifying significant variations from cropping, splicing, or editing.
- Bing: Microsoft's Bing includes a search engine that matches based on similar shapes. However, it does not have a large corpus of indexed images and it usually does not find variations of the same picture. If the picture has a large rectangular region in the middle, then it will usually find other pictures with large rectangular regions in the middle. This is useful for finding visually similar images without finding variations of the exact same picture.
- Karma Decay: This specialized search engine matches against all pictures that have appeared on the Reddit social network. This is useful for identifying topics such as memes and controversial current events.
Each of these search engines serves a different purpose. For example, Bing is good for finding a variety of pictures with generally similar shapes, identifying known people, and extracting text. TinEye is exceptional at finding variations of the same picture. Google may identify what is in the picture, and Karma Decay helps determine what social media is saying about the picture.
These are not the only reverse image search engines. Some services (not listed here) have very narrow focuses, such as only searching for anime images. Others, like Yandex and Baidu, have been observed serving malicious JavaScript, tracking users aggressively, and changing the type of data they collect based on where you are located. (We do not recommend using any search engine that has been observed providing hostile code to users.)
Identifying Quality
Pictures passed from blog to blog and across online forums are typically resaved. Each resave reduces the quality of the image. The best analysis results will come from the highest quality picture. If the original picture is not available, then a near-original (one or two saves from the original) is likely a good option.
Assuming that the image is available online, how can you find an original (or near-original) picture? The answer typically requires finding the highest quality picture.
Perceptual search services permit analysts to sort results by size or similarity. Unfortunately, none of these search engines sort results by quality. When searching for visually-similar images, there are a couple of attributes that can help identify higher quality pictures. Some attributes are easily identifiable from the search results, while others may require additional analysis. Although these guidelines are not always true, such heuristics are typically good enough to identify a higher quality image.
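As a hedged sketch of this kind of triage, the snippet below ranks hypothetical search results by two commonly used quality proxies, pixel dimensions and file size. The result records, URLs, and field names are all invented for illustration; real search engines expose different fields:

```python
# Hedged sketch: rank search results by two common quality proxies,
# pixel dimensions and file size. All records below are hypothetical.

results = [
    {"url": "a.example/pic.jpg", "width": 640, "height": 480, "bytes": 55_000},
    {"url": "b.example/pic.jpg", "width": 1920, "height": 1080, "bytes": 410_000},
    {"url": "c.example/pic.jpg", "width": 1920, "height": 1080, "bytes": 150_000},
]

# Prefer more pixels first; among equal dimensions, prefer the larger file,
# since more bytes per pixel usually means fewer lossy resaves.
ranked = sorted(results,
                key=lambda r: (r["width"] * r["height"], r["bytes"]),
                reverse=True)
print(ranked[0]["url"])  # → b.example/pic.jpg
```

Automated ranking like this only narrows the candidate list; the top result still needs manual review, since a large file can also be an upscaled or edited variant.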
Identifying Context
When searching for textual content related to a picture, it is valuable to identify the picture's quality. A higher quality picture is usually associated with more authoritative text.
In general, try to identify a time period when the picture first appeared and was discussed. With viral pictures, there may be multiple clusters of discussion. These clusters appear each time a different online group discovers the picture (if it is new to them, they will discuss it as a cluster, even if it is an old picture). For example, Karma Decay may identify multiple threads at Reddit that discuss a picture. Each thread may denote a different time period where someone discovered the image.
Attributing context to an image varies based on the perceptual search engine. Google Image Search will attempt to associate common text to similar pictures; Google may immediately identify the content or context associated with the picture. Karma Decay will identify discussion threads at Reddit that typically provide context. In contrast, TinEye only identifies web sites that host the picture. You may need to visit multiple web sites in order to identify the context.
Similar pictures may also identify distribution patterns. For example, if variations of a picture are widely found on social networking sites, then it is likely a widely discussed topic. If variants are only found on Thai web sites, then the person who generated the variant that you are evaluating may be able to read Thai or may have ties to Thailand.
When identifying textual context, be wary of hoaxes and conspiracy theories. An established hoax/conspiracy typically results in contradicting textual descriptions. One description supports the concept, a different description debunks the issue, and a third may provide the initial story. In these cases, the amount of text and age of the text is typically independent of the ground truth. (There may be more articles around a hoax, but that does not mean it is real.) Do not assume that the initial story, or the most repeated explanation, is accurate. With hoaxes and conspiracies, look for cited sources and identifiable experts; unspecified sources and anonymous experts who are only identified by online handles are unlikely to be authoritative sources.
Search Limitations
Similar image searches are valuable for identifying context, variants, distribution, and information related to a picture. However, search results may not always yield authoritative information. In particular:
- Not every picture is distributed online or indexed by search engines. You may not find the picture you are looking for.
- Viral pictures that are distributed through social media (e.g., Facebook or Twitter) may result in hundreds of variants -- pictures that look similar but have different MD5/SHA1 hash values. These variants result in search noise that may obscure the initial source. Viral pictures may not have an identifiable origin.
- The camera-original source may not be online. You may not be able to identify the initial source for a picture, a high quality variant, or even an authoritative source.
- For composite pictures, the individual components may not be identifiable.
- Perceptual searches return similar pictures. Similar does not mean identical. A recreation of a photo should look similar to the original. People with similar physical qualities will likely look similar, and pictures of people in similar poses will look similar. Most search engines can sort results by the degree of similarity. Do not be surprised if only a few pictures look similar to you before diverging into visually dissimilar pictures. If the algorithm focuses on color, then do not be surprised if pictures have similar colors but completely different content.
- "Visually similar" is not the same as object identification. If you copy a picture and apply minor alterations (e.g., minor edit, crop, scale, or recolor), then it can still be visually similar to the source image. However, if you take two pictures of an object from different angles (e.g., two pictures of a chair taken from different angles) are unlikely to match each other -- even though it is the same object. This is because a human can recognize the object, while the computer will recognize that they are not two versions of the same picture.
Caveats
Visually similar searches can be useful, but also have specific caveats. These include:
- Similar image search is only one evaluation approach. The interpretation of the results may be inconclusive. It is important to validate findings with other analysis techniques and algorithms.