Two tech news stories recently caught my eye that, when taken together, suggest that the future of mobile discovery lies with the visual search engine.
On November 6th, 2013, América Móvil, the Latin American telco giant headed by billionaire Carlos Slim, led a massive $60 million investment round in mobile image and video sharing platform Mobli. Though eye-popping, the size of the investment was not nearly as interesting as its intent. As reported in TechCrunch, Mobli wants to use the cash to launch and expand its visual search engine, enabling its users to “see the world through other people’s eyes.”
Fast forward exactly two months to January 6th, 2014, when Pinterest announced the acquisition of startup VisualGraph, which creates machine vision, image recognition, and visual search technologies.
"On Pinterest, millions of people are curating and sharing billions of Pins everyday. And these Pins are more than just images - they link to contents that can inspire and enrich people’s lives. We are excited for the opportunity to combine machine vision with human vision and curation, and to build a visual discovery experience that is both aesthetically appealing and immensely useful for people everywhere."
Ditto from the gang at Pinterest: "The acquisition of VisualGraph will help us build technology to better understand what people are Pinning. By doing so, we hope to make it easier for people to find the things they love."
Sensing an important story, I felt compelled to dig a little deeper into the idea of the visual search engine. After taking some time to research the topic, I am now convinced that the user shift to mobile will usher in a seismic shift to visual search. Here is a (brief) summary of my findings.
To the extent that there is a traditional approach to any form of image search, I suppose concept-based image indexing would be it. Also known as “description-based” or “text-based” image retrieval, this type of search indexes and retrieves images by the text attached to them: metadata such as keywords, subject headings, tags, captions, or natural language text. For years now, SEOs and digital marketers have been optimizing that metadata so that search engines like Google can understand and properly index images (and the written content often associated with them).
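To make the text-based approach concrete, here is a minimal sketch of a keyword index built from image metadata. It is only an illustration under my own assumptions: the catalogue, the image ids, and the function names (`build_index`, `search`) are invented, and real search engines do far more than this. Note that the engine never looks at a single pixel.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(images):
    """Map each keyword to the set of image ids whose metadata mentions it."""
    index = defaultdict(set)
    for image_id, metadata in images.items():
        for field in ("tags", "caption"):
            for token in tokenize(metadata.get(field, "")):
                index[token].add(image_id)
    return index


def search(index, query):
    """Return the ids of images whose metadata contains every query keyword."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical catalogue: ids, tags, and captions are invented for illustration.
IMAGES = {
    "img1": {"tags": "jaguar cat wildlife", "caption": "A jaguar resting in the rainforest"},
    "img2": {"tags": "jaguar car coupe", "caption": "A vintage Jaguar on a coastal road"},
    "img3": {"tags": "beach sunset", "caption": "Sunset over the water"},
}
INDEX = build_index(IMAGES)

print(sorted(search(INDEX, "jaguar")))      # matches both the animal and the car
print(sorted(search(INDEX, "jaguar car")))  # an extra keyword narrows the match
```

Because the index only sees words, a search for "jaguar" returns both the big cat and the automobile. The quality of the results depends entirely on how carefully someone wrote the tags and captions.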
With content-based image retrieval (CBIR), by contrast, search engines analyze the visual content of the image (its pixels) rather than the metadata. In this sense, "content" may refer to colors, shapes, textures, or any other information that can be derived from the image itself.
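As a rough sketch of what "analyzing the pixels" can mean, the example below compares images purely by their color distribution, one of the simplest content features. Everything here is an assumption made for illustration: the toy "images" are just lists of RGB values, the function names are mine, and production systems combine many richer features such as shape and texture.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Count pixels per coarse RGB bucket, normalized so the counts sum to 1."""
    if not pixels:
        raise ValueError("an image needs at least one pixel")
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        bucket = ((r // step) * bins_per_channel ** 2
                  + (g // step) * bins_per_channel
                  + (b // step))
        hist[bucket] += 1
    return [count / len(pixels) for count in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_by_color(query_pixels, library):
    """Sort library image names from most to least similar to the query."""
    query_hist = color_histogram(query_pixels)
    scores = {
        name: histogram_intersection(query_hist, color_histogram(pixels))
        for name, pixels in library.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy "images": flat lists of (R, G, B) pixels, invented for illustration.
SUNSET = [(250, 120, 30)] * 60 + [(200, 60, 20)] * 40
BEACH_AT_DUSK = [(240, 110, 40)] * 50 + [(30, 60, 200)] * 50
FOREST = [(20, 120, 40)] * 80 + [(60, 40, 20)] * 20
LIBRARY = {"beach_at_dusk": BEACH_AT_DUSK, "forest": FOREST}

print(rank_by_color(SUNSET, LIBRARY))  # orange-heavy beach ranks above green forest
```

No tags or captions are involved: the orange-heavy beach photo ranks above the forest simply because its colors overlap more with the query image.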
CBIR is gaining popularity because of the inefficiencies and limitations inherent in metadata-based image retrieval. Optimizing for text-based retrieval can be time-consuming and can create unintended ambiguities (especially when you factor in the use of synonyms or homonyms); however, until recently, many image retrieval systems, such as Google Image Search, were exclusively text based.
Reverse image search is a CBIR query technique that involves providing the search engine with a sample image to base its query on. It allows users to discover content related to a specific sample image, gauge the popularity of an image, and find manipulated versions and derivative works.
Different implementations of CBIR make use of different types of user queries. Examples include Google Image Search and TinEye.
According to the description of its Chrome extension, TinEye claims to be the first image search engine on the web to use image identification technology rather than metadata:
When you submit an image to be searched, TinEye creates a unique and compact digital signature or 'fingerprint' for it, then compares this fingerprint to every other image in our index to retrieve matches. TinEye does not typically find similar images; it finds exact matches including those that have been cropped, edited or resized. TinEye adds tens of millions of new images to its database every week.
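The description above does not say how the fingerprint is computed, so the following is only a stand-in to convey the general idea. It uses a simple "average hash", which is my choice for illustration and not necessarily anything TinEye does. The image is shrunk to a tiny grid, each cell becomes one bit depending on whether it is brighter than the average, and two fingerprints are compared by counting differing bits. The sample images and function names are hypothetical.

```python
def shrink(gray, size=4):
    """Downsample a grayscale image (list of rows) to size x size by block averaging.

    Assumes the height and width are multiples of `size`.
    """
    block_h = len(gray) // size
    block_w = len(gray[0]) // size
    small = []
    for by in range(size):
        row = []
        for bx in range(size):
            block = [gray[y][x]
                     for y in range(by * block_h, (by + 1) * block_h)
                     for x in range(bx * block_w, (bx + 1) * block_w)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def fingerprint(gray, size=4):
    """Return a bit string: '1' where a shrunken cell is brighter than the mean."""
    cells = [value for row in shrink(gray, size) for value in row]
    mean = sum(cells) / len(cells)
    return "".join("1" if value > mean else "0" for value in cells)


def hamming_distance(a, b):
    """Number of positions at which two equal-length fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def upscale(gray, factor):
    """Nearest-neighbour enlargement, used here only to fake a resized copy."""
    return [[v for v in row for _ in range(factor)]
            for row in gray for _ in range(factor)]


# Toy 8x8 grayscale images, invented for illustration.
ORIGINAL = [[x * 30 + y * 2 for x in range(8)] for y in range(8)]
RESIZED = upscale(ORIGINAL, 2)                        # same picture at 16x16
BRIGHTENED = [[v + 20 for v in row] for row in ORIGINAL]
DIFFERENT = [[y * 30 + x * 2 for x in range(8)] for y in range(8)]

base = fingerprint(ORIGINAL)
print(hamming_distance(base, fingerprint(RESIZED)))     # 0: resize leaves the hash intact
print(hamming_distance(base, fingerprint(BRIGHTENED)))  # 0: uniform brightening too
print(hamming_distance(base, fingerprint(DIFFERENT)))   # 8: a different picture
```

A hash this naive survives resizing and uniform brightness changes but would not cope with cropping, so a system that matches cropped copies must rely on something considerably more robust.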
Here’s how Google describes its CBIR search engine:
Instead of typing words, you can use a picture as your search to find related images from around the web. For example, if you search using a picture of your favorite band, you can find similar images, websites about the band, and even sites that include the same picture…When you search by image, your results may include:
Image results for images that are similar to yours
Web results for pages that include matching images
Other sizes of the image you searched for
I decided to take both CBIR search engines for a quick spin. Here are the results of the first thing that came into my head (don’t ask):
NB: For those interested, here is a list of publicly available Content-based image retrieval engines.
CBIR systems make use of relevance feedback as the user refines the CBIR results by clicking on images that best capture the intent of the search. By doing so, the user is providing semantic context to the search engine, helping it to better “understand” the exact meaning of a given search query so it can produce results more efficiently in the future. The ability of the search engine to learn by context is essential to improving its semantic retrieval capability.
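One classic way to express relevance feedback in code is the Rocchio update from information retrieval, sketched below. I am not claiming any engine named in this post works this way. The three-number feature vectors (roughly "animal-ness", "vehicle-ness", "outdoors-ness"), the sample values, and the weights `alpha`, `beta`, and `gamma` are all assumptions chosen for illustration.

```python
import math


def rocchio(query, clicked, skipped, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward clicked results and away from skipped ones."""
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    positive = centroid(clicked)
    negative = centroid(skipped)
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(query, positive, negative)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical feature vectors: [animal-ness, vehicle-ness, outdoors-ness].
QUERY = [0.5, 0.5, 0.2]               # an ambiguous "jaguar" search
CAT = [0.9, 0.1, 0.8]
CAR = [0.1, 0.9, 0.3]
CLICKED = [CAT, [0.8, 0.0, 0.6]]      # the user clicked on two cat photos
SKIPPED = [CAR]                       # and ignored the car

REFINED = rocchio(QUERY, CLICKED, SKIPPED)

print(cosine(QUERY, CAT) < cosine(QUERY, CAR))      # True: car ranks first at the start
print(cosine(REFINED, CAT) > cosine(REFINED, CAR))  # True: cat ranks first after feedback
```

With these made-up numbers, the original query sits slightly closer to the car than to the cat. After two clicks on cat photos, the refined query clearly favors the cat, which is the "learning by context" described above in miniature.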
The human brain’s ability to retain and subsequently retrieve information in a context-relevant manner is critical to cognition. It allows us to form a repository of semantic knowledge and to flexibly access information about concepts and objects in order to comprehend inputs and generate responses. In general, there are two ways to access semantic knowledge: automatic retrieval and controlled retrieval.
Automatic retrieval is a key aspect of higher-order thinking. It is non-conscious, involuntary and effortless. Controlled retrieval, on the other hand, is conscious, voluntary and intentional. It is also less efficient. Controlled retrieval requires conscious thought and analysis.
Let me give you an example. Take a second to step back and look at the computer, tablet or mobile device on which you are reading this post. Now imagine you had to communicate the meaning of this thing you are looking at to another person. If your goal was to communicate exactly what you are seeing down to the most minute detail - to really give the other person a clear understanding of what you are looking at - which would be faster, showing them a picture (image) of your device, or describing it verbally or in written form (anyone unsure of the answer to this should read a little Marcel Proust, a man known to take multiple pages to describe a slice of bread)?
Here’s my point. The efficiency of an automatic retrieval function is critical to higher level cognition and semantic learning; imagine how far we’d get in life if we had to constantly think about, or in a sense “re-learn” the meaning of everything around us.
However, thanks to technological advancements in machine vision and perception, speech recognition and language translation, computers are now able to mimic this kind of complex thinking. Nowhere was this more evident than in Google’s famous 2012 “cat” experiment, where Google scientists created one of the largest neural networks for machine learning by connecting 16,000 computer processors to see if it could recognize the concept of cat without any prompting; they never told the network, “this is a cat.” Regardless, the machine was able to essentially invent the concept of cat merely by viewing millions of pictures of cats until its understanding of “cat” became automatic.
Why is this significant? These types of semantic retrieval capabilities bring content-based image retrieval to a whole new level, paving the way for a more efficient and consumer-relevant form of search that is now being rolled out in Pinterest’s VisualGraph and Mobli’s visual search engine. As the world continues to go mobile and haptic, the greater convenience and efficiency of semantically intelligent visual search will become increasingly obvious, while the utility of text-based search becomes less so. Such technology may also hasten the mass adoption of mobile augmented reality apps (think Amazon’s new iOS app as an example).
Reflecting on the commercial implications of Pinterest’s VisualGraph, TechCrunch’s Josh Constine wrote: "By making Pinterest easier to navigate through visual search, people could use it more like a shopping site than an inspiration discovery time sink. And where there’s search, there’s room for relevant ads that hit people who already have purchase intent — which could be very lucrative for Pinterest."
For mobile users and the businesses trying to reach them, this statement also serves as a compelling argument in favor of visual search.