September 19, 2008
Searching the Invisible - Advances in Video and Audio Search

The Iceberg’s Tip
Since their inception, search engines have relied on the visible to locate relevant content. Visible text, to be precise. And not just any visible text, either - the text had to be accessible and readable to a web crawler. That is, it couldn’t be inside an animation, image, script, video, or a wide assortment of other file formats. It couldn’t be stored in a ‘deep web’ database such as those run by the CDC or USGS, one that was reached only via an active user query. And it certainly couldn’t be spoken.
This meant that, for all the mind-blowing number of visible pages out there on the web, all this time only a tiny fraction of the available content has been indexable and searchable by search engines. In 2001 the company BrightPlanet estimated in their white paper “The Deep Web: Surfacing Hidden Value” that public search engines made only 0.03% of the total web content available to searchers. That’s tiny. And this estimate wasn’t even considering content hidden in images, audio files, and video. Like a giant iceberg of data and content, the majority of the web remained - and still remains - invisible to search. But this is all changing.
Making sense out of sound
Google’s beta release of Gaudi, its audio indexing tool, heralds to the wider public a profound shift in the search environment. Why? Hasn’t audio search been around for years now? Actually, no. Not this way. Yahoo’s audio search tool and video search tool have been around for quite a while, yes, along with Google’s video search tool. But these tools merely help you locate audio and video files - they don’t index the meaningful content within those files. Words spoken inside these files can’t be indexed. Neither can faces, locations, or other visual content information. Which means that, in order to attach real meaning and searchability to the files, they have to be tagged with identifying information or else have an accompanying summary. Text, in other words.
Instead of relying on tags, meta data, or transcripts, Gaudi is capable of indexing the actual spoken content within audio and video files. If a speaker says the word “orbit” in a video, a search for the word “orbit” using Gaudi will be able to locate that video file, even if it has no tags, meta data, or nearby text containing that word. Why is this important? Aren’t tags, meta data, and transcripts good enough?
A Question of Quality
It shouldn’t surprise anyone to learn that most images, videos, and audio files on the web have insufficient or inaccurate information associated with them. Here are some of the reasons why:
- It takes time, effort, money, and planning to add ALT properties to images
- Same goes for adding meta data to videos
- Attempts to leverage “crowd intelligence” to tag these files often result in inaccuracies or less-specific information (e.g. “dog” instead of “poodle”)
- Tagging and meta data can only summarize content
- Many sites don’t have room for - or don’t want to display - complete transcripts or even text summaries of video and audio files
So, to summarize, we humans are fairly messy, lazy, and careless when it comes to identifying things - or, at minimum, underfunded - and that’s not going to change. Much better to develop an automated process to make sense of it all for us, calculate relevance, and serve it up in a familiar search results environment.
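To make that idea a little more concrete, here is a minimal sketch in Python of what indexing spoken content might look like once speech has already been turned into timestamped words. It is not how Gaudi works internally; the file names, the sample transcripts, and the functions build_index and search are all made up for illustration, and "relevance" here is nothing more than a count of how often the word is spoken.

```python
from collections import defaultdict

# Hypothetical speech-recognizer output: (word, start time in seconds) per file.
TRANSCRIPTS = {
    "launch.mp4": [("the", 0.0), ("satellite", 0.4), ("reached", 1.1), ("orbit", 1.6)],
    "dogs.mp3": [("my", 0.0), ("poodle", 0.3), ("barks", 0.9)],
    "physics.mp4": [("orbit", 2.0), ("and", 2.5), ("orbit", 7.2)],
}

def build_index(transcripts):
    """Map each spoken word to the files and timestamps where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for filename, words in transcripts.items():
        for word, start in words:
            index[word.lower()][filename].append(start)
    return index

def search(index, term):
    """Return (filename, timestamps) pairs, files with the most mentions first."""
    hits = index.get(term.lower(), {})
    # Crude relevance: more mentions ranks higher; ties are broken by file name.
    return sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))

index = build_index(TRANSCRIPTS)
results = search(index, "orbit")
for filename, times in results:
    print(filename, times)
```

Notice that neither file needs a tag or a summary containing "orbit" to be found, and that the timestamps would let a player jump straight to the moment the word is spoken.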
But Can You See Me? I Mean, Really See Me?
Behind the scenes, Google, Yahoo, and a wide range of other organizations have been hard at work figuring out how to crack the visual, as well as audio, indexing challenge. Google foreshadowed their visual video indexing capability in a blog post on June 14, 2007, saying,
“The technology extracts key visual aspects of uploaded videos and compares that information against reference material provided by copyright holders.”
The main purpose of this effort was to identify copyright violations on the YouTube platform, and it followed on the heels of Google implementing audio fingerprinting technology on their AudioSwap platform. Once developed, however, the technology has far-reaching capabilities and implications for extending visual search capabilities.
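Google has not published the details of that comparison, so the following is only a rough illustration of the general idea of matching an upload against reference material. It swaps in a very simple, well-known technique (an "average hash" of a frame, compared by counting differing bits) and uses tiny made-up grayscale grids in place of real video frames. The names average_hash, hamming_distance, best_match, and the max_distance threshold are assumptions for this sketch only.

```python
def average_hash(pixels):
    """Reduce a grid of grayscale values (0-255) to bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)

def hamming_distance(a, b):
    """Count the positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def best_match(upload, references, max_distance=2):
    """Return the name of the closest reference frame, or None if nothing is close enough."""
    upload_hash = average_hash(upload)
    best_name, best_dist = None, max_distance + 1
    for name, frame in references.items():
        dist = hamming_distance(upload_hash, average_hash(frame))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

# Made-up 3x3 "frames" standing in for reference material from copyright holders.
REFERENCES = {
    "studio_clip": [[10, 200, 10], [200, 10, 200], [10, 200, 10]],
    "news_clip": [[250, 250, 250], [20, 20, 20], [250, 250, 250]],
}

# A re-encoded copy of studio_clip: brightness values shifted, same overall pattern.
upload = [[25, 180, 30], [190, 20, 210], [15, 185, 5]]
# A frame that resembles neither reference.
unrelated = [[0, 0, 0], [0, 0, 255], [255, 255, 255]]

print(best_match(upload, REFERENCES))
print(best_match(unrelated, REFERENCES))
```

The point of hashing rather than comparing pixels directly is that a copy which has been re-compressed or slightly brightened still lands on (or near) the same fingerprint, while an unrelated frame does not.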
One intriguing recent development on the visual search side of things - and there are many - is the creation of face-recognition software that works even with low-quality images and video clips, like those found all over YouTube and the web. Developed by researchers at Carnegie Mellon, this face-recognition system tackles several of the challenges of extracting information from image and video files: low file resolution, poor lighting, non-controlled subject aspect (which way the subject’s face is turned), and poor overall image quality. Combined with other visual recognition software and image annotation methods, this technology will likely lead to widespread automated indexing of at least a portion of that other 99.97% of the web.
OK, We’ve Got the Kitchen Sink - Now What Do We Do With It?
But information is one thing - making it meaningful is another. With access to more data from more sources than ever, the question is what good will it do us? How do we tie together related information that’s stored in a variety of formats, locations, and languages? How do we not only locate and index this data but correlate it in useful and intelligent ways? Well, now we’re talking about the semantic web, called by some - especially those like myself who’re really tired of talking about Web 2.0 - “Web 3.0”. And that, my friends, is the topic for a whole other blog post.









