Toward fully automated speech recognition

Speech recognition systems, such as those that convert speech to text on cellphones, are generally the result of machine learning. A computer pores through thousands or even millions of audio files and their transcriptions, and learns which acoustic features correspond to which typed words.

But transcribing recordings is costly, time-consuming work, which has limited speech recognition to a small subset of languages spoken in wealthy nations.

At the Neural Information Processing Systems conference this week, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) are presenting a new approach to training speech-recognition systems that doesn’t depend on transcription. Instead, their system analyzes correspondences between images and spoken descriptions of those images, as captured in a large collection of audio recordings. The system then learns which acoustic features of the recordings correlate with which image characteristics.

“The goal of this work is to try to get the machine to learn language more like the way humans do,” says Jim Glass, a senior research scientist at CSAIL and a co-author on the paper describing the new system. “The current methods that people use to train up speech recognizers are very supervised. You get an utterance, and you’re told what’s said. And you do this for a large body of data.

“Big advances have been made — Siri, Google — but it’s expensive to get those annotations, and people have thus focused on, really, the major languages of the world. There are 7,000 languages, and I think less than 2 percent have ASR [automatic speech recognition] capability, and probably nothing is going to be done to address the others. So if you’re trying to think about how technology can be beneficial for society at large, it’s interesting to think about what we need to do to change the current situation. And the approach we’ve been taking through the years is looking at what we can learn with less supervision.”

Joining Glass on the paper are first author David Harwath, a graduate student in electrical engineering and computer science (EECS) at MIT; and Antonio Torralba, an EECS professor.

Visual semantics

The version of the system reported in the new paper doesn’t correlate recorded speech with written text; instead, it correlates speech with groups of thematically related images. But that correlation could serve as the basis for others, such as one between speech and text.

If, for instance, an utterance is associated with a particular class of images, and the images have text terms associated with them, it should be possible to find a likely transcription of the utterance, all without human intervention. Similarly, a class of images with associated text terms in different languages could provide a way to do automatic translation.
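
As a rough illustration of that idea, the sketch below lets each image class "vote" for its text terms, weighted by how strongly an utterance is associated with the class. This is not the researchers' method: the function name `guess_terms`, the class labels, the scores, and the term lists are all hypothetical, chosen only to make the reasoning concrete.

```python
from collections import Counter


def guess_terms(utterance_class_scores, class_terms, top_n=3):
    """Return the text terms most strongly linked to an utterance.

    utterance_class_scores: hypothetical mapping from image class to how
        strongly the utterance is associated with that class.
    class_terms: hypothetical mapping from image class to its text terms.
    """
    votes = Counter()
    for image_class, score in utterance_class_scores.items():
        # Every term attached to the class inherits the class's score.
        for term in class_terms.get(image_class, []):
            votes[term] += score
    return [term for term, _ in votes.most_common(top_n)]


# Made-up association scores for one utterance (assumption, not real data).
scores = {"beach": 0.7, "storm": 0.2, "kitchen": 0.1}
terms = {
    "beach": ["sand", "ocean"],
    "storm": ["clouds", "ocean"],
    "kitchen": ["stove"],
}
print(guess_terms(scores, terms, top_n=2))
```

Swapping in term lists from another language for the same image classes would, under the same assumptions, give a crude form of the translation scenario described above.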

Conversely, text terms associated with similar clusters of images, such as “storm” and “clouds,” could be inferred to have related meanings. Because the system in some sense learns words’ meanings (the images associated with them) and not just their sounds, it has a wider range of potential applications than a standard speech recognition system.

To test their system, the researchers used a database of 1,000 images, each of which had a recording of a free-form verbal description associated with it. They would feed their system one of the recordings and ask it to retrieve the 10 images that best matched it. That set of 10 images would contain the correct one 31 percent of the time.
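
The test described above is commonly summarized as "recall at 10": the fraction of recordings whose correct image appears among the 10 best matches. The sketch below shows how such a figure could be computed, assuming a table of match scores in which row i belongs to recording i and the correct image for that recording is image i. The function name `recall_at_k` and the toy scores are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]], k: int = 10) -> float:
    """Fraction of recordings whose correct image is among the top k matches.

    scores[i][j] is an assumed match score between recording i and image j;
    the correct image for recording i is assumed to be image i.
    """
    hits = 0
    for i, row in enumerate(scores):
        # Rank image indices from best to worst match for recording i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy 4-recording, 4-image example with made-up scores, evaluated at k=2.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # top 2: images 0 and 2, correct image 0 is found
    [0.8, 0.2, 0.7, 0.1],  # top 2: images 0 and 2, correct image 1 is missed
    [0.1, 0.4, 0.6, 0.5],  # top 2: images 2 and 3, correct image 2 is found
    [0.3, 0.9, 0.2, 0.5],  # top 2: images 1 and 3, correct image 3 is found
]
print(recall_at_k(toy_scores, k=2))  # 3 of 4 recordings hit, so 0.75
```

For scale, picking 10 images at random out of 1,000 would include the correct one only 1 percent of the time, which puts the reported 31 percent in context.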