Spoken Lecture Processing

Dates

September 2003 — December 2006

Principal Investigator

Dr. James Glass (Computer Science & Artificial Intelligence Laboratory)

Problem

We are entering an era in which it is easier than ever to record, disseminate, and browse vast amounts of audio-visual material. Conventional text-based retrieval engines cannot process these types of data, however, so searching through video can be very tedious. Manual annotation of these data is possible, but expensive. What is needed are automated processing methods that provide structure to help people navigate this growing data type.

Goal

The goal of this project is to enable educators and students to more effectively disseminate audio and video recordings of academic lecture material. To do this, we are developing technologies such as automatic speech recognition and language processing to help transcribe, annotate, structure, and even summarize audio-visual materials, so that people can search and explore these kinds of data more easily. Our particular focus has been on recorded lectures that are being made available via initiatives such as MIT OpenCourseWare and MITWorld, in order to improve their accessibility to students or anyone interested in learning from these educational materials.

Overview

Our research on this project has focused on generating accurate transcripts of spoken lectures and on their use in information retrieval applications. Accurate processing of recorded lectures poses many significant research challenges. First, the speaking style of lecturers in the classroom tends to be very spontaneous. Thus, the recorded speech contains many hesitations, mispronunciations, partial words, and other such artifacts that occur in natural human communication. Because the classroom speaking style is quite different from the kinds of speech that have been studied in the past, transcription error rates can be very high, in the worst cases approaching one error for every two words. The good news is that the errors tend to occur on common words. Thus, when searching for keywords, you can usually still find the relevant lecture segments. The other good news is that because there tends to be a lot of data from each lecturer, ranging from one hour to over 30 hours, speech recognizers can adapt their acoustic models to the lecturer's voice and improve their accuracy. In many cases, the performance can be improved to the point where the automatically generated transcript is fairly comprehensible.
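The "mistake every other word" figure above corresponds to a word error rate (WER) of roughly 50%. As a rough illustration of how such a figure is usually computed, the sketch below implements the standard edit-distance definition of WER. The function name and the example sentences are hypothetical and are not taken from the project's own tools or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: every error falls on a common word, the keyword survives.
reference = "the eigenvalue of this matrix is two"
hypothesis = "the eigenvalue of the matrix it too"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2f}")
```

In this made-up example three of the seven reference words are misrecognized, yet a search for "eigenvalue" or "matrix" would still find the segment, which is the effect described above.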

Another significant challenge to processing lectures is that they often contain very specialized words that do not commonly occur in everyday language (e.g., eigenvalue). In the context of a particular lecture, however, they can be crucial terms that users will want to search for. Thus, it is important to adapt recognizer vocabularies to any parallel text materials such as slides, class notes, or even textbooks. These materials can have a dramatic impact on the overall performance of the speech recognizer. However, there is a mismatch between written and spoken materials, so text-based data is only valuable up to a certain point. Ultimately, this information must be combined with more general models of spoken language trained on the lecture speaking style.
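The two ideas in this paragraph, adding specialized terms from parallel text and combining text-derived statistics with a model of spoken language, can be sketched in a few lines. This is a toy illustration only: it uses unigram counts and simple linear interpolation, and the function names, the `min_count` threshold, and the `weight_text` value are assumptions made for the example, not the project's actual method or settings.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def adapt_vocabulary(base_vocab, supplemental_text, min_count=2):
    """Add words that recur in the supplemental text but are missing from the vocabulary."""
    counts = Counter(tokenize(supplemental_text))
    new_words = {w for w, c in counts.items()
                 if c >= min_count and w not in base_vocab}
    return set(base_vocab) | new_words


def unigram_probs(tokens):
    """Maximum-likelihood unigram probabilities."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def interpolate(p_spoken, p_text, weight_text=0.3):
    """Linear interpolation: (1 - w) * P_spoken(word) + w * P_text(word)."""
    words = set(p_spoken) | set(p_text)
    return {w: (1 - weight_text) * p_spoken.get(w, 0.0)
               + weight_text * p_text.get(w, 0.0)
            for w in words}


# Hypothetical data: a small base vocabulary, slide text, and spoken-style text.
base_vocab = {"the", "of", "a", "is", "matrix", "so", "um"}
slides = "The eigenvalue of a matrix. Eigenvalue decomposition. Determinant of the matrix."
spoken = "so um the matrix is um so the the matrix"

vocab = adapt_vocabulary(base_vocab, slides)
mixed = interpolate(unigram_probs(tokenize(spoken)),
                    unigram_probs(tokenize(slides)))
print(sorted(vocab - base_vocab))
print(round(mixed["eigenvalue"], 4), round(mixed["um"], 4))
```

The interpolation weight controls how much the written material is trusted. Because written and spoken styles differ, a real system would tune this weight on held-out lecture transcripts rather than fix it by hand.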

The results of our research are being showcased in two different ways. We are developing a web-based prototype that allows users to browse MIT lectures and MITWorld seminars that have been processed automatically to produce an estimate of what was said in the lecture. Users can search for concepts that they are interested in, much as they would with a regular search engine. The results show the lectures that contain "hits" and the contexts in which the keywords were used. If any of them look relevant, the user can then play the video starting at the relevant point and see the aligned transcript scrolling along with the video. Although the transcript is not perfect, it can potentially be valuable for the hearing impaired.
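The browsing behavior described above depends on a transcript in which each word carries a timestamp, so that a keyword hit can be shown in context and used as a playback position. The following is a minimal sketch of that lookup, assuming a transcript stored as a list of timed words. The `Word` class, the `search` function, and the sample timings are hypothetical and do not describe the prototype's actual data format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording


def search(transcript, keyword, context=3):
    """Return (start_time, snippet) for every occurrence of keyword."""
    keyword = keyword.lower()
    hits = []
    for i, word in enumerate(transcript):
        if word.text.lower() == keyword:
            lo = max(0, i - context)
            hi = min(len(transcript), i + context + 1)
            snippet = " ".join(w.text for w in transcript[lo:hi])
            hits.append((word.start, snippet))
    return hits


# Made-up transcript: one word every half second, starting at 12 seconds.
words = "so the eigenvalue of this matrix is two".split()
transcript = [Word(w, 12.0 + 0.5 * i) for i, w in enumerate(words)]

hits = search(transcript, "eigenvalue", context=2)
for start, snippet in hits:
    print(f"{start:.1f}s: {snippet}")
```

A video player could then seek to the returned start time, and the same timestamps could drive the scrolling of the aligned transcript.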

Another of our prototypes is a web-based spoken lecture processing server that allows users to upload audio files for automatic transcription and indexing. To help the speech recognizer, users can provide their own supplemental text files, such as journal articles or book chapters, which can be used to adapt the language model and vocabulary of the system.

Publications

Conference and Workshop Papers

J. Glass, T. Hazen, L. Hetherington, and C. Wang, "Analysis and Processing of Lecture Audio Data: Preliminary Investigations," Proc. Human Language Technology NAACL, Speech Indexing Workshop, 9-12, Boston, May 2004.

A. Park, T. Hazen, and J. Glass, "Automatic Processing of Audio Lectures for Information Retrieval: Vocabulary Selection and Language Modeling," Proc. Int. Conf. on Acoustics, Speech, and Signal Proc., Philadelphia, PA, March 2005.

J. Glass, T. Hazen, S. Cyphers, K. Schutte, and A. Park, "The MIT Spoken Lecture Processing Project," Proc. HLT/EMNLP, Vancouver, October 2005.

J. Glass, T. Hazen, S. Cyphers, I. Malioutov, and R. Barzilay, "Progress in Spoken Lecture Processing," submitted to Interspeech, Pittsburgh, September 2006.

Lab Abstracts

J. Glass, T. Hazen, L. Hetherington, and C. Wang, "Analysis and Processing of Lecture Audio Data: Preliminary Investigations," CSAIL abstract, 2004.

A. Park, T. Hazen, and J. Glass, "Automatic Processing of Audio Lectures for Information Retrieval," CSAIL abstract, 2005.

S. Cyphers, T. Hazen, and J. Glass, "Tools for Automatic Transcription and Browsing of Audio-Visual Presentations," CSAIL abstract, 2005.

T. Hazen, "Automatic Alignment and Error Correction of Human Generated Transcripts for Long Speech Recordings," CSAIL abstract, 2006.

J. Glass, T. Hazen, S. Cyphers, and J. Katz-Brown, "The MIT Spoken Lecture Processing Project," CSAIL abstract, 2006.

Theses and Proposals

A. Park, Pattern discovery for unsupervised processing of continuous speech, Ph.D. thesis proposal, Dept. of EECS, MIT, 2005.

Links

Spoken Lecture Processing Server page


site last updated: July 20, 2006