ISCA Archive MSDR 2003

Information access in large spoken archives

Martin Franz, Bhuvana Ramabhadran, Todd Ward, Michael Picheny

Digital archives have emerged as the pre-eminent method for capturing the human experience. Before such archives can be used efficiently, their contents must be described. The scale of such archives along with the associated content mark up cost make it impractical to provide access via purely manual means, but automatic technologies for search in spoken materials still have relatively limited capabilities. The NSF-funded MALACH project will use the world's largest digital archive of video oral histories, collected by the Survivors of the Shoah Visual History Foundation (VHF), to make a quantum leap in the ability to access such archives by advancing the state-of-the-art in Automated Speech Recognition (ASR), Natural Language Processing (NLP) and related technologies. This corpus consists of over 115,000 hours of unconstrained, natural speech from 52,000 speakers in 32 different languages, filled with disfluencies, heavy accents, age-related coarticulations, and uncued speaker and language switching. This paper discusses some of the ASR and NLP tools and technologies that we have been building for the English speech in the MALACH corpus. We also discuss this new test bed while emphasizing the unique characteristics of this corpus.

Cite as: Franz, M., Ramabhadran, B., Ward, T., Picheny, M. (2003) Information access in large spoken archives. Proc. ISCA Workshop on Multilingual Spoken Document Retrieval (MSDR 2003), 37-42

  author={Martin Franz and Bhuvana Ramabhadran and Todd Ward and Michael Picheny},
  title={{Information access in large spoken archives}},
  booktitle={Proc. ISCA Workshop on Multilingual Spoken Document Retrieval (MSDR 2003)},