2003 ISCA Workshop on
Multilingual Spoken Document Retrieval
(MSDR2003)
Hong Kong
April 4-5, 2003
Information Access in Large Spoken Archives
Martin Franz, Bhuvana Ramabhadran, Todd Ward, Michael Picheny
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Digital archives have emerged as the pre-eminent method
for capturing the human experience. Before such archives
can be used efficiently, their contents must be described.
The scale of such archives, along with the associated cost of
content markup, makes it impractical to provide access via
purely manual means, but automatic technologies for search
in spoken materials still have relatively limited capabilities.
The NSF-funded MALACH project will use the world’s
largest digital archive of video oral histories, collected by
the Survivors of the Shoah Visual History Foundation (VHF),
to make a quantum leap in the ability to access such archives
by advancing the state of the art in Automated Speech
Recognition (ASR), Natural Language Processing (NLP)
and related technologies. This corpus consists of
over 115,000 hours of unconstrained, natural speech from
52,000 speakers in 32 different languages, filled with disfluencies,
heavy accents, age-related coarticulations, and uncued
speaker and language switching. This paper discusses
some of the ASR and NLP tools and technologies that we
have been building for the English speech in the MALACH
corpus. We also discuss this new test bed while emphasizing
the unique characteristics of this corpus.
Bibliographic reference.
Franz, Martin / Ramabhadran, Bhuvana / Ward, Todd / Picheny, Michael (2003):
"Information access in large spoken archives",
In MSDR-2003, 37-42.