Research in speech recognition has been underway for decades, and a great deal of progress has been made in reducing the word error rate. However, recent studies demonstrate that machine performance remains quite far from human performance across a wide variety of tasks, ranging from high-bandwidth digit recognition to large-vocabulary telephony speech. In addition, for most speech recognition tasks, obtaining good performance relies on tuning to a particular domain or environment. For instance, a system trained on the Switchboard corpus is unlikely to provide close to optimal performance on a small-vocabulary task such as telephone digits. As we begin to strive towards developing recognition systems that equal, or even surpass, human performance, it does not make sense to construct a separate system for each specific domain and environment. Consequently, our initial goal is to develop a generic speech recognition system that can deal with linguistically, as well as acoustically, different domains. In order to achieve this goal, we must combine advances in signal processing, language modeling, and acoustic modeling with substantially enhanced training and testing data. In this paper, we outline new techniques to develop a generic system that can work on a multitude of domains and environments. We propose to train and benchmark this system using speech data from a variety of sources, representing a variety of linguistic domains, channels, and environments.
Cite as: Padmanabhan, M., Picheny, M. (2000) Towards super-human speech recognition. Proc. ASR2000 - Automatic Speech Recognition: Challenges for the New Millenium, 189-194