Recurrent Neural Network-Based Phoneme Sequence Estimation Using Multiple ASR Systems’ Outputs for Spoken Term Detection

Naoki Sawada, Hiromitsu Nishizaki


This paper describes a novel method for estimating correct phoneme sequences using a recurrent neural network (RNN)-based framework for spoken term detection (STD). In an automatic speech recognition (ASR)-based STD framework, ASR performance (word or subword error rate) affects STD performance; therefore, reducing ASR errors is important for obtaining good STD results. In this study, we reduce phoneme errors with an RNN-based phoneme estimator that, as a post-processing step, estimates the correct phoneme sequence of an utterance from several phoneme-based transcriptions produced by multiple ASR systems. On two types of test speech corpora, the proposed phoneme estimator produced phoneme-based N-best transcriptions with fewer phoneme recognition errors than the N-best transcriptions from the best single ASR system we prepared. In addition, compared with our previously proposed STD system using a conditional random field-based phoneme estimator, the STD system with the RNN-based phoneme estimator substantially improved STD performance on two STD test collections.
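To make the idea concrete, the following is a minimal, illustrative sketch (not the paper's actual model) of an RNN-based phoneme estimator: at each position it consumes the concatenated one-hot phoneme hypotheses from several ASR systems and emits a single estimated phoneme. The phoneme inventory, number of systems, hidden size, and the assumption that the transcriptions are already position-aligned are all hypothetical; in the paper the estimator is trained, whereas here the weights are random and untrained.

```python
import numpy as np

# Hypothetical toy setup: small phoneme inventory, three ASR systems.
PHONES = ["a", "i", "u", "e", "o", "k", "s", "t"]
P = len(PHONES)
N_ASR = 3           # number of ASR systems feeding the estimator
IN_DIM = N_ASR * P  # concatenated one-hot hypotheses per position
HID = 16            # hidden state size (arbitrary for this sketch)

# Untrained weights with a fixed seed; a real estimator would learn these.
rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.1, size=(HID, IN_DIM))
Whh = rng.normal(scale=0.1, size=(HID, HID))
Why = rng.normal(scale=0.1, size=(P, HID))

def one_hot(ph):
    """One-hot encode a phoneme from the toy inventory."""
    v = np.zeros(P)
    v[PHONES.index(ph)] = 1.0
    return v

def estimate(hyps):
    """Run a simple Elman-style RNN over position-aligned phoneme
    hypotheses from N_ASR systems; return the 1-best phoneme per
    position."""
    T = len(hyps[0])
    h = np.zeros(HID)
    out = []
    for t in range(T):
        # Input: concatenation of each system's hypothesis at position t.
        x = np.concatenate([one_hot(seq[t]) for seq in hyps])
        h = np.tanh(Wxh @ x + Whh @ h)  # recurrent state update
        y = Why @ h                     # unnormalized phoneme scores
        out.append(PHONES[int(np.argmax(y))])
    return out

# Three (hypothetical) 1-best phoneme transcriptions of the same utterance.
hyps = [list("kasa"), list("kata"), list("kasa")]
print(estimate(hyps))
```

With trained weights, the network can learn which system to trust in which context, which is what lets it outperform any single ASR system's transcription.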


DOI: 10.21437/Interspeech.2016-337

Cite as

Sawada, N., Nishizaki, H. (2016) Recurrent Neural Network-Based Phoneme Sequence Estimation Using Multiple ASR Systems’ Outputs for Spoken Term Detection. Proc. Interspeech 2016, 3688-3692.

Bibtex
@inproceedings{Sawada+2016,
  author={Naoki Sawada and Hiromitsu Nishizaki},
  title={Recurrent Neural Network-Based Phoneme Sequence Estimation Using Multiple ASR Systems’ Outputs for Spoken Term Detection},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-337},
  url={http://dx.doi.org/10.21437/Interspeech.2016-337},
  pages={3688--3692}
}