Eighth ISCA Workshop on Speech Synthesis
Barcelona, Catalonia, Spain
When data retrieved from the internet is used to create new speech databases, the recording conditions can be highly variable within and between sessions. This variance affects the overall performance of any automatic speech and text alignment techniques used to process the data. In this paper we discuss the use of speaker adaptation methods to address this issue. Starting from a baseline system for automatic sentence-level segmentation and speech and text alignment based on GMMs and grapheme HMMs, respectively, we employ Maximum A Posteriori (MAP) and Constrained Maximum Likelihood Linear Regression (CMLLR) techniques to model the variation in the data and thereby increase the amount of confidently aligned speech. We tested 29 different scenarios, which include reverberation, 8-talker babble noise and white noise, each in various combinations and SNRs. Results show that the performance of the MAP-based segmentation is strongly influenced by the noise type, as well as by the presence or absence of reverberation. On the other hand, the CMLLR adaptation of the acoustic models gives an average 20% increase in the percentage of aligned data for the majority of the studied scenarios.
Index Terms: speech alignment, speech segmentation, adaptive training, CMLLR, MAP, VAD
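The abstract mentions noise added at various SNRs but does not say how the noisy conditions were generated. As an illustration only, the following minimal sketch shows one common way to mix a noise signal into speech at a target SNR: scale the noise so that the ratio of speech power to noise power matches the requested value in dB. The function names (`power`, `mix_at_snr`, `measure_snr_db`), the synthetic sine wave standing in for speech, and the 16 kHz sample count are all assumptions made for this example and are not taken from the paper.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
    then add it to `speech`. Returns (noisy_signal, scaled_noise)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise signal has zero power")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the target P_noise.
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    scaled_noise = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled_noise)]
    return noisy, scaled_noise


def measure_snr_db(speech, noise):
    """SNR in dB between a clean signal and the noise added to it."""
    return 10 * math.log10(power(speech) / power(noise))


# Hypothetical stand-ins: a 220 Hz tone for "speech", Gaussian white noise.
rng = random.Random(0)
num_samples = 16000
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(num_samples)]
white_noise = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

noisy, scaled_noise = mix_at_snr(speech, white_noise, snr_db=10.0)
achieved = measure_snr_db(speech, scaled_noise)
```

Here `achieved` should match the requested 10 dB up to floating-point error. Babble noise or reverberation would need recorded noise or a room impulse response in place of the white-noise stand-in; neither is modelled in this sketch.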
Bibliographic reference. Mamiya, Yoshitaka / Stan, Adriana / Yamagishi, Junichi / Bell, Peter / Watts, Oliver / Clark, Robert A. J. / King, Simon (2013): "Using adaptation to improve speech transcription alignment in noisy and reverberant environments", In SSW8, 41-46.