5th International Conference on Spoken Language Processing
We describe techniques used in the development of a high-accuracy automatic annotation system designed to provide new voices for a concatenative speech synthesiser. We have used standard HMM-based "forced alignment" techniques and have concentrated on refining both acoustic and pronunciation modelling to achieve greater alignment accuracy. Acoustic models were improved by Bayesian speaker adaptation and by the use of confidence measures from N-best decodings to produce speaker-dependent HMMs. Pronunciation modelling improvements involved the use of a large pronunciation dictionary containing multiple pronunciations for many words, the use of pronunciation probabilities, the accommodation of interword silences, and the use of information derived from existing manual annotations to guide the recogniser during decoding. The system produces time-aligned phonetic annotations for UK accents in which the automatic and manual alignments agree on the segmental labelling 93% of the time and in which the boundaries have an r.m.s. error of 14.5 ms relative to the manual boundaries.
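The two figures quoted in the abstract, segmental label agreement and r.m.s. boundary error, can be illustrated with a minimal sketch. It is not the authors' scoring procedure, which the abstract does not specify. It assumes that the automatic and manual alignments contain the same number of segments in the same order, that each segment is a hypothetical `(label, start_ms, end_ms)` tuple, and that boundary error is measured on the end boundary of segments whose labels agree. The function names and the toy data are illustrative only.

```python
import math

# Hypothetical segment format: (phone label, start time in ms, end time in ms).

def label_agreement(auto, manual):
    """Fraction of segments whose automatic label equals the manual label.

    Assumes both alignments have the same number of segments in the same order.
    """
    if len(auto) != len(manual):
        raise ValueError("alignments must contain the same number of segments")
    matches = sum(a[0] == m[0] for a, m in zip(auto, manual))
    return matches / len(manual)

def rms_boundary_error_ms(auto, manual):
    """Root-mean-square difference (ms) between automatic and manual end
    boundaries, taken over segments whose labels agree (an assumption)."""
    squared = [(a[2] - m[2]) ** 2 for a, m in zip(auto, manual) if a[0] == m[0]]
    if not squared:
        raise ValueError("no segments with matching labels")
    return math.sqrt(sum(squared) / len(squared))

# Toy data, invented for illustration; not taken from the paper.
manual = [("sil", 0.0, 120.0), ("k", 120.0, 190.0), ("ae", 190.0, 310.0), ("t", 310.0, 400.0)]
auto = [("sil", 0.0, 110.0), ("k", 110.0, 200.0), ("eh", 200.0, 305.0), ("t", 305.0, 400.0)]

agreement = label_agreement(auto, manual)
rms = rms_boundary_error_ms(auto, manual)
print(f"label agreement: {agreement:.2f}")
print(f"r.m.s. boundary error: {rms:.2f} ms")
```

On the toy data, three of the four labels agree, and the r.m.s. error is computed from the three matching segments only. A real evaluation would also need a policy for insertions and deletions, which this sketch sidesteps by requiring equal-length alignments.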
Bibliographic reference. Cox, Stephen / Brady, Richard / Jackson, Peter (1998): "Techniques for accurate automatic annotation of speech waveforms", In ICSLP-1998, paper 0466.