We describe a multi-stage speech production model consisting of a linear, phoneme-independent coarticulation filter followed by a nonlinear component. The nonlinear component generates two cepstra, which are then combined additively: one corresponds to a relatively smooth background spectrum, and the other represents three formant-like spectral peaks. A neural net is used for both parts, but the second part also uses a hard-coded function that generates exactly three spectral peaks. We develop a unified model of training, adaptation, and decoding, in which the three operations differ only in their prior probability distributions. Prior probabilities can be introduced at each stage of the model, providing a flexible framework for using both specific and general prior knowledge. We demonstrate the use of this model for speech synthesis as well as recognition.
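The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the network sizes, the filter shape, or the form of the hard-coded peak function. The sketch assumes a Hann-shaped FIR kernel for the coarticulation filter, small randomly initialised one-hidden-layer networks, a 16 kHz sampling rate, 12 cepstral coefficients, and a peak function that models each formant as a conjugate pole pair, whose cepstrum is c_n = (2/n) r^n cos(n*theta). All function and parameter names are hypothetical.

```python
import numpy as np

FS = 16000.0   # assumed sampling rate in Hz (not stated in the abstract)
N_CEP = 12     # assumed cepstral order (not stated in the abstract)

def coarticulation_filter(targets, kernel):
    """Linear, phoneme-independent stage: the same FIR kernel smooths
    every target dimension over time. targets: (T, D), kernel: (K,)."""
    return np.stack(
        [np.convolve(targets[:, d], kernel, mode="same")
         for d in range(targets.shape[1])],
        axis=1,
    )

def mlp(x, w1, b1, w2, b2):
    """One-hidden-layer network standing in for the paper's neural net."""
    return np.tanh(x @ w1 + b1) @ w2 + b2

def formant_cepstrum(freqs_hz, bws_hz, n_cep=N_CEP, fs=FS):
    """Assumed hard-coded peak function: each formant is a conjugate pole
    pair with radius r = exp(-pi*bw/fs) and angle theta = 2*pi*f/fs.
    Its cepstrum is c_n = (2/n) * r**n * cos(n*theta); peaks are summed."""
    n = np.arange(1, n_cep + 1)
    r = np.exp(-np.pi * np.asarray(bws_hz)[..., None] / fs)
    theta = 2.0 * np.pi * np.asarray(freqs_hz)[..., None] / fs
    return np.sum((2.0 / n) * r**n * np.cos(n * theta), axis=-2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def synthesize_cepstra(targets, kernel, bg_params, fm_params):
    """Filter the targets, then add the background and formant cepstra."""
    state = coarticulation_filter(targets, kernel)
    background = mlp(state, *bg_params)            # (T, N_CEP)
    raw = sigmoid(mlp(state, *fm_params))          # (T, 6), values in (0, 1)
    freqs = 200.0 + 3800.0 * raw[:, :3]            # three peak frequencies (Hz)
    bws = 50.0 + 350.0 * raw[:, 3:]                # three bandwidths (Hz)
    return background + formant_cepstrum(freqs, bws)

def random_mlp(rng, n_in, n_hidden, n_out):
    """Untrained placeholder weights, for illustration only."""
    return (rng.normal(size=(n_in, n_hidden)) * 0.1, np.zeros(n_hidden),
            rng.normal(size=(n_hidden, n_out)) * 0.1, np.zeros(n_out))

rng = np.random.default_rng(0)
T, D = 20, 4                                       # frames, target dimensions
targets = rng.normal(size=(T, D))                  # stand-in phoneme targets
kernel = np.hanning(7)
kernel = kernel / kernel.sum()                     # unit-gain smoothing kernel
bg_params = random_mlp(rng, D, 8, N_CEP)
fm_params = random_mlp(rng, D, 8, 6)
cepstra = synthesize_cepstra(targets, kernel, bg_params, fm_params)
```

Because exactly three pole pairs enter `formant_cepstrum`, the second cepstrum can contribute no more than three resonant peaks, whatever the network outputs; this is one plausible reading of the "hard-coded function" in the abstract, not a confirmed detail of the paper.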
Cite as: Gao, Y., Bakis, R., Huang, J., Xiang, B. (2000) Multistage coarticulation model combining articulatory, formant and cepstral features. Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000), vol. 1, 25-28, doi: 10.21437/ICSLP.2000-7
@inproceedings{gao00_icslp,
  author={Yuqing Gao and Raimo Bakis and Jing Huang and Bing Xiang},
  title={{Multistage coarticulation model combining articulatory, formant and cepstral features}},
  year=2000,
  booktitle={Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000)},
  pages={vol. 1, 25-28},
  doi={10.21437/ICSLP.2000-7}
}