ISCA Archive ICSLP 2000

Stream confidence estimation for audio-visual speech recognition

Gerasimos Potamianos, Chalapathy Neti

We investigate the use of single-modality confidence measures as a means of estimating adaptive, local weights for improved audio-visual automatic speech recognition. We limit our work to the toy problem of audio-visual phonetic classification by means of a two-stream Gaussian mixture model (GMM), where each stream models the class-conditional audio-only or visual-only observation probability, raised to an appropriate exponent. We treat these stream exponents as two-dimensional piecewise constant functions of the audio and visual stream local confidences, and we estimate them by minimizing the misclassification error on a held-out data set. Three stream confidence measures are investigated, namely the stream entropy, the n-best likelihood ratio average, and an n-best stream likelihood dispersion measure. The latter results in superior audio-visual phonetic classification, as indicated by our experiments on a 260-subject, 40-hour, large-vocabulary, continuous speech audio-visual dataset. By using local, dispersion-based stream exponents, we achieve an additional 20% phone classification accuracy improvement over the improvement that global stream exponents add to clean audio-only phonetic classification. The performance of the algorithm, however, still falls significantly short of an "oracle" (cheating) confidence estimation scheme.
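The scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the exact definitions of the three confidence measures, the constraint that the two exponents sum to one, the bin edges, the exponent grid, and every function name below are assumptions made for illustration only. Each stream is represented by a list of per-class log-likelihoods, so raising a likelihood to an exponent becomes a multiplication in the log domain.

```python
import math

# All definitions below are illustrative assumptions, not taken from the paper.

def stream_entropy(loglik):
    """Entropy of the class posterior implied by one stream (uniform prior assumed)."""
    m = max(loglik)
    weights = [math.exp(l - m) for l in loglik]
    z = sum(weights)
    return -sum((w / z) * math.log(w / z) for w in weights if w > 0.0)

def nbest_llr_average(loglik, n=3):
    """Average log-likelihood gap between the best class and the other n-best classes."""
    top = sorted(loglik, reverse=True)[:n]
    return sum(top[0] - l for l in top[1:]) / (len(top) - 1)

def nbest_dispersion(loglik, n=3):
    """Average pairwise log-likelihood gap among the n-best classes."""
    top = sorted(loglik, reverse=True)[:n]
    k = len(top)
    total = sum(top[i] - top[j] for i in range(k) for j in range(i + 1, k))
    return 2.0 * total / (k * (k - 1))

def bin_index(value, edges):
    """Index of the piecewise constant region that `value` falls into."""
    return sum(value >= e for e in edges)

def combine(audio_ll, visual_ll, lam_audio):
    """Classify with exponents lam_audio and 1 - lam_audio (sum-to-one is assumed)."""
    scores = [lam_audio * a + (1.0 - lam_audio) * v
              for a, v in zip(audio_ll, visual_ll)]
    return max(range(len(scores)), key=scores.__getitem__)

def cell_of(audio_ll, visual_ll, a_edges, v_edges, n=3):
    """Locate the 2-D confidence cell from the dispersion of each stream."""
    return (bin_index(nbest_dispersion(audio_ll, n), a_edges),
            bin_index(nbest_dispersion(visual_ll, n), v_edges))

def classify(audio_ll, visual_ll, table, a_edges, v_edges, n=3):
    """Look up the local audio exponent, then combine the two streams."""
    i, j = cell_of(audio_ll, visual_ll, a_edges, v_edges, n)
    return combine(audio_ll, visual_ll, table[i][j])

def fit_exponent_table(samples, a_edges, v_edges, grid, n=3):
    """Pick, per cell, the grid exponent with the fewest held-out errors.

    `samples` is a list of (audio_ll, visual_ll, true_class) tuples.
    Cells with no held-out data keep a neutral exponent of 0.5.
    """
    rows, cols = len(a_edges) + 1, len(v_edges) + 1
    cells = [[[] for _ in range(cols)] for _ in range(rows)]
    for a, v, y in samples:
        i, j = cell_of(a, v, a_edges, v_edges, n)
        cells[i][j].append((a, v, y))
    table = [[0.5] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            data = cells[i][j]
            if data:
                table[i][j] = min(
                    grid,
                    key=lambda lam: sum(combine(a, v, lam) != y for a, v, y in data),
                )
    return table
```

Because each cell of the piecewise constant function has its own exponent, the held-out error can be minimized cell by cell, which is why the sketch uses a simple per-cell grid search. Any of the three confidence functions could replace `nbest_dispersion` inside `cell_of`; dispersion is used here only because the abstract reports it as the best performer.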


Cite as: Potamianos, G., Neti, C. (2000) Stream confidence estimation for audio-visual speech recognition. Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000), vol. 3, 746-749

@inproceedings{potamianos00c_icslp,
  author={Gerasimos Potamianos and Chalapathy Neti},
  title={{Stream confidence estimation for audio-visual speech recognition}},
  year=2000,
  booktitle={Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000)},
  pages={vol. 3, 746-749}
}