Sixth European Conference on Speech Communication and Technology
Input features that capture speech dynamics have frequently been proposed to improve recognition accuracy. A broad class of such features can be obtained by applying a linear projection to a window spanning successive feature vectors. When the linear projection is optimized according to a maximum likelihood criterion, it can be compared directly to conventional modeling schemes. On a large acoustic training database of conversational telephone speech, maximum likelihood temporal features reduced phoneme errors by 5.5% and word errors by 6%. On smaller databases the projection was undertrained, and no significant improvement in error rates was observed.
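The windowed linear projection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the edge handling (repeating the first and last frame), and the hand-picked difference filter used as the projection matrix are all assumptions made for illustration. In the paper the projection is estimated under a maximum likelihood criterion, which this sketch does not attempt.

```python
from typing import List

Vector = List[float]


def stack_window(frames: List[Vector], t: int, k: int) -> Vector:
    """Concatenate frames t-k .. t+k into one long vector.

    Assumption: frames beyond either end are replaced by the nearest edge frame.
    """
    last = len(frames) - 1
    stacked: Vector = []
    for offset in range(-k, k + 1):
        idx = min(max(t + offset, 0), last)  # clamp to a valid frame index
        stacked.extend(frames[idx])
    return stacked


def project(matrix: List[Vector], vector: Vector) -> Vector:
    """Multiply a (p x n) matrix by a length-n vector."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("matrix width must equal the stacked vector length")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def temporal_features(frames: List[Vector], matrix: List[Vector], k: int) -> List[Vector]:
    """Apply the same projection to the window centred on every frame."""
    return [project(matrix, stack_window(frames, t, k)) for t in range(len(frames))]


# Hypothetical example: one-dimensional frames, a window of 3 frames (k = 1),
# and a fixed difference filter standing in for a trained projection.
frames = [[1.0], [2.0], [4.0]]
difference_filter = [[-0.5, 0.0, 0.5]]
example = temporal_features(frames, difference_filter, 1)
# example == [[0.5], [1.5], [1.0]]
```

With this particular matrix the output is a simple slope estimate across neighbouring frames, which shows how a fixed difference feature is one member of the broader class of windowed linear projections; a trained projection would replace the hand-picked rows with estimated ones.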
Bibliographic reference. Boulianne, Gilles / Brousseau, Julie / Talbot, Nathalie / Dumouchel, Pierre (1999): "Experiments in constrained maximum likelihood extraction of temporal features for speech recognition", In EUROSPEECH'99, 1083-1086.