We describe our work on incorporating phone duration probabilities, learned by a neural net, into an ASR system. Phone durations are incorporated via lattice rescoring. The input features are derived from the identities of the phones in a context window, plus the durations of the preceding phones within that window. Unlike some previous work, our network directly outputs the probability of each possible duration (in frames), up to a fixed limit. We evaluate this method on several large-vocabulary tasks. We consistently see improvements in word error rate, but the improvements are smaller when the lattices are generated with neural-net-based acoustic models.
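To make the setup described above concrete, here is a minimal sketch of how such a duration model could be wired into rescoring. It is an illustration under stated assumptions, not the authors' implementation: the phone-set size, window sizes, duration cap, the use of raw frame counts as duration features, the shared last bin for long durations, the `weight` parameter, and every function name are hypothetical. Scores are assumed to be log-domain with higher meaning better.

```python
import math

NUM_PHONES = 4      # hypothetical tiny phone set
MAX_DURATION = 5    # hypothetical cap in frames; longer durations share the last bin
LEFT, RIGHT = 2, 1  # hypothetical context window: 2 phones left, 1 phone right


def one_hot(index, size):
    """One-hot vector; all zeros when the position falls outside the utterance."""
    vec = [0.0] * size
    if index is not None:
        vec[index] = 1.0
    return vec


def make_features(phones, durations, pos):
    """Input for the phone at `pos`: identities of every phone in the window,
    plus the durations of the preceding phones only."""
    feats = []
    for i in range(pos - LEFT, pos + RIGHT + 1):
        inside = 0 <= i < len(phones)
        feats += one_hot(phones[i] if inside else None, NUM_PHONES)
    for i in range(pos - LEFT, pos):
        feats.append(float(durations[i]) if i >= 0 else 0.0)
    return feats


def duration_bin(frames):
    """Map a duration in frames (>= 1) to an output index, capped at the limit."""
    return min(frames, MAX_DURATION) - 1


def softmax(logits):
    """Numerically stable softmax over the duration bins."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rescore_path(path_score, phones, durations, model, weight=1.0):
    """Add the weighted duration log-probability to one lattice path's score.
    `model` is any callable mapping a feature vector to MAX_DURATION logits."""
    log_p = 0.0
    for pos in range(len(phones)):
        probs = softmax(model(make_features(phones, durations, pos)))
        log_p += math.log(probs[duration_bin(durations[pos])])
    return path_score + weight * log_p


def uniform_model(features):
    """Stand-in for a trained network: equal logits for every duration bin."""
    return [0.0] * MAX_DURATION
```

With the stand-in `uniform_model`, calling `rescore_path(-10.0, [0, 1, 2], [3, 2, 9], uniform_model)` adds the same penalty, log(1/5), for each of the three phones. A trained network would instead assign a different penalty to each path depending on how plausible its phone durations are, which is what lets rescoring reorder competing lattice paths.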
Cite as: Hadian, H., Povey, D., Sameti, H., Khudanpur, S. (2017) Phone Duration Modeling for LVCSR Using Neural Networks. Proc. Interspeech 2017, 518-522, doi: 10.21437/Interspeech.2017-1680
@inproceedings{hadian17_interspeech,
  author={Hossein Hadian and Daniel Povey and Hossein Sameti and Sanjeev Khudanpur},
  title={{Phone Duration Modeling for LVCSR Using Neural Networks}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={518--522},
  doi={10.21437/Interspeech.2017-1680}
}