In this paper, we describe a novel phone duration model that is used to improve the accuracy of a large vocabulary speech recognition system based on state-of-the-art speaker-adapted DNN acoustic models. The duration model uses a neural network to calculate the probability density function of a phone's duration from the phone's contextual features, and the resulting densities are then applied in word lattice rescoring. Experimental results are given for Estonian, English and Finnish transcription tasks. An absolute word error rate reduction of 0.8–1.4% is observed across all evaluation sets.
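As an illustration of how a duration density can enter rescoring, the following is a minimal sketch and not the paper's implementation. It assumes a log-normal duration density, replaces the word lattice with a simple n-best list, and uses a hypothetical `predict` callable as a stand-in for the neural network that maps contextual features to distribution parameters. The names `base_score`, `weight`, and the dictionary layout are likewise assumptions made for the example.

```python
import math


def lognormal_logpdf(duration, mu, sigma):
    """Log density of a log-normal distribution at `duration` (e.g. in frames).

    The log-normal form is an assumption made for this sketch.
    """
    normalizer = math.log(duration * sigma * math.sqrt(2.0 * math.pi))
    return -normalizer - (math.log(duration) - mu) ** 2 / (2.0 * sigma ** 2)


def duration_score(phones, predict):
    """Sum the duration log densities over the phones of one hypothesis.

    phones:  list of (contextual_features, duration) pairs
    predict: hypothetical stand-in for the neural network; maps
             contextual features to (mu, sigma) of the log-normal density
    """
    return sum(lognormal_logpdf(d, *predict(feats)) for feats, d in phones)


def rescore(hypotheses, predict, weight=1.0):
    """Rank hypotheses by base score plus weighted duration score.

    Each hypothesis is a dict with keys "words", "base_score" (combined
    acoustic and language model log score) and "phones". An n-best list
    is used here as a simplification of a word lattice.
    """
    scored = [
        (h["base_score"] + weight * duration_score(h["phones"], predict), h["words"])
        for h in hypotheses
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In this sketch a hypothesis whose phone durations are implausible under the predicted densities is penalized, so it can fall below a competitor with a slightly worse base score. The scaling factor `weight` would in practice be tuned on held-out data.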
Bibliographic reference. Alumäe, Tanel (2014): "Neural network phone duration model for speech recognition", In INTERSPEECH-2014, 1204-1208.