Fourth International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2014)
St. Petersburg, Russia
This paper describes our current automatic transcription system for Estonian semi-spontaneous speech, which we are developing within the Estonian national language technology program. A three-pass decoding strategy is employed, with speaker-independent GMM acoustic models in the first pass and speaker-adapted DNN-HMM models in the last pass. A neural-network-based phone duration model is used to rescore recognition lattices after the final pass and is found to give a surprisingly large gain in recognition accuracy. Compound words are split before building the statistical language model and reconstructed in the recognized hypotheses using an n-gram model. The word error rate of our system is 17.9% on broadcast conversations and 26.3% on conference speeches, around 8% absolute (24-30% relative) better than a GMM-based system from 2012.
Index Terms: Speech recognition, LVCSR, DNN, duration model, Estonian
Bibliographic reference. Alumäe, Tanel (2014): "Recent improvements in Estonian LVCSR", In SLTU-2014, 118-123.
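The abstract's compound-word handling (splitting compounds before language-model training, then rejoining parts in the recognized hypotheses with an n-gram model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the training lines, the "+" compound marker, and the greedy bigram-count join rule are all assumptions made for the example.

```python
# Toy sketch of compound-word splitting and n-gram-based reconstruction.
# Assumption: training text marks compound boundaries with "+", e.g.
# "raamatu+kogu" (raamatukogu, "library"). Not the paper's actual setup.
from collections import Counter

TRAIN = [
    "raamatu+kogu on suur",         # compound split into two parts
    "raamatu+kogu avatakse homme",
    "kogu linn on vaikne",          # "kogu" also occurs as a free word
]

def split_corpus(lines):
    """Replace '+' markers with spaces so the LM sees compound parts as words."""
    return [line.replace("+", " ") for line in lines]

def train_join_counts(lines):
    """Count how often each adjacent word pair occurs inside vs. outside a compound."""
    joined, separate = Counter(), Counter()
    for line in lines:
        tokens = line.replace("+", " +").split()
        prev = None
        for tok in tokens:
            if tok.startswith("+") and prev is not None:
                joined[(prev, tok[1:])] += 1   # pair seen as a compound
                prev = tok[1:]
            else:
                if prev is not None:
                    separate[(prev, tok)] += 1  # pair seen as two words
                prev = tok
    return joined, separate

def reconstruct(tokens, joined, separate):
    """Greedily rejoin adjacent recognized tokens when joining is more frequent."""
    out = [tokens[0]]
    for tok in tokens[1:]:
        if joined[(out[-1], tok)] > separate[(out[-1], tok)]:
            out[-1] += tok   # merge into one compound word
        else:
            out.append(tok)
    return out
```

In practice the paper uses a statistical n-gram model over compound parts rather than raw bigram counts, but the pipeline shape is the same: train the LM on the split corpus, decode, then rejoin.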