Fourth International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2014)

St. Petersburg, Russia
May 14-16, 2014

Recent Improvements in Estonian LVCSR

Tanel Alumäe

Institute of Cybernetics at Tallinn Technical University, Estonia

This paper describes our current automatic transcription system for Estonian semi-spontaneous speech, which we are developing within the Estonian national language technology program. A three-pass decoding strategy is employed, with speaker-independent GMM acoustic models used in the first pass and speaker-adapted DNN-HMM models in the last pass. A neural-network-based phone duration model is used to rescore recognition lattices after the final pass and is found to give a surprisingly large gain in recognition accuracy. Compound words are split before building a statistical language model and are reconstructed from recognized hypotheses using an n-gram model. The word error rate of our system is 17.9% on broadcast conversations and 26.3% on conference speeches. This is an improvement of around 8% absolute (24-30% relative) over a GMM-based system from 2012.
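The relation between the absolute and relative figures quoted above can be illustrated with a short sketch. It assumes the absolute gain is exactly 8 percentage points in both domains, which the abstract states only approximately. The 2012 baseline error rates are therefore derived values, not reported ones, and the names `relative_reduction` and `ASSUMED_ABS_GAIN` are hypothetical.

```python
def relative_reduction(new_wer: float, absolute_gain: float) -> float:
    """Relative WER reduction, given the new WER and the absolute gain.

    Both arguments are in percentage points. The baseline WER is
    reconstructed as new_wer + absolute_gain.
    """
    baseline = new_wer + absolute_gain
    return absolute_gain / baseline


# Assumption: the abstract says "around 8% absolute"; exactly 8.0 is used here.
ASSUMED_ABS_GAIN = 8.0

# WERs reported in the abstract for the current system.
reported_wer = {
    "broadcast conversations": 17.9,
    "conference speeches": 26.3,
}

for domain, wer in reported_wer.items():
    rel = relative_reduction(wer, ASSUMED_ABS_GAIN)
    print(f"{domain}: implied baseline {wer + ASSUMED_ABS_GAIN:.1f}%, "
          f"relative reduction {100 * rel:.1f}%")
```

Under this assumption the relative reductions come out near 31% and 23%, close to the 24-30% range quoted in the abstract. The small differences are expected, since the 8% figure is rounded.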

Index Terms: Speech recognition, LVCSR, DNN, duration model, Estonian

Bibliographic reference. Alumäe, Tanel (2014): "Recent improvements in Estonian LVCSR", in SLTU-2014, 118-123.