Survey Talk: Modeling in Automatic Speech Recognition: Beyond Hidden Markov Models

Ralf Schlüter


The general architecture and modeling of the state-of-the-art statistical approach to automatic speech recognition (ASR) have not been challenged significantly for decades. The classical statistical approach to ASR is based on Bayes decision rule, a separation of acoustic and language modeling, hidden Markov modeling (HMM), and a search organization based on dynamic programming and hypothesis pruning methods. Even when artificial neural networks for acoustic and language modeling began to boost ASR performance considerably, the general architecture of state-of-the-art ASR systems was not altered substantially. The hybrid deep neural network (DNN)/HMM approach, together with recurrent long short-term memory (LSTM) neural network language modeling, currently marks the state of the art on many tasks, covering a wide range of training set sizes. However, more and more alternative approaches are now emerging, moving gradually towards so-called end-to-end modeling. These end-to-end approaches replace explicit time alignment modeling and dedicated search space organization with more implicit, integrated neural-network-based representations, while also dropping the separation between acoustic and language modeling. Corresponding approaches show promising results, especially on large training sets. In this presentation, an overview of current modeling approaches to ASR will be given, including variations of both HMM-based and end-to-end modeling.
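As a point of reference (not part of the original abstract), the Bayes decision rule with the separation of acoustic and language modeling mentioned above can be sketched as follows, where $X$ denotes the acoustic observation sequence and $W$ a word sequence:

```latex
% Bayes decision rule for ASR: choose the word sequence W with
% maximum posterior probability given the acoustics X.
\hat{W} = \operatorname*{arg\,max}_{W} \; p(W \mid X)
        = \operatorname*{arg\,max}_{W} \;
          \underbrace{p(X \mid W)}_{\text{acoustic model}} \,
          \underbrace{p(W)}_{\text{language model}}
```

In the classical architecture, $p(X \mid W)$ is realized by an HMM (later a hybrid DNN/HMM) and $p(W)$ by a separate language model; end-to-end approaches instead model $p(W \mid X)$ directly with a single neural network.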


Cite as: Schlüter, R. (2019) Survey Talk: Modeling in Automatic Speech Recognition: Beyond Hidden Markov Models. Proc. Interspeech 2019.


@inproceedings{Schlüter2019,
  author={Ralf Schlüter},
  title={{Survey Talk: Modeling in Automatic Speech Recognition: Beyond Hidden Markov Models}},
  year=2019,
  booktitle={Proc. Interspeech 2019}
}