Segmental Encoder-Decoder Models for Large Vocabulary Automatic Speech Recognition

Eugen Beck, Mirko Hannemann, Patrick Dötsch, Ralf Schlüter, Hermann Ney


It has long been known that the classic hidden Markov model (HMM) derivation for speech recognition rests on assumptions, such as the conditional independence of observation vectors and weak duration modeling, that are computationally convenient but unrealistic. In the hybrid approach, this mismatch is amplified by fitting a discriminative model into a generative framework. Hidden conditional random fields (CRFs) and segmental models (e.g. semi-Markov CRFs / segmental CRFs) have been proposed as an alternative, but until recently failed to gain traction. In this paper we explore different length modeling approaches for segmental models and their relation to attention-based systems. Furthermore, we present experimental results on a handwriting recognition task and, to the best of our knowledge, the first reported results on the Switchboard 300h speech recognition corpus using this approach.
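For context on the segmental (semi-Markov) models the abstract refers to: in such models, the best labeling is found by a dynamic program over variable-length segments rather than single frames. The sketch below shows that general recurrence only; the `score` function is a hypothetical stand-in for a learned segment score, not the encoder-decoder model of the paper.

```python
# Illustrative sketch of semi-Markov (segmental) Viterbi decoding.
# NOTE: `score(s, t, label)` is a hypothetical placeholder for a learned
# segment score; the paper's actual model is an encoder-decoder network.

def segmental_viterbi(T, labels, score, max_dur):
    """Best segmentation + labeling of frames 0..T-1.

    Recurrence: best[t] = max over duration d and label l of
                best[t - d] + score(t - d, t, l)
    """
    NEG_INF = float("-inf")
    best = [NEG_INF] * (T + 1)
    best[0] = 0.0
    back = [None] * (T + 1)  # backpointers: (segment start, label)
    for t in range(1, T + 1):
        for d in range(1, min(max_dur, t) + 1):
            s = t - d
            if best[s] == NEG_INF:
                continue
            for l in labels:
                cand = best[s] + score(s, t, l)
                if cand > best[t]:
                    best[t] = cand
                    back[t] = (s, l)
    # Trace back to recover the segment sequence
    segs, t = [], T
    while t > 0:
        s, l = back[t]
        segs.append((s, t, l))
        t = s
    return best[T], list(reversed(segs))
```

The key difference from frame-level HMM/hybrid decoding is that the score is assigned to a whole segment at once, which allows explicit duration modeling instead of per-frame self-loop transitions.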


DOI: 10.21437/Interspeech.2018-1212

Cite as: Beck, E., Hannemann, M., Dötsch, P., Schlüter, R., Ney, H. (2018) Segmental Encoder-Decoder Models for Large Vocabulary Automatic Speech Recognition. Proc. Interspeech 2018, 766-770, DOI: 10.21437/Interspeech.2018-1212.


@inproceedings{Beck2018,
  author={Eugen Beck and Mirko Hannemann and Patrick Dötsch and Ralf Schlüter and Hermann Ney},
  title={Segmental Encoder-Decoder Models for Large Vocabulary Automatic Speech Recognition},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={766--770},
  doi={10.21437/Interspeech.2018-1212},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1212}
}