Discriminative segmental models, such as segmental conditional random fields (SCRFs), have recently been applied to speech recognition in lattice rescoring, where they integrate detectors across different levels of units, such as phones and words. However, lattice generation is constrained by a baseline decoder, typically a frame-based hybrid HMM-DNN system, which still suffers from the well-known frame-independence assumption. In this paper, we propose using SCRFs with DNNs directly as the acoustic model: a unified one-pass framework that models phones or sub-phonetic segments of variable length and can exploit local phone classifiers, phone transitions, and long-span features during direct word decoding. We describe a WFST-based approach that efficiently combines the proposed acoustic model with the language model in first-pass word recognition. Our evaluation on the WSJ0 corpus shows that our SCRF-DNN system outperforms both a hybrid HMM-DNN system and a frame-level CRF-DNN system using the same monophone label space.
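The core segmental idea in the abstract can be sketched as a segmental Viterbi search over variable-length segments. This is an illustrative toy, not the paper's WFST-based decoder: the frame-level scores stand in for DNN outputs, and the transition and start weights, the `max_dur` limit, and the function name are all assumptions made for the sketch.

```python
import numpy as np

def segmental_viterbi(frame_scores, trans, start, max_dur):
    """Toy segmental Viterbi decoder (illustrative, not the paper's decoder).

    frame_scores: (T, K) per-frame label log-scores (stand-ins for DNN outputs).
    trans: (K, K) log transition weights between segment labels.
    start: (K,) log weights for the first segment's label.
    max_dur: maximum segment duration in frames.
    Returns (best_score, segments), segments as (start_frame, end_frame, label).
    """
    T, K = frame_scores.shape
    # Prefix sums give O(1) lookup of a segment's summed frame scores.
    cum = np.vstack([np.zeros(K), np.cumsum(frame_scores, axis=0)])
    dp = np.full((T + 1, K), -np.inf)  # dp[t, y]: best score ending at frame t, last label y
    back = {}
    for t in range(1, T + 1):
        for y in range(K):
            for d in range(1, min(max_dur, t) + 1):
                s = t - d
                seg = cum[t, y] - cum[s, y]  # score of segment [s, t) labeled y
                if s == 0:
                    cand, prev = start[y] + seg, None
                else:
                    prev = int(np.argmax(dp[s] + trans[:, y]))
                    cand = dp[s, prev] + trans[prev, y] + seg
                if cand > dp[t, y]:
                    dp[t, y] = cand
                    back[(t, y)] = (s, prev)
    # Backtrace the best segmentation and labeling.
    y = int(np.argmax(dp[T]))
    best, segs, t = dp[T, y], [], T
    while t > 0:
        s, prev = back[(t, y)]
        segs.append((s, t, y))
        t, y = s, prev
    segs.reverse()
    return best, segs
```

Unlike frame-level Viterbi, the inner loop over durations `d` scores whole segments at once, which is what lets the model attach segment-level (long-span) features to each hypothesized segment rather than to individual frames.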
Bibliographic reference. He, Yanzhang / Fosler-Lussier, Eric (2015): "Segmental conditional random fields with deep neural networks as acoustic models for first-pass word recognition", In INTERSPEECH-2015, 2640-2644.