A WFST Framework for Single-Pass Multi-Stream Decoding

Sirui Xu, Eric Fosler-Lussier

Combining disparate automatic speech recognition systems has long been an important strategy for improving recognition accuracy. Typically, each system requires a separate decoder, and final results are derived by combining hypotheses from multiple lattices, which necessitates multiple decoding passes. We propose a novel Weighted Finite State Transducer (WFST) framework for integrating disparate systems. Our framework differs from currently popular system combination techniques in that the combination is done in one-pass decoding, and it allows systems to be combined at different levels of the decoding pipeline. In initial experiments, the framework achieved performance comparable to MBR-based combination, which is reported to outperform ROVER and Confusion Network Combination (CNC). In this paper, we describe our methodology and present pilot study results for combining systems that use different sets of acoustic models: 1) gender-dependent GMM models, 2) MFCC and PLP features with GMM models, 3) MFCC, PLP and Filter Bank features with DNN models, and 4) SNR-specific DNN acoustic models. For each experiment, we also compared the computation time of the combined system with that of its corresponding baseline systems. Our results show encouraging benefits of using the proposed framework to improve recognition performance while reducing computation time.

DOI: 10.21437/Interspeech.2016-1307

Cite as

Xu, S., Fosler-Lussier, E. (2016) A WFST Framework for Single-Pass Multi-Stream Decoding. Proc. Interspeech 2016, 1908-1912.

author={Sirui Xu and Eric Fosler-Lussier},
title={A WFST Framework for Single-Pass Multi-Stream Decoding},
booktitle={Interspeech 2016},