A GPU-based WFST Decoder with Exact Lattice Generation

Zhehuai Chen, Justin Luitjens, Hainan Xu, Yiming Wang, Daniel Povey, Sanjeev Khudanpur


We describe initial work on an extension of the Kaldi toolkit that supports weighted finite-state transducer (WFST) decoding on Graphics Processing Units (GPUs). We implement token recombination as an atomic GPU operation in order to fully parallelize the Viterbi beam search, and propose a dynamic load balancing strategy for more efficient token-passing scheduling among GPU threads. We also redesign the exact lattice generation and lattice pruning algorithms for better utilization of the GPUs. Experiments on the Switchboard corpus show that the proposed method achieves identical 1-best results and lattice quality in recognition and confidence measure tasks, while running 3 to 15 times faster than the single-process Kaldi decoder, depending on the GPU architecture. Additionally, we obtain a 46-fold speedup with sequence parallelism and Multi-Process Service (MPS) on the GPU.
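The abstract's key idea of "token recombination as an atomic GPU operation" can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation: when several threads propagate tokens into the same WFST state, the surviving token (the one with the lowest cost) is selected with a single atomic min on a packed 64-bit word, cost in the high bits and token id in the low bits. On a GPU this would be a CUDA `atomicMin` on an `unsigned long long`; here we emulate it with `std::atomic` and integer costs so the logic is self-contained and checkable.

```cpp
#include <atomic>
#include <cstdint>

// Pack a token's cost and id into one 64-bit word so that comparing the
// packed values compares costs first (cost occupies the high 32 bits).
using Packed = uint64_t;

inline Packed pack(uint32_t cost, uint32_t token_id) {
    return (static_cast<Packed>(cost) << 32) | token_id;
}
inline uint32_t unpack_cost(Packed p)  { return static_cast<uint32_t>(p >> 32); }
inline uint32_t unpack_token(Packed p) { return static_cast<uint32_t>(p & 0xFFFFFFFFu); }

// Each WFST state holds one packed slot; many threads race to update it.
// recombine() keeps whichever token has the lower cost. The CAS loop
// emulates what a hardware atomicMin would do in one instruction on a GPU.
inline void recombine(std::atomic<Packed>& slot, uint32_t cost, uint32_t token_id) {
    Packed proposed = pack(cost, token_id);
    Packed current  = slot.load();
    while (proposed < current &&
           !slot.compare_exchange_weak(current, proposed)) {
        // current is refreshed by compare_exchange_weak on failure.
    }
}
```

Because the whole (cost, id) pair is updated in one atomic step, no thread can observe a state whose stored cost and back-pointer come from two different tokens, which is what makes the beam search fully parallelizable without locks.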


DOI: 10.21437/Interspeech.2018-1339

Cite as: Chen, Z., Luitjens, J., Xu, H., Wang, Y., Povey, D., Khudanpur, S. (2018) A GPU-based WFST Decoder with Exact Lattice Generation. Proc. Interspeech 2018, 2212-2216, DOI: 10.21437/Interspeech.2018-1339.


@inproceedings{Chen2018,
  author={Zhehuai Chen and Justin Luitjens and Hainan Xu and Yiming Wang and Daniel Povey and Sanjeev Khudanpur},
  title={A GPU-based WFST Decoder with Exact Lattice Generation},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2212--2216},
  doi={10.21437/Interspeech.2018-1339},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1339}
}