ISCA Archive Interspeech 2021

Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning

Dongcheng Jiang, Chao Zhang, Philip C. Woodland

Frame selection in automatic speech recognition (ASR) systems can potentially improve the trade-off between speed and accuracy relative to fixed low frame rate methods. In this paper, a sequence training approach based on minimum error and reinforcement learning is proposed that allows a hybrid ASR system to operate at a variable frame rate, using a frame selection controller to predict the number of frames to skip before taking the next inference action. The controller is integrated into the acoustic model in a multi-task training framework as an additional regression task, and the controller output can be used for distribution characterisation during reinforcement learning exploration. The reinforcement learning objective minimises a combined measure of the phone error and the average frame rate. ASR experiments using British English multi-genre broadcast (MGB3) data show that the proposed approach achieved a lower frame rate than a fixed 1/3 low frame rate method and reduced the word error rate relative to both fixed low frame rate and full frame rate systems.
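
As a rough illustration of the idea in the abstract, the sketch below shows how a controller's predicted skip count could drive frame selection at inference time, and how a cost might combine phone error with the average frame rate. It is not the authors' implementation: the names `select_frames`, `combined_cost`, `predict_skip`, `max_skip` and `rate_weight` are hypothetical, and the rounding, the clamping and the linear weighting of the two terms are assumptions made only for this example.

```python
from typing import Callable, List


def select_frames(
    num_frames: int,
    predict_skip: Callable[[int], float],
    max_skip: int = 4,  # assumed upper bound on skipped frames
) -> List[int]:
    """Return the indices of frames on which the acoustic model is run.

    `predict_skip(t)` stands in for the controller's regression output
    at frame t (hypothetical interface).
    """
    selected: List[int] = []
    t = 0
    while t < num_frames:
        selected.append(t)
        # Assumption: round the regression output, then clamp to [0, max_skip].
        skip = min(max(int(round(predict_skip(t))), 0), max_skip)
        t += skip + 1  # move past the skipped frames to the next inference point
    return selected


def combined_cost(
    phone_error: float,
    num_selected: int,
    num_frames: int,
    rate_weight: float = 1.0,  # assumed trade-off weight
) -> float:
    """Combine phone error and average frame rate into one scalar.

    Assumption: a simple weighted sum; the abstract only states that the
    objective is a combined measure of the two quantities.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    frame_rate = num_selected / num_frames
    return phone_error + rate_weight * frame_rate
```

For instance, a controller that always predicts a skip of 2 on a 10-frame utterance would evaluate only frames 0, 3, 6 and 9, which matches a fixed 1/3 low frame rate pattern; a learned controller would instead vary the skip from frame to frame.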

doi: 10.21437/Interspeech.2021-2198

Cite as: Jiang, D., Zhang, C., Woodland, P.C. (2021) Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning. Proc. Interspeech 2021, 2601-2605, doi: 10.21437/Interspeech.2021-2198

  author={Dongcheng Jiang and Chao Zhang and Philip C. Woodland},
  title={{Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning}},
  booktitle={Proc. Interspeech 2021},