Frame selection in automatic speech recognition (ASR) systems can potentially improve the trade-off between speed and accuracy relative to fixed low frame rate methods. In this paper, a sequence training approach based on minimum error and reinforcement learning is proposed that allows a hybrid ASR system to operate at a variable frame rate. The approach uses a frame selection controller to predict the number of frames to skip before taking the next inference action. The controller is integrated into the acoustic model in a multi-task training framework as an additional regression task, and the controller output can be used for distribution characterisation during reinforcement learning exploration. The reinforcement learning objective minimises a combined measure of the phone error and average frame rate. ASR experiments using British English multi-genre broadcast (MGB3) data show that the proposed approach achieved a lower frame rate than a fixed 1/3 low frame rate method and reduced the word error rate relative to both fixed low frame rate and full frame rate systems.
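The frame-skipping idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `select_frames` and `combined_cost`, the `max_skip` clamp, the rounding of the regression output, and the `rate_weight` trade-off parameter are all assumptions made for illustration. The abstract states only that the controller predicts the number of frames to skip and that the objective combines phone error with average frame rate.

```python
from typing import Callable, List


def select_frames(
    num_frames: int,
    predict_skip: Callable[[int], float],
    max_skip: int = 4,  # hypothetical upper bound on the skip length
) -> List[int]:
    """Return the frame indices at which the acoustic model is evaluated.

    `predict_skip(t)` stands in for the controller's regression output
    at frame t (assumed interface, not taken from the paper).
    """
    selected = []
    t = 0
    while t < num_frames:
        selected.append(t)
        # Round the regression output and clamp it to [0, max_skip].
        skip = min(max(int(round(predict_skip(t))), 0), max_skip)
        # Skip `skip` frames, then evaluate the next one.
        t += skip + 1
    return selected


def combined_cost(
    phone_errors: float,
    num_selected: int,
    num_frames: int,
    rate_weight: float = 1.0,  # hypothetical trade-off weight
) -> float:
    """Weighted sum of phone error and average frame rate (assumed form)."""
    frame_rate = num_selected / num_frames
    return phone_errors + rate_weight * frame_rate


if __name__ == "__main__":
    # A controller that always predicts "skip 2 frames".
    frames = select_frames(10, lambda t: 2.0)
    print(frames)
    print(combined_cost(1.0, len(frames), 10, rate_weight=0.5))
```

With a constant prediction of two skipped frames, the sketch evaluates one frame in every three, which corresponds to the fixed 1/3 low frame rate baseline mentioned in the abstract. A learned controller would instead vary the skip length from frame to frame.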
Cite as: Jiang, D., Zhang, C., Woodland, P.C. (2021) Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning. Proc. Interspeech 2021, 2601-2605, doi: 10.21437/Interspeech.2021-2198
@inproceedings{jiang21b_interspeech,
  author={Dongcheng Jiang and Chao Zhang and Philip C. Woodland},
  title={{Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={2601--2605},
  doi={10.21437/Interspeech.2021-2198}
}