Exploiting Visual Features Using Bayesian Gated Neural Networks for Disordered Speech Recognition

Shansong Liu, Shoukang Hu, Yi Wang, Jianwei Yu, Rongfeng Su, Xunying Liu, Helen Meng


Automatic speech recognition (ASR) for disordered speech is a challenging task. People with speech disorders such as dysarthria often have physical disabilities, leading to severe degradation of speech quality, highly variable voice characteristics, and a large mismatch against normal speech. It is also difficult to record large amounts of high-quality audio-visual data for developing audio-visual speech recognition (AVSR) systems. To address these issues, a novel Bayesian gated neural network (BGNN) based AVSR approach is proposed. Speaker-level Bayesian gated control of the contribution from visual features allows a more robust fusion of the audio and video modalities. A posterior distribution over the gating parameters is used to model their uncertainty given limited and variable disordered speech data. Experiments conducted on the UASpeech dysarthric speech corpus suggest that the proposed BGNN AVSR system consistently outperforms state-of-the-art deep neural network (DNN) baseline ASR and AVSR systems by 4.5% and 4.7% absolute (14.9% and 15.5% relative) in word error rate.
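
The gating idea can be illustrated with a short sketch. The Python/PyTorch snippet below is a minimal, hypothetical rendering of a Bayesian gated fusion layer: a scalar sigmoid gate scales the visual feature stream before concatenation with the audio stream, and the gate weights are drawn from a mean-field Gaussian posterior via the reparameterization trick. The class name, feature dimensions, fusion rule, and posterior family are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class BayesianGatedFusion(nn.Module):
    # Illustrative sketch (not the paper's exact design): fuse audio and
    # visual features through a gate whose weights are sampled from a
    # learned mean-field Gaussian posterior.
    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        in_dim = audio_dim + visual_dim
        # Variational parameters of q(w) = N(mu, sigma^2) over gate weights.
        self.w_mu = nn.Parameter(torch.zeros(in_dim, 1))
        self.w_logvar = nn.Parameter(torch.full((in_dim, 1), -5.0))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        x = torch.cat([audio, visual], dim=-1)
        # Reparameterization trick: w = mu + sigma * eps, eps ~ N(0, I),
        # so gradients flow through the sampled gate weights.
        eps = torch.randn_like(self.w_mu)
        w = self.w_mu + torch.exp(0.5 * self.w_logvar) * eps
        # Scalar gate in (0, 1) controlling the visual contribution.
        g = torch.sigmoid(x @ w + self.b)
        # Audio stream passes through; visual stream is scaled by the gate.
        return torch.cat([audio, g * visual], dim=-1)

# Usage with hypothetical feature sizes: 40-dim audio, 30-dim visual.
fusion = BayesianGatedFusion(audio_dim=40, visual_dim=30)
fused = fusion(torch.randn(8, 40), torch.randn(8, 30))  # shape (8, 70)

In this sketch, resampling the gate weights at each forward pass (together with a KL regularizer toward a prior, omitted here) is what lets the gate express uncertainty rather than a single point estimate, which is the motivation the abstract gives for handling scarce and variable dysarthric data.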


DOI: 10.21437/Interspeech.2019-1536

Cite as: Liu, S., Hu, S., Wang, Y., Yu, J., Su, R., Liu, X., Meng, H. (2019) Exploiting Visual Features Using Bayesian Gated Neural Networks for Disordered Speech Recognition. Proc. Interspeech 2019, 4120-4124, DOI: 10.21437/Interspeech.2019-1536.


@inproceedings{Liu2019,
  author={Shansong Liu and Shoukang Hu and Yi Wang and Jianwei Yu and Rongfeng Su and Xunying Liu and Helen Meng},
  title={{Exploiting Visual Features Using Bayesian Gated Neural Networks for Disordered Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4120--4124},
  doi={10.21437/Interspeech.2019-1536},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1536}
}