A Multi-Accent Acoustic Model Using Mixture of Experts for Speech Recognition

Abhinav Jain, Vishwanath P. Singh, Shakti P. Rath

A major challenge in Automatic Speech Recognition (ASR) systems is handling speech from a diverse set of accents. A model trained on a single accent performs rather poorly when confronted with different accents. One solution is a multi-condition model trained on all the accents. However, the performance improvement from this approach may be rather limited. Alternatively, accent-specific models may be trained, but they become impractical as the number of accents increases. In this paper, we propose a novel acoustic model architecture based on Mixture of Experts (MoE) which works well on multiple accents without the overhead of training separate models for separate accents. This work builds on our earlier work, termed MixNet, where we showed a performance improvement by separating phonetic class distributions in the feature space. Here, we propose an architecture that compensates for both phonetic and accent variabilities, which leads to even better discrimination among the classes. These variabilities are learned in a joint framework and produce consistent improvements over all the individual accents, amounting to an overall 18% relative improvement in accuracy compared to a baseline trained in multi-condition style.
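The abstract does not spell out the architecture, so the following is only a minimal sketch of a generic soft-gated Mixture of Experts layer, not the authors' model. It illustrates the general idea of a gating network weighting several expert sub-networks per acoustic frame. All names, dimensions, and the random initialization (`MixtureOfExperts`, `in_dim=40`, `num_experts=4`, and so on) are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MixtureOfExperts:
    """Generic soft-gated MoE layer (illustrative, untrained weights)."""

    def __init__(self, in_dim, out_dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # One linear expert per (hypothetical) accent or phonetic cluster.
        self.expert_w = rng.normal(0.0, 0.1, (num_experts, in_dim, out_dim))
        self.expert_b = np.zeros((num_experts, out_dim))
        # Gating network: maps each frame to a weight per expert.
        self.gate_w = rng.normal(0.0, 0.1, (in_dim, num_experts))
        self.gate_b = np.zeros(num_experts)

    def forward(self, x):
        """x has shape (batch, in_dim); returns (posteriors, gates)."""
        # Gate weights sum to 1 across experts for every frame.
        gates = softmax(x @ self.gate_w + self.gate_b)
        # Per-expert outputs, shape (batch, num_experts, out_dim).
        expert_out = np.einsum("bi,eio->beo", x, self.expert_w) + self.expert_b
        # Gate-weighted combination of expert outputs, shape (batch, out_dim).
        mixed = np.einsum("be,beo->bo", gates, expert_out)
        # Softmax over output classes gives per-frame class posteriors.
        return softmax(mixed), gates


# Usage with made-up sizes: 3 frames of 40-dim features, 10 output classes.
moe = MixtureOfExperts(in_dim=40, out_dim=10, num_experts=4)
frames = np.random.default_rng(1).normal(size=(3, 40))
posteriors, gates = moe.forward(frames)
```

In this sketch every expert sees every frame and the gate decides how much each one contributes, so a single model can, in principle, specialize parts of its capacity without separate per-accent training runs.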

DOI: 10.21437/Interspeech.2019-1667

Cite as: Jain, A., Singh, V.P., Rath, S.P. (2019) A Multi-Accent Acoustic Model Using Mixture of Experts for Speech Recognition. Proc. Interspeech 2019, 779-783, DOI: 10.21437/Interspeech.2019-1667.

  author={Abhinav Jain and Vishwanath P. Singh and Shakti P. Rath},
  title={{A Multi-Accent Acoustic Model Using Mixture of Experts for Speech Recognition}},
  booktitle={Proc. Interspeech 2019},