Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention

Qiang Huang, Thomas Hain


In this paper, we propose to detect mismatches between speech and transcriptions using deep neural networks. Although it is generally assumed that there are no mismatches in some speech-related applications, such errors are hard to avoid in practice. Moreover, training a model on mismatched data is likely to degrade its performance. In our work, instead of detecting errors by computing the distance between manual transcriptions and text strings obtained from a speech recogniser, we view mismatch detection as a classification task and merge speech and transcription features using deep neural networks. To enhance detection ability, our approach uses a cross-modal attention mechanism that learns the relevance between the features obtained from the two modalities. To evaluate the effectiveness of our approach, we test it on Factored WSJCAM0 by randomly introducing three kinds of mismatch: word deletion, insertion, and substitution. To test its robustness, we train our models on a small number of samples and detect mismatches with different numbers of words being removed, inserted, or substituted. In our experiments, the results show that the detection accuracy of our approach is close to 80% on insertion and deletion and outperforms the baseline.
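The core idea the abstract describes, attending from transcription tokens over acoustic frames so each token is paired with its most relevant speech context before classification, can be sketched as plain scaled dot-product cross-attention. This is a minimal illustration, not the authors' implementation: the feature dimensions, the fusion by concatenation, and the downstream classifier are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, speech_feats):
    """Attend from text tokens (queries) over speech frames (keys/values).

    text_feats:   (T_text, d)   transcription token embeddings
    speech_feats: (T_speech, d) acoustic frame embeddings
    Returns a (T_text, d) speech context per token and the attention weights.
    """
    d = text_feats.shape[-1]
    scores = text_feats @ speech_feats.T / np.sqrt(d)  # (T_text, T_speech)
    weights = softmax(scores, axis=-1)                 # frame relevance per token
    context = weights @ speech_feats                   # (T_text, d)
    return context, weights

# Toy example: 4 transcription tokens, 10 speech frames, dimension 8.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
speech = rng.normal(size=(10, 8))
context, weights = cross_modal_attention(text, speech)

# A mismatch classifier could then fuse each token with its attended speech
# context and feed the result to a feed-forward network (hypothetical fusion).
fused = np.concatenate([text, context], axis=-1)       # (4, 16)
```

In a learned model the queries, keys, and values would come from trained projections of each modality's encoder outputs; a token whose attention finds no well-matching frames (e.g. after a deletion or substitution) yields a context that the classifier can flag as a mismatch.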


DOI: 10.21437/Interspeech.2019-2125

Cite as: Huang, Q., Hain, T. (2019) Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention. Proc. Interspeech 2019, 584-588, DOI: 10.21437/Interspeech.2019-2125.


@inproceedings{Huang2019,
  author={Qiang Huang and Thomas Hain},
  title={{Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={584--588},
  doi={10.21437/Interspeech.2019-2125},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2125}
}