ISCA Archive Interspeech 2021

Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech

Cassia Valentini-Botinhao, Simon King

Sequence-to-sequence speech synthesis models are notorious for gross errors such as skipping and repetition, commonly associated with failures in the attention mechanism. While much work has been done to improve attention and reduce such errors, this paper focuses instead on automatic error detection and analysis. We evaluated three objective metrics against error detection scores collected through human listening. All metrics are derived from the synthesised attention matrix alone and do not require a reference signal, relying on the expectation that errors occur when attention is dispersed or insufficient. Using one of these metrics as an analysis tool, we observed that gross errors are more likely to occur in longer sentences and in sentences with punctuation marks that indicate a pause or break. We also found that mechanisms such as forcibly incremented attention have the potential to decrease gross errors, but to the detriment of naturalness. The error detection evaluation revealed that two of the evaluated metrics detect errors with a relatively high success rate, obtaining F-scores of up to 0.89 and 0.96.
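The abstract describes metrics computed from the attention matrix alone, under the assumption that errors coincide with dispersed attention. As a hedged illustration of that idea (not necessarily one of the paper's exact metrics), mean per-frame entropy of the attention weights is one simple way to quantify dispersion:

```python
import numpy as np

def attention_entropy(att):
    """Mean per-frame entropy of a (decoder steps x encoder steps)
    attention matrix; higher values indicate more dispersed attention.
    Illustrative sketch only -- not the metric defined in the paper."""
    att = att / att.sum(axis=1, keepdims=True)  # renormalise each frame
    ent = -(att * np.log(att + 1e-12)).sum(axis=1)
    return ent.mean()

# Near-diagonal (focused) attention yields low entropy;
# uniform (dispersed) attention yields the maximum, log(N).
focused = np.eye(5) * 0.96 + 0.01
dispersed = np.full((5, 5), 0.2)
assert attention_entropy(focused) < attention_entropy(dispersed)
```

A threshold on such a score could then flag utterances likely to contain skips or repetitions without needing a reference signal.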

doi: 10.21437/Interspeech.2021-286

Cite as: Valentini-Botinhao, C., King, S. (2021) Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech. Proc. Interspeech 2021, 2746-2750, doi: 10.21437/Interspeech.2021-286

@inproceedings{valentinibotinhao21_interspeech,
  author={Cassia Valentini-Botinhao and Simon King},
  title={{Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={2746--2750},
  doi={10.21437/Interspeech.2021-286}
}