ISCA Archive SSW 2021

FeatherTTS: Robust and Efficient attention based Neural TTS

Qiao Tian, Chao Liu, Zewang Zhang, Heng Lu, Linghui Chen, Bin Wei, Pujiang He, Shan Liu

Attention-based neural TTS is an elegant speech synthesis pipeline and has shown a powerful ability to generate natural speech. However, it is still not robust enough to meet the stability requirements of industrial products. It also suffers from slow inference speed owing to the autoregressive generation process. In this work, we propose FeatherTTS, a robust and efficient attention-based neural TTS system. First, we propose a novel Gaussian attention that exploits the interpretability of Gaussian attention and the strictly monotonic property of alignment in TTS. With this method, we replace the commonly used stop-token prediction architecture with attentive stop prediction. Second, we apply block sparsity to the autoregressive decoder to speed up speech synthesis. The experimental results show that the proposed FeatherTTS not only nearly eliminates word skipping and repeating on particularly hard texts while keeping the naturalness of the generated speech, but also speeds up acoustic feature generation by 3.5 times over Tacotron. Overall, the proposed FeatherTTS can be 35x faster than real time on a single CPU.
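The abstract does not give the attention equations, so the following is only a minimal sketch of how a monotonic Gaussian attention with an attention-driven stop rule could look, not the authors' implementation. Every name here (`gaussian_attention_step`, `should_stop`, `decode_alignment`), the softplus parameterization of the step size and width, and the `margin` threshold are assumptions made for illustration. The idea shown is that the Gaussian mean can only move forward over the input tokens, and decoding stops once the mean has moved past the last token instead of relying on a separately predicted stop token.

```python
import numpy as np


def gaussian_attention_step(mean_prev, delta_raw, sigma_raw, num_tokens):
    """One decoder step of a (hypothetical) monotonic Gaussian attention.

    delta_raw and sigma_raw stand in for unconstrained network outputs.
    """
    # Softplus keeps the step strictly positive, so the mean never moves back.
    delta = np.log1p(np.exp(delta_raw))
    # Softplus plus a small floor keeps the width positive (assumed choice).
    sigma = np.log1p(np.exp(sigma_raw)) + 1e-3
    mean = mean_prev + delta

    # Unnormalized Gaussian scores over input token positions 0..num_tokens-1.
    positions = np.arange(num_tokens)
    scores = np.exp(-0.5 * ((positions - mean) / sigma) ** 2)
    weights = scores / (scores.sum() + 1e-8)
    return mean, weights


def should_stop(mean, num_tokens, margin=1.0):
    """Attention-driven stop rule (assumed form): stop once the attention
    mean has passed the last input token by `margin` positions."""
    return mean > (num_tokens - 1) + margin


def decode_alignment(delta_raws, sigma_raws, num_tokens, max_steps=1000):
    """Run the attention over a sequence of raw step/width values and
    return the attention mean at each decoder step until the stop rule fires."""
    mean = 0.0
    means = []
    for step in range(min(max_steps, len(delta_raws))):
        mean, _ = gaussian_attention_step(
            mean, delta_raws[step], sigma_raws[step], num_tokens
        )
        means.append(mean)
        if should_stop(mean, num_tokens):
            break
    return means


if __name__ == "__main__":
    # Constant raw outputs of 0.0 give a step of softplus(0) = ln 2 per frame.
    means = decode_alignment([0.0] * 50, [0.0] * 50, num_tokens=5)
    print(len(means), means[-1])
```

Because the step is always positive, the alignment in this sketch cannot revisit or jump back to earlier tokens, which is the property the abstract associates with fewer skipped and repeated words. A real system would predict `delta_raw` and `sigma_raw` from the decoder state at each step.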


doi: 10.21437/SSW.2021-35

Cite as: Tian, Q., Liu, C., Zhang, Z., Lu, H., Chen, L., Wei, B., He, P., Liu, S. (2021) FeatherTTS: Robust and Efficient attention based Neural TTS. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 200-204, doi: 10.21437/SSW.2021-35

@inproceedings{tian21_ssw,
  author={Qiao Tian and Chao Liu and Zewang Zhang and Heng Lu and Linghui Chen and Bin Wei and Pujiang He and Shan Liu},
  title={{FeatherTTS: Robust and Efficient attention based Neural TTS}},
  year=2021,
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},
  pages={200--204},
  doi={10.21437/SSW.2021-35}
}