ISCA Archive Interspeech 2021

Triple M: A Practical Text-to-Speech Synthesis System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet

Shilun Lin, Fenglong Xie, Li Meng, Xinhui Li, Li Lu

In this work, a robust and efficient text-to-speech (TTS) synthesis system named Triple M is proposed for large-scale online application. The key components of Triple M are: 1) A sequence-to-sequence model that adopts a novel multi-guidance attention mechanism to transfer the complementary advantages of guiding attention mechanisms to the basic attention mechanism, without in-domain performance loss or modification of the online service. Compared with a single attention mechanism, multi-guidance attention not only brings better naturalness to long-sentence synthesis, but also reduces the word error rate by 26.8%. 2) A new, efficient multi-band multi-time vocoder framework, which reduces the computational complexity from 2.8 to 1.0 GFLOP and speeds up LPCNet by 2.75× on a single CPU.

doi: 10.21437/Interspeech.2021-851

Cite as: Lin, S., Xie, F., Meng, L., Li, X., Lu, L. (2021) Triple M: A Practical Text-to-Speech Synthesis System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet. Proc. Interspeech 2021, 3640-3644, doi: 10.21437/Interspeech.2021-851

@inproceedings{lin21g_interspeech,
  author={Shilun Lin and Fenglong Xie and Li Meng and Xinhui Li and Li Lu},
  title={{Triple M: A Practical Text-to-Speech Synthesis System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3640--3644},
  doi={10.21437/Interspeech.2021-851}
}