Despite recent advances in text-to-speech (TTS) technology, auto-narration of long-form content such as books remains a challenge. The goal of this work is to enhance neural TTS to be suitable for long-form content such as audiobooks. In addition to high quality, we aim to provide a compelling and engaging listening experience with expressivity that spans beyond a single sentence to a paragraph level so that the user can not only follow the story but also enjoy listening to it. Towards that goal, we made four enhancements to our baseline TTS system: incorporation of BERT embeddings, explicit prosody prediction from text, long-context modeling over multiple sentences, and pre-training on long-form data. We propose an evaluation framework tailored to long-form content that evaluates the synthesis on segments spanning multiple paragraphs and focuses on elements such as comprehension, ease of listening, ability to keep attention, and enjoyment. The evaluation results show that the proposed approach outperforms the baseline on all evaluated metrics, with an absolute 0.47 MOS gain in overall quality. Ablation studies further confirm the effectiveness of the proposed enhancements.
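As a rough illustration of two of the enhancements named above, BERT embeddings and long-context modeling over multiple sentences, the sketch below derives a sentence-level BERT embedding for each sentence in a paragraph and stacks it with embeddings of the preceding sentences, producing a context vector that a TTS encoder could be conditioned on. This is not the authors' implementation, whose architecture the abstract does not specify: the model choice ("bert-base-uncased"), the mean pooling, and the two-sentence context window are all illustrative assumptions.

# Minimal sketch of paragraph-level BERT conditioning for TTS.
# Assumptions (not from the paper): bert-base-uncased, mean pooling,
# a two-sentence context window, zero-padding where context is missing.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states into one vector per sentence."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # (768,)

def context_embeddings(sentences: list[str], window: int = 2) -> torch.Tensor:
    """For each sentence, concatenate its embedding with up to `window`
    preceding-sentence embeddings (zeros where no context exists), giving
    a TTS encoder a view that spans beyond the current sentence."""
    embs = [sentence_embedding(s) for s in sentences]
    dim = embs[0].shape[0]
    rows = []
    for i in range(len(embs)):
        ctx = embs[max(0, i - window):i]
        pad = [torch.zeros(dim)] * (window - len(ctx))
        rows.append(torch.cat(pad + ctx + [embs[i]]))
    return torch.stack(rows)  # (num_sentences, (window + 1) * 768)

paragraph = [
    "The rain had not stopped for three days.",
    "Inside the lighthouse, Mara counted the tins of food again.",
    "There was enough for a week, no more.",
]
print(context_embeddings(paragraph).shape)  # torch.Size([3, 2304])

In a full system, each row of this matrix would be broadcast over the phoneme-level encoder states of the corresponding sentence; how the paper actually injects the embeddings (and how it performs prosody prediction and long-form pre-training) is described in the body of the paper, not here.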
Cite as: Zhang, W., Yeh, C.-C., Beckman, W., Raitio, T., Rasipuram, R., Golipour, L., Winarsky, D. (2023) Audiobook synthesis with long-form neural text-to-speech. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 139-143, doi: 10.21437/SSW.2023-22
@inproceedings{zhang23_ssw,
  author={Weicheng Zhang and Cheng-Chieh Yeh and Will Beckman and Tuomo Raitio and Ramya Rasipuram and Ladan Golipour and David Winarsky},
  title={{Audiobook synthesis with long-form neural text-to-speech}},
  year={2023},
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={139--143},
  doi={10.21437/SSW.2023-22}
}