ISCA Archive SSW 2021

Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings

Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Naoko Tanji, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari

This paper proposes an audiobook speech synthesis method that considers context beyond the sentence level. The style of audiobook speech depends not only on the sentence being synthesized but also on its neighboring sentences. Therefore, unlike conventional text-to-speech synthesis of isolated sentences, audiobook synthesis must take the neighboring sentences into account. Our method uses cross-sentence context-aware word embeddings, obtained by feeding the current sentence together with its neighboring sentences into BERT. The speech synthesis model, Tacotron2, is conditioned on these word embeddings in addition to the current sentence. Experimental results show that taking neighboring sentences into account significantly improves synthetic speech quality.
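The idea of a cross-sentence context-aware word embedding can be illustrated with a minimal sketch: the current sentence is encoded together with its neighbors, and only the vectors belonging to the current sentence are kept. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the context window size, the tokenization, or how the embeddings are fed to Tacotron2, so `n_context`, the whitespace tokenization, and all function names below are hypothetical, and `toy_encoder` is a stand-in for a BERT forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[List[str]], List[Vector]]


def build_context_window(
    sentences: Sequence[str], index: int, n_context: int = 1
) -> Tuple[List[str], int, int]:
    """Concatenate the current sentence with up to n_context neighbors per side.

    Returns the joined token list and the [start, end) span that the current
    sentence occupies in it. Whitespace tokenization is an assumption made
    for illustration; BERT itself uses subword tokens.
    """
    lo = max(0, index - n_context)
    hi = min(len(sentences), index + n_context + 1)
    tokens: List[str] = []
    start = end = 0
    for i in range(lo, hi):
        words = sentences[i].split()
        if i == index:
            start = len(tokens)
            end = start + len(words)
        tokens.extend(words)
    return tokens, start, end


def context_aware_embeddings(
    sentences: Sequence[str], index: int, encoder: Encoder, n_context: int = 1
) -> List[Vector]:
    """Encode the whole window, then keep only the current sentence's vectors."""
    tokens, start, end = build_context_window(sentences, index, n_context)
    vectors = encoder(tokens)  # one vector per token in the window
    return vectors[start:end]


def toy_encoder(tokens: List[str]) -> List[Vector]:
    """Hypothetical stand-in for BERT: [token length, window size] per token.

    The second component depends on the whole window, which mimics how a
    contextual encoder makes each word vector depend on neighboring sentences.
    """
    n = float(len(tokens))
    return [[float(len(t)), n] for t in tokens]


if __name__ == "__main__":
    book = ["It was late .", "She opened the door .", "Nobody was there ."]
    for vec in context_aware_embeddings(book, index=1, encoder=toy_encoder):
        print(vec)
```

With the toy encoder, the vectors for the middle sentence change when `n_context` changes, even though the sentence itself is the same. That dependence on the surrounding sentences is the property the abstract refers to; in practice the stand-in encoder would be replaced by a pretrained BERT model, and subword vectors would need to be aligned to words.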


doi: 10.21437/SSW.2021-37

Cite as: Nakata, W., Koriyama, T., Takamichi, S., Tanji, N., Ijima, Y., Masumura, R., Saruwatari, H. (2021) Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 211-215, doi: 10.21437/SSW.2021-37

@inproceedings{nakata21_ssw,
  author={Wataru Nakata and Tomoki Koriyama and Shinnosuke Takamichi and Naoko Tanji and Yusuke Ijima and Ryo Masumura and Hiroshi Saruwatari},
  title={{Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings}},
  year=2021,
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},
  pages={211--215},
  doi={10.21437/SSW.2021-37}
}