ISCA Archive SSW 2023

Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, we present a method for synthesizing spontaneous speech that improves robustness to diverse FP insertions. Regularization is used to stabilize the synthesis of the linguistic speech (i.e., non-FP) elements. To further improve robustness to diverse FP insertions, the method utilizes pseudo-FPs sampled using an FP word prediction model as well as ground-truth FPs. Our experiments demonstrated that the proposed method improves the naturalness of synthetic speech with ground-truth and predicted FPs by 0.24 and 0.26, respectively.
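
The pseudo-FP insertion step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that an FP word prediction model has already produced, for the slot before each word, a probability distribution over "no FP" plus a small FP vocabulary. The function name `insert_pseudo_fps`, the `slot_probs` layout, and the placeholder FP words are all hypothetical.

```python
import random
from typing import List, Sequence

# Hypothetical marker meaning "insert nothing at this slot".
NO_FP = None


def insert_pseudo_fps(
    words: Sequence[str],
    slot_probs: Sequence[Sequence[float]],
    fp_vocab: Sequence[str],
    rng: random.Random,
) -> List[str]:
    """Sample one pseudo-FP (or none) for the slot before each word.

    Assumption: slot_probs[i] holds the predicted weights for the slot
    before words[i], ordered as [no FP, fp_vocab[0], fp_vocab[1], ...].
    """
    if len(slot_probs) != len(words):
        raise ValueError("need one distribution per word")

    options = [NO_FP] + list(fp_vocab)
    output: List[str] = []
    for word, probs in zip(words, slot_probs):
        if len(probs) != len(options):
            raise ValueError("distribution size must be len(fp_vocab) + 1")
        # Draw from the predicted distribution instead of taking the argmax,
        # so repeated calls yield different FP-inserted word sequences.
        sampled = rng.choices(options, weights=probs, k=1)[0]
        if sampled is not NO_FP:
            output.append(sampled)
        output.append(word)
    return output
```

Under these assumptions, calling the function several times on the same utterance with different random states would yield differently FP-inserted word sequences, which could then be used for training alongside the ground-truth FP sequence.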


doi: 10.21437/SSW.2023-10

Cite as: Matsunaga, Y., Saeki, T., Takamichi, S., Saruwatari, H. (2023) Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 62-68, doi: 10.21437/SSW.2023-10

@inproceedings{matsunaga23_ssw,
  author={Yuta Matsunaga and Takaaki Saeki and Shinnosuke Takamichi and Hiroshi Saruwatari},
  title={{Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={62--68},
  doi={10.21437/SSW.2023-10}
}