This paper analyzes the prosodic differences between a professional newscaster and amateur speakers that affect listeners' perceptual impressions. The speech of a professional newscaster readily conveys the speaker's occupation. Although listeners perceive many factors in human speech, it has not been established which factors are dominant in making a speaker sound professional. To address this, we conduct a large-scale perceptual experiment using speech synthesized by deep neural network (DNN)-based speech synthesis. In the synthesized stimuli, prosodic features such as phoneme duration or F0 are partially substituted with those of target speakers by changing the DNN trained on professional and amateur speakers. To exclude the influence of voice quality, spectral features with the same speaker characteristics are used. Listeners are asked to choose the speech sample they find more acceptable as the speech of a newscaster. The results of the perceptual experiment indicate that, although both features affect listeners' impressions, those impressions are affected more by F0 than by phoneme duration. We further analyze the relation between the obtained perceptual scores and several prosody-related features. The analysis suggests that the larger the standard deviation (SD) of the F0 pattern, the more listeners perceive the speech as professional.
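As a rough illustration of the last finding, the SD of an F0 pattern could be computed along the following lines. This is a minimal sketch, not the authors' analysis code: the function name `f0_sd`, the convention that unvoiced frames are stored as 0.0, the use of F0 in Hz rather than on a log scale, and the two example contours are all assumptions made here for illustration.

```python
import statistics


def f0_sd(f0_track, unvoiced_value=0.0):
    """Return the population SD of F0 over voiced frames only.

    Assumption: f0_track is a sequence of per-frame F0 values in Hz,
    with unvoiced frames marked by `unvoiced_value`.
    """
    # Drop unvoiced frames so they do not inflate the spread.
    voiced = [f for f in f0_track if f != unvoiced_value]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return statistics.pstdev(voiced)


# Hypothetical contours (not data from the paper).
flat_contour = [0.0, 120.0, 122.0, 121.0, 0.0, 119.0, 118.0]
lively_contour = [0.0, 110.0, 150.0, 180.0, 0.0, 140.0, 95.0]

print(f0_sd(flat_contour))
print(f0_sd(lively_contour))
```

Under these assumptions, the second contour yields the larger SD, which is the direction the abstract associates with speech perceived as more professional.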
Cite as: Ozuru, T., Ijima, Y., Saito, D., Minematsu, N. (2020) Are you professional?: Analysis of prosodic features between a newscaster and amateur speakers through partial substitution by DNN-TTS. Proc. Speech Prosody 2020, 920-924, doi: 10.21437/SpeechProsody.2020-188
@inproceedings{ozuru20_speechprosody, author={Takuya Ozuru and Yusuke Ijima and Daisuke Saito and Nobuaki Minematsu}, title={{Are you professional?: Analysis of prosodic features between a newscaster and amateur speakers through partial substitution by DNN-TTS}}, year=2020, booktitle={Proc. Speech Prosody 2020}, pages={920--924}, doi={10.21437/SpeechProsody.2020-188} }