A Comparative Study of Statistical Conversion of Face to Voice Based on Their Subjective Impressions

Yasuhito Ohsugi, Daisuke Saito, Nobuaki Minematsu


Recently, various types of Voice-based User Interfaces (VUIs), including smart speakers, have been developed and brought to market. However, many of these VUIs use only synthetic voices to provide information to users. To realize a more natural interface, one feasible solution is to personify VUIs by adding visual features such as a face; however, what kind of face is suited to a given quality of voice, and what kind of voice quality is suited to a given face? In this paper, we test methods of statistical conversion from face to voice based on their subjective impressions. To this end, six combinations of two types of face features, one type of speech features, and three types of conversion models are tested using a parallel corpus developed through subjective mapping from face features to voice features. The experimental results show that each subject judges one specific, subject-dependent voice quality as suited to different faces, and that the optimal number of mixtures for face features differs from the numbers of mixtures tested for voice features.
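Since the abstract mentions mixture-based conversion models mapping face features to voice features, the general idea can be illustrated with the conditional-mean mapping rule of a joint Gaussian mixture model, a standard technique in statistical voice conversion. The sketch below is an illustration under assumed toy parameters, not the paper's actual models or features: given per-mixture weights, means, and covariance blocks of a joint GMM over concatenated face/voice feature vectors, it predicts the expected voice features for an input face-feature vector.

```python
import numpy as np

def gmm_conditional_mean(x, weights, mu_x, mu_y, S_xx, S_yx):
    """Map a face-feature vector x to an expected voice-feature vector
    using the conditional mean of a joint GMM:
      y_hat = sum_m P(m|x) * (mu_y[m] + S_yx[m] S_xx[m]^-1 (x - mu_x[m]))
    All parameters are assumed toy values for illustration only."""
    M, dx = mu_x.shape
    resp = np.empty(M)
    for m in range(M):
        diff = x - mu_x[m]
        inv = np.linalg.inv(S_xx[m])
        norm = 1.0 / np.sqrt(((2.0 * np.pi) ** dx) * np.linalg.det(S_xx[m]))
        resp[m] = weights[m] * norm * np.exp(-0.5 * diff @ inv @ diff)
    resp /= resp.sum()  # posterior P(m|x) over mixtures

    y_hat = np.zeros(mu_y.shape[1])
    for m in range(M):
        reg = S_yx[m] @ np.linalg.inv(S_xx[m])  # per-mixture regression matrix
        y_hat += resp[m] * (mu_y[m] + reg @ (x - mu_x[m]))
    return y_hat

# Toy example: 2 mixtures, 2-D face features, 2-D voice features.
weights = np.array([0.5, 0.5])
mu_x = np.array([[0.0, 0.0], [10.0, 10.0]])
mu_y = np.array([[1.0, 2.0], [3.0, 4.0]])
S_xx = np.stack([np.eye(2), np.eye(2)])
S_yx = np.zeros((2, 2, 2))  # no cross-covariance: output falls back to mu_y

y = gmm_conditional_mean(np.array([0.0, 0.0]), weights, mu_x, mu_y, S_xx, S_yx)
```

With the input placed at the first mixture's mean and zero cross-covariance, the mapping returns (almost exactly) that mixture's voice mean, which makes the role of the mixture posterior easy to see.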


 DOI: 10.21437/Interspeech.2018-2005

Cite as: Ohsugi, Y., Saito, D., Minematsu, N. (2018) A Comparative Study of Statistical Conversion of Face to Voice Based on Their Subjective Impressions. Proc. Interspeech 2018, 1001-1005, DOI: 10.21437/Interspeech.2018-2005.


@inproceedings{Ohsugi2018,
  author={Yasuhito Ohsugi and Daisuke Saito and Nobuaki Minematsu},
  title={A Comparative Study of Statistical Conversion of Face to Voice Based on Their Subjective Impressions},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1001--1005},
  doi={10.21437/Interspeech.2018-2005},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2005}
}