Deep Metric Learning for the Target Cost in Unit-Selection Speech Synthesizer

Ruibo Fu, Jianhua Tao, Yibin Zheng, Zhengqi Wen


This paper describes a unified Deep Metric Learning (DML) framework that predicts the target cost directly through supervised learning. Conventional methods calculate the target cost in two separate steps: feature extraction followed by a standard distance measure. The proposed DML framework aims to measure the similarity between candidate units and target units more directly and more reasonably. First, a symmetrical DML framework is pre-trained to learn the metric between pairs of candidate units and target units. A relabeling procedure is added to correct the initially designed labels of the target cost. Second, the acoustic features of the target units are removed to match the runtime conditions of the unit-selection synthesizer, and an asymmetrical DML is fine-tuned to learn the metric between candidate units and target units. Compared with conventional methods, the unified DML framework avoids the accumulation of errors across separate steps and improves the accuracy of labeling and predicting the target cost. The evaluation results demonstrate that adopting the DML framework to predict the target cost improves the naturalness of synthetic speech.
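
The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function and class names, the single tanh projection per branch, the feature dimensions, and the random untrained weights are all assumptions chosen for illustration. The sketch shows a conventional two-step target cost (pre-extracted features compared with a fixed distance), a symmetric metric in which both units pass through one shared projection, and an asymmetric metric in which the target branch receives a smaller input because its acoustic features are unavailable at runtime.

```python
import numpy as np


def conventional_target_cost(candidate_feats, target_feats, weights=None):
    """Two-step baseline: features are extracted elsewhere, then compared
    with a fixed, optionally weighted, Euclidean distance."""
    c = np.asarray(candidate_feats, dtype=float)
    t = np.asarray(target_feats, dtype=float)
    w = np.ones_like(c) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * (c - t) ** 2)))


class SymmetricMetric:
    """Hypothetical symmetric metric: candidate and target share one
    projection, and the cost is the distance between their embeddings."""

    def __init__(self, in_dim, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Untrained random weights; a real system would learn these.
        self.W = rng.normal(scale=0.1, size=(in_dim, embed_dim))

    def embed(self, x):
        return np.tanh(np.asarray(x, dtype=float) @ self.W)

    def cost(self, candidate, target):
        return float(np.linalg.norm(self.embed(candidate) - self.embed(target)))


class AsymmetricMetric:
    """Hypothetical asymmetric metric: each side has its own projection,
    so the target input may omit acoustic features and be shorter."""

    def __init__(self, candidate_dim, target_dim, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Separate untrained weights for the two branches.
        self.W_c = rng.normal(scale=0.1, size=(candidate_dim, embed_dim))
        self.W_t = rng.normal(scale=0.1, size=(target_dim, embed_dim))

    def cost(self, candidate, target):
        e_c = np.tanh(np.asarray(candidate, dtype=float) @ self.W_c)
        e_t = np.tanh(np.asarray(target, dtype=float) @ self.W_t)
        return float(np.linalg.norm(e_c - e_t))
```

In this sketch, `SymmetricMetric(in_dim=6, embed_dim=4).cost(a, b)` compares two units described by the same six features, while `AsymmetricMetric(candidate_dim=6, target_dim=3, embed_dim=4).cost(a, b_linguistic)` compares a full candidate vector against a shorter target vector. The pre-training, relabeling, and fine-tuning steps described in the abstract are not modeled here.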


DOI: 10.21437/Interspeech.2018-1305

Cite as: Fu, R., Tao, J., Zheng, Y., Wen, Z. (2018) Deep Metric Learning for the Target Cost in Unit-Selection Speech Synthesizer. Proc. Interspeech 2018, 2514-2518, DOI: 10.21437/Interspeech.2018-1305.


@inproceedings{Fu2018,
  author={Ruibo Fu and Jianhua Tao and Yibin Zheng and Zhengqi Wen},
  title={Deep Metric Learning for the Target Cost in Unit-Selection Speech Synthesizer},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2514--2518},
  doi={10.21437/Interspeech.2018-1305},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1305}
}