The performance of corpus-based speech synthesis depends on the ability to appropriately model and represent the characteristics of the speech units that serve as the basis for concatenation. Although there is general agreement on the set of essential features (fundamental frequency, duration, power and phonetic context), how best to model them and weigh their respective contributions to the cost functions remains an open question, especially for the features related to phonetic context. This paper presents a new approach for modeling the phonetic context that also simplifies the difficult task of training the weights of the different features in the target cost function.
Cite as: Campillo Díaz, F., Alba, J.L., Rodríguez Banga, E. (2005) A neural network approach for the design of the target cost function in unit-selection speech synthesis. Proc. Interspeech 2005, 2533-2536, doi: 10.21437/Interspeech.2005-788
@inproceedings{campillodiaz05_interspeech,
  author={Francisco {Campillo Díaz} and José Luis Alba and Eduardo {Rodríguez Banga}},
  title={{A neural network approach for the design of the target cost function in unit-selection speech synthesis}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={2533--2536},
  doi={10.21437/Interspeech.2005-788}
}