This paper presents an approach that uses phonetic context similarity as a cost function in unit selection for concatenative text-to-speech (TTS) synthesis. The approach measures the degree of similarity between the desired context and candidate segments drawn from different phonetic contexts. When plenty of candidates are available, it takes the influence of relatively distant contexts into account; when candidates are sparse, it can take advantage of data from other, symbolically different contexts. The cost function also provides an efficient way to prune the search space. Different parameter choices for modeling, normalization, and integerization are discussed. A mean opinion score (MOS) evaluation shows that the approach can significantly improve synthesis quality.
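To make the idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a hypothetical class-based phone similarity (`PHONE_CLASS`, `phone_similarity`), a geometric distance weighting (`decay`), and an integer cost range of 0 to 255 (`scale`); none of these values come from the paper. Each context is assumed to be a list of neighbouring phones ordered from nearest to farthest, so a longer list corresponds to a wider context.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical phone classes, for illustration only. A real system would
# use a learned or knowledge-based similarity table.
PHONE_CLASS: Dict[str, str] = {
    "p": "stop", "b": "stop", "t": "stop", "d": "stop",
    "s": "fricative", "z": "fricative",
    "a": "vowel", "i": "vowel", "u": "vowel",
    "m": "nasal", "n": "nasal",
}


def phone_similarity(a: str, b: str) -> float:
    """Return 1.0 for identical phones, 0.5 for the same class, else 0.0."""
    if a == b:
        return 1.0
    class_a = PHONE_CLASS.get(a)
    if class_a is not None and class_a == PHONE_CLASS.get(b):
        return 0.5
    return 0.0


def context_cost(target: Sequence[str], candidate: Sequence[str],
                 decay: float = 0.5, scale: int = 255) -> int:
    """Integer cost in [0, scale]; 0 means the contexts match exactly.

    Position k (0 = nearest neighbour) is weighted by decay**k, so
    distant context matters less than adjacent context.
    """
    width = min(len(target), len(candidate))
    if width == 0:
        return scale  # no context to compare: assign the maximum cost
    weights = [decay ** k for k in range(width)]
    similarity = sum(w * phone_similarity(t, c)
                     for w, t, c in zip(weights, target, candidate))
    normalized = similarity / sum(weights)      # normalization to [0, 1]
    return round((1.0 - normalized) * scale)    # integerization


def prune(target: Sequence[str],
          candidates: List[Tuple[str, Sequence[str]]],
          keep: int = 2) -> List[Tuple[int, str]]:
    """Keep the `keep` candidates with the lowest context cost."""
    scored = sorted((context_cost(target, ctx), unit_id)
                    for unit_id, ctx in candidates)
    return scored[:keep]


if __name__ == "__main__":
    target_context = ["a", "t", "s"]
    candidate_units = [
        ("u1", ["a", "t", "s"]),  # identical context
        ("u2", ["a", "d", "z"]),  # symbolically different, same classes
        ("u3", ["m", "i", "p"]),  # unrelated context
    ]
    for cost, unit_id in prune(target_context, candidate_units):
        print(unit_id, cost)
```

In this sketch a candidate from a symbolically different context still receives partial credit when its neighbours belong to the same class, which is how sparse candidate sets can borrow from related contexts, and the sorted integer costs give a simple pruning criterion before the full search.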
Bibliographic reference. Zhang, Wei / Cui, Xiaodong (2010): "Applying scalable phonetic context similarity in unit selection of concatenative text-to-speech", In INTERSPEECH-2010, 154-157.