Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Refining Phoneme Segmentations Using Speaker-Adaptive Context Dependent Boundary Models

Yong Zhao (1), Lijuan Wang (2), Min Chu (1), Frank K. Soong (1), Zhigang Cao (2)

(1) Microsoft Research Asia, Beijing, China; (2) Tsinghua University, Beijing, China

Consistent phoneme segmentation is essential in building high-quality Text-to-Speech (TTS) voice fonts. In this paper we propose to adapt an existing, well-trained Context Dependent Boundary Model (CDBM) for refining segment boundaries to a new speaker, using a limited amount of manually segmented data. Three adaptation approaches are studied: MLLR, MAP, and a combination of the two. The combined approach, MLLR+MAP, delivers the best boundary refinement performance. Compared with other boundary segmentation methods, the adapted CDBM yields better results, especially when adaptation data is limited. Given 400 manually segmented boundary tokens in about 20 sentences as a development set, the segmentation precision reaches 90%, measured as agreement with human-labeled boundaries within a tolerance of 20 ms.
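
The 20 ms tolerance figure can be illustrated with a minimal sketch of how such a score is commonly computed: the fraction of automatic boundaries that fall within a fixed tolerance of the corresponding human-labeled boundaries. This is not the authors' evaluation code. The function name `boundary_accuracy`, the one-to-one pairing of boundaries, the inclusive comparison at exactly 20 ms, and the sample timestamps are all assumptions made for illustration.

```python
from typing import Sequence


def boundary_accuracy(
    auto_ms: Sequence[float],
    ref_ms: Sequence[float],
    tolerance_ms: float = 20.0,
) -> float:
    """Fraction of automatic boundaries within tolerance_ms of the reference.

    Assumes auto_ms[i] and ref_ms[i] mark the same phoneme boundary
    (hypothetical pairing; the paper does not specify its alignment).
    """
    if len(auto_ms) != len(ref_ms):
        raise ValueError("boundary lists must have the same length")
    if not ref_ms:
        raise ValueError("at least one boundary is required")
    # Count boundaries whose absolute deviation is within the tolerance
    # (inclusive comparison is an assumption).
    hits = sum(1 for a, r in zip(auto_ms, ref_ms) if abs(a - r) <= tolerance_ms)
    return hits / len(ref_ms)


# Made-up boundary times in milliseconds, for illustration only.
reference = [100, 250, 400, 580]
automatic = [105, 275, 395, 600]
score = boundary_accuracy(automatic, reference)
```

With these made-up values, three of the four boundaries deviate by 20 ms or less, so `score` is 0.75.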

Bibliographic reference. Zhao, Yong / Wang, Lijuan / Chu, Min / Soong, Frank K. / Cao, Zhigang (2005): "Refining phoneme segmentations using speaker-adaptive context dependent boundary models", in INTERSPEECH-2005, 2557-2560.