This paper presents an automatic phrase boundary labeling method for speech synthesis database annotation using context-dependent hidden Markov models (CD-HMMs) and n-gram prior distributions. At the training stage, CD-HMMs are built to describe the conditional distribution of acoustic features given the phonetic labels and phrase boundaries. In addition, n-gram models are estimated to represent the prior distributions of the phrase boundaries to be predicted. At the decoding stage, the CD-HMMs and n-gram models are combined to predict the phrase boundaries by Viterbi decoding under the maximum a posteriori (MAP) criterion. In our experiments, compared with the method using only acoustic models, the proposed method with context-dependent bigram prior distributions improved the F-score of phrase boundary labeling from 72.2% to 79.6% on the Boston University Radio News Corpus (BURNC) and from 69.6% to 81.0% on the Blizzard Challenge 2007 database.
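Under the MAP criterion, the decoder searches for the boundary sequence B* = argmax_B p(O | L, B) P(B), where O are the acoustic features, L the phonetic labels, p(O | L, B) comes from the CD-HMMs and P(B) from the n-gram prior. The following is a minimal sketch of that search, not the paper's implementation. It assumes a simplified setting: one binary boundary label per word (0 = no boundary, 1 = boundary), acoustic log-likelihoods that factorize per word and are already computed for each label, and a plain (context-independent) bigram prior. The function name `map_viterbi` and the `prior_weight` scale factor (analogous to a language-model scale in speech recognition) are hypothetical.

```python
from typing import List, Sequence, Tuple


def map_viterbi(
    acoustic_loglik: Sequence[Sequence[float]],  # [T][K]: log p(o_t | b_t = k), assumed precomputed
    log_bigram: Sequence[Sequence[float]],       # [K][K]: log P(b_t = k | b_{t-1} = j)
    log_initial: Sequence[float],                # [K]:    log P(b_1 = k)
    prior_weight: float = 1.0,                   # hypothetical scale on the prior term
) -> Tuple[List[int], float]:
    """Return the label sequence maximizing
    sum_t log p(o_t | b_t) + prior_weight * (log P(b_1) + sum_t log P(b_t | b_{t-1}))."""
    n_labels = len(log_initial)
    T = len(acoustic_loglik)
    if T == 0:
        return [], 0.0

    # Best partial score ending in each label at the first word.
    score = [prior_weight * log_initial[k] + acoustic_loglik[0][k] for k in range(n_labels)]
    backptr: List[List[int]] = []

    for t in range(1, T):
        new_score, ptrs = [], []
        for k in range(n_labels):
            # Best predecessor label j for current label k under the bigram prior.
            best_prev = max(range(n_labels),
                            key=lambda j: score[j] + prior_weight * log_bigram[j][k])
            new_score.append(score[best_prev]
                             + prior_weight * log_bigram[best_prev][k]
                             + acoustic_loglik[t][k])
            ptrs.append(best_prev)
        score = new_score
        backptr.append(ptrs)

    # Trace back from the best final label.
    last = max(range(n_labels), key=lambda k: score[k])
    best_score = score[last]
    path = [last]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path, best_score


if __name__ == "__main__":
    import math
    # Toy numbers for illustration only; they are not from the paper.
    ac = [[-1.0, -0.5], [-0.2, -1.5], [-0.9, -0.4], [-0.3, -0.6]]
    lb = [[math.log(0.7), math.log(0.3)], [math.log(0.9), math.log(0.1)]]
    li = [math.log(0.8), math.log(0.2)]
    print("acoustic only:", map_viterbi(ac, lb, li, prior_weight=0.0)[0])
    print("with prior:   ", map_viterbi(ac, lb, li)[0])
```

Setting `prior_weight=0.0` removes the prior term, so the search reduces to choosing the acoustically best label for each word independently, which loosely mirrors the acoustic-only baseline. The context-dependent bigram used in the paper would replace the fixed `log_bigram` table with probabilities that also depend on each word's context.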
Bibliographic reference. Chen, Qian / Ling, Zhen-Hua / Yang, Chen-Yu / Dai, Li-Rong (2015): "Automatic phrase boundary labeling of speech synthesis database using context-dependent HMMs and n-gram prior distributions", In INTERSPEECH-2015, 1581-1585.