In this paper, we present a comparative study of spontaneous and read Mandarin speech in the context of automatic speech recognition. We focus on the analysis and modeling of prosodic features, based on a unique speech corpus that contains similar amounts of read and spontaneous speech data from the same group of speakers. Statistical analysis is carried out on tone contours and on the durations of syllable and sub-syllable units. Speech recognition experiments are performed to evaluate the effectiveness of different approaches to incorporating prosodic features into acoustic modeling. A key problem addressed is how to deal with unvoiced frames, where F0 values are unavailable. We apply the multi-space distribution (MSD) technique to model partially continuous F0 contours. For spontaneous speech, the tonal-syllable error rate is reduced from the MFCC baseline of 64.8% to 59.4% with the MSD-based prosody model. For read speech, the error rate falls from 46.0% to 36.4%.
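To illustrate how an MSD can score a partially continuous F0 contour, the following is a minimal sketch and not the authors' implementation. It assumes a two-space distribution: a zero-dimensional "unvoiced" space that carries only a weight, and a one-dimensional "voiced" space modeled by a single Gaussian over log-F0. The function names, the parameter names (`voiced_weight`, `mean`, `var`), the use of log-F0, and the single-Gaussian simplification are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def msd_log_likelihood(
    log_f0: Optional[float],
    voiced_weight: float,
    mean: float,
    var: float,
) -> float:
    """Log-likelihood of one frame under a two-space MSD (illustrative).

    log_f0 is None for an unvoiced frame, otherwise the log-F0 value.
    voiced_weight is the prior weight of the voiced space; the unvoiced
    space takes the remaining weight, so the two weights sum to one.
    """
    if not 0.0 < voiced_weight < 1.0:
        raise ValueError("voiced_weight must lie strictly between 0 and 1")
    if var <= 0.0:
        raise ValueError("var must be positive")

    if log_f0 is None:
        # Unvoiced frame: the zero-dimensional space contributes only its weight.
        return math.log(1.0 - voiced_weight)

    # Voiced frame: space weight times a 1-D Gaussian density, in the log domain.
    log_gauss = -0.5 * (math.log(2.0 * math.pi * var) + (log_f0 - mean) ** 2 / var)
    return math.log(voiced_weight) + log_gauss


def contour_log_likelihood(
    contour: Sequence[Optional[float]],
    voiced_weight: float,
    mean: float,
    var: float,
) -> float:
    """Sum of frame log-likelihoods, treating frames as independent."""
    return sum(msd_log_likelihood(x, voiced_weight, mean, var) for x in contour)
```

Under these assumptions, unvoiced frames need no interpolated or placeholder F0 value: a contour such as `[None, 5.0, 5.1, None]` is scored directly, with each `None` frame contributing only the log of the unvoiced-space weight.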
Bibliographic reference. Yeung, Yu Ting / Qian, Yao / Lee, Tan / Soong, Frank K. (2008): "Prosody for Mandarin speech recognition: a comparative study of read and spontaneous speech", In INTERSPEECH-2008, 1133-1136.