International Symposium on Chinese Spoken Language Processing (ISCSLP 2000)
Fragrant Hill Hotel, Beijing
This paper proposes a method for context-independent (CI) Chinese initial-final acoustic modeling for continuous speech recognition. The initial-final (I/F) structure is a characteristic of the Chinese language. Initials and finals are smaller units than syllables, so using them helps reduce the number of speech recognition units (SRUs). It should also make it possible to build context-dependent (CD) models. In our experiments, we use knowledge-based criteria to define the CI initial-final units, and four kinds of CI initial-final units are considered in this paper. The experimental results show that the accuracy of the CI initial-final models is close to or lower than that of the CI syllable model, but the model size is significantly reduced.
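As a rough illustration of why initial-final units shrink the unit inventory, the sketch below splits toneless Hanyu Pinyin syllables into an initial and a final. It assumes a plain pinyin-based inventory, and the function name `split_initial_final` is hypothetical. It does not reproduce any of the four knowledge-based unit definitions used in the paper, which the abstract does not spell out. Syllables spelled with `y` or `w` are treated here as zero-initial, which is only one of several possible conventions.

```python
from typing import Tuple

# Standard pinyin initials. Two-letter initials come first so that
# "zh", "ch", "sh" are matched before "z", "c", "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")
VOWELS = "aeiouvü"


def split_initial_final(syllable: str) -> Tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, final).

    Returns an empty initial for zero-initial syllables such as "an".
    """
    s = syllable.lower()
    for ini in INITIALS:
        rest = s[len(ini):]
        # Accept the initial only if a vowel-initial final follows it.
        if s.startswith(ini) and rest and rest[0] in VOWELS:
            return ini, rest
    return "", s


# Compare inventory sizes on a toy syllable list (assumed data).
syllables = ["zhang", "zhong", "shang", "shong", "chang", "chong"]
pairs = [split_initial_final(s) for s in syllables]
units = {i for i, _ in pairs} | {f for _, f in pairs}
print(len(set(syllables)), "syllable units vs", len(units), "initial-final units")
```

On this toy list, six distinct syllables are covered by five initial-final units, and the gap widens as the syllable list grows toward the full Mandarin inventory.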
This study investigates the spectral characteristics of the vowels in Cantonese. Results show that (1) the vowels in the (C)V:S syllables undershoot in formant frequency relative to the canonical target formant pattern associated with the same vowels in the (C)V: syllables; (2) the center formant frequency values for the vowels in the (C)VS syllables are not representative of vowel quality because of the short vowel duration; and (3) the center formant frequencies for the vowels in the (C)V: and (C)V:S syllables can be useful for vowel transcription.
A corpus linguistic study is reported in this paper, guided
by articulatory phonology and by general phonetic principles of speech production. A
direct application of this study is the construction of Hidden Markov Model topologies for
automatic speech recognition, taking into account integrated multilingualism with the
consideration of the common physiological organs and processes involved in the production
of speech sounds from the world's languages. We demonstrate in this study that
incorporation of speech production principles can provide effective constraints on
modeling for the purpose of building language-universal speech recognizers.
Modeling pronunciation variation in spontaneous speech is very important for improving recognition accuracy. One limitation of current recognition systems is that their dictionaries contain only one standard pronunciation for each entry, so the amount of variability that can be modeled is very limited. In this paper, we propose generating rule-based pronunciation networks to replace the traditional dictionary in the decoder. The networks take the special structure of Chinese into account and incorporate acceptable variants of each Chinese syllable. An automatic learning algorithm is also designed to derive the variation rules. The proposed method was evaluated on the 1997 Hub4NE Mandarin Broadcast News Corpus using the HLTC stack decoder. The syllable recognition error rate was reduced by 3.20% absolute when both intra- and inter-syllable variations were modeled.
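The idea of replacing a single dictionary pronunciation with a network of rule-generated variants can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm. The rule tables `INITIAL_RULES` and `FINAL_RULES` are hand-written, hypothetical examples rather than rules learned from data. Only intra-syllable variation is covered, and all function names are invented for this sketch.

```python
from itertools import product
from typing import Dict, List, Tuple

Syllable = Tuple[str, str]  # (initial, final)

# Hypothetical variation rules: canonical unit -> acceptable variants.
# In the paper these rules are learned automatically; here they are
# written by hand purely for illustration.
INITIAL_RULES: Dict[str, List[str]] = {"zh": ["z"], "ch": ["c"], "sh": ["s"]}
FINAL_RULES: Dict[str, List[str]] = {"ing": ["in"], "eng": ["en"]}


def syllable_variants(initial: str, final: str) -> List[Syllable]:
    """Return the canonical pronunciation followed by its rule-based variants."""
    initials = [initial] + INITIAL_RULES.get(initial, [])
    finals = [final] + FINAL_RULES.get(final, [])
    return list(product(initials, finals))


def pronunciation_network(syllables: List[Syllable]) -> List[List[Syllable]]:
    """Build a simple network: one list of parallel alternatives per syllable."""
    return [syllable_variants(i, f) for i, f in syllables]


def count_paths(network: List[List[Syllable]]) -> int:
    """Number of distinct pronunciation paths through the network."""
    total = 1
    for alternatives in network:
        total *= len(alternatives)
    return total


network = pronunciation_network([("sh", "eng"), ("m", "a")])
print(network)
print(count_paths(network))
```

A decoder using such a network would score every parallel alternative at each syllable position instead of only the canonical entry. Inter-syllable variation would additionally require rules conditioned on the neighboring syllables, which this sketch leaves out.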
The prosodic word, its prominence, and the prosodic phrase are examined in this experiment. Prominence in the prosodic word is related to stress. It seems to us that hierarchical stress in the spoken sentence is one of the intonational cues in Chinese. Tone and intonation in Chinese are two different phonological events in the spoken sentence.
This paper introduces a read speech corpus and a spontaneous speech corpus to show how task-dependent speech corpora are collected and annotated. The segmental labeling convention SAMPA-C and the prosodic labeling convention C-ToBI are also described. Finally, known and new results are presented and compared for these two annotated corpora.