A novel acoustic modeling method for Chinese speech recognition based on an Intra-Syllable Dependent Phone (ISDP) set is proposed. The ISDP set extends the traditional phone set using the intra-syllable information of Chinese phonetic knowledge. The acoustic models based on the ISDP set (ISDPMs) have the following features. First, they are suitable when only a rather small amount of training data is available. Second, the scheme integrates tri-phone modeling and syllable modeling. Gaussian mixture densities are used to describe the feature space of each ISDP, and the Viterbi algorithm is adopted for decoding. In addition, an ISDP-syllable search tree is designed to reduce the decoding complexity. Our experimental results show that ISDP modeling is more flexible and faster than syllable modeling, with little loss in performance.
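The way a search tree can reduce decoding work by sharing phone prefixes across syllables can be illustrated with a small sketch. This is not the authors' ISDP-syllable tree: the phone labels and the toy lexicon are hypothetical, and a plain prefix tree (trie) stands in for whatever structure the paper actually uses.

```python
def new_node():
    """Create an empty tree node."""
    return {"children": {}, "syllables": []}


def build_tree(lexicon):
    """Build a prefix tree from a mapping {syllable: [phone, ...]}."""
    root = new_node()
    for syllable, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(phone, new_node())
        node["syllables"].append(syllable)
    return root


def count_nodes(node):
    """Count the phone nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node["children"].values())


def lookup(root, phones):
    """Return the syllables that end exactly at the given phone sequence."""
    node = root
    for phone in phones:
        node = node["children"].get(phone)
        if node is None:
            return []
    return node["syllables"]


# Hypothetical toy lexicon; the phone labels are illustrative only.
TOY_LEXICON = {
    "ba": ["b", "a"],
    "ban": ["b", "a", "n"],
    "bang": ["b", "a", "ng"],
    "ma": ["m", "a"],
    "man": ["m", "a", "n"],
}

tree = build_tree(TOY_LEXICON)
flat_phones = sum(len(phones) for phones in TOY_LEXICON.values())
tree_phones = count_nodes(tree)
print(f"phone models evaluated without sharing: {flat_phones}")
print(f"phone nodes in the shared tree: {tree_phones}")
```

In this toy case the shared tree holds fewer phone nodes than the flat list of syllable pronunciations, which is the kind of saving a tree-structured search space is meant to provide.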
A novel auditory nerve model is described that simulates the three types of auditory nerves in the auditory system. The model is motivated by the fact that current auditory models do not simulate the different types of auditory nerves. Building on previous work, three sub-models replace the single auditory nerve discharge model that prevails in common peripheral auditory models. Three auditory features were extracted and applied in speech recognition experiments. The results show that the three models have different noise-resistant properties and that the model with a large dynamic range is exceptionally robust in speech recognition.
This paper puts forward a hierarchical text processing model for Chinese TTS systems and defines a corresponding hierarchical labeling system. The implementation of the hierarchical processing is described in detail, with particular attention to the processing strategy at the sub-phrase layer.
A new syllable decomposition method, which uses the tone information of the main vowel in a syllable to distinguish the tone of the whole syllable, is proposed in this paper. Compared with the scheme in which a syllable is decomposed into an initial and a final and the tone information is carried by the final, the new scheme reduces the number of phonemes in the phone set of a recognition system. It successfully handles syllabic languages, especially those with complicated tonal phonology such as Cantonese, and can also be generalized to other tonal languages. Experiments on both Cantonese and Mandarin comparing systems that use the two schemes show that the new scheme achieves slightly better accuracy than the old one while dramatically reducing the number of phonemes in the recognition system. The method is therefore promising for use in practical speech recognition systems and products.
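The effect of moving the tone from the final to the main vowel on phone set size can be sketched with a toy inventory. Everything here is an assumption for illustration: the initials, the finals, the four tones, and in particular the split of each final into glide, main vowel, and coda are hypothetical and are not taken from the paper.

```python
TONES = [1, 2, 3, 4]
INITIALS = ["b", "m", "g"]

# Hypothetical finals, each split as (glide, main vowel, coda); "" means absent.
FINALS = {
    "a": ("", "a", ""),
    "an": ("", "a", "n"),
    "ang": ("", "a", "ng"),
    "ua": ("u", "a", ""),
    "uan": ("u", "a", "n"),
    "uang": ("u", "a", "ng"),
    "i": ("", "i", ""),
    "in": ("", "i", "n"),
    "ing": ("", "i", "ng"),
}


def decompose_initial_final(initial, final, tone):
    """Initial plus tonal final: the whole final carries the tone."""
    return [initial, f"{final}{tone}"]


def decompose_main_vowel(initial, final, tone):
    """Only the main vowel carries the tone; glide and coda are toneless."""
    glide, vowel, coda = FINALS[final]
    units = [initial]
    if glide:
        units.append(glide)
    units.append(f"{vowel}{tone}")
    if coda:
        units.append(coda)
    return units


def phone_set(decompose):
    """Collect every distinct unit needed to cover all toy syllables."""
    units = set()
    for initial in INITIALS:
        for final in FINALS:
            for tone in TONES:
                units.update(decompose(initial, final, tone))
    return units


print(decompose_initial_final("g", "uang", 1))
print(decompose_main_vowel("g", "uang", 1))
print(len(phone_set(decompose_initial_final)), "units with tonal finals")
print(len(phone_set(decompose_main_vowel)), "units with tonal main vowels")
```

With tonal finals the unit count grows with the number of finals times the number of tones, whereas with tonal main vowels only the vowels are multiplied by the tones.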
In this paper, a method of pitch contour modelling based on the hidden Markov model (HMM) states of an acoustic unit is presented. A pair of vectors is computed from the alignment of the speech data with the acoustic unit's HMM states. The pitch contour feature of the acoustic unit is represented by this vector pair, so that variants of the acoustic unit's pitch contour can be measured and compared. Using this model, pitch contour decision trees are constructed for Mandarin phones from a single speaker's continuous read-speech database. The trees are used in a Mandarin speech synthesis system, trained on the same database, to predict the pitch contour of a phone from its phone context. The naturalness of the synthesized Mandarin speech is greatly improved.
MAT (Mandarin speech data across Taiwan) is a telephone speech data collection project conducted in Taiwan from 1995 to 1998. Over 7000 speakers provided speech data through the public telephone system. Its outcome is a series of MAT databases. The plentiful speech data in the MAT databases are valuable material for studying the properties of Mandarin spoken in Taiwan. Some of these properties would be of interest to linguists and also useful for identifying the accent of Taiwanese speakers. In this paper, several prosodic features in the MAT databases are investigated: the stress patterns of disyllabic words, the intensity and duration of syllable finals, and the pitch patterns of lexical tones.
This paper presents a series of experiments on sub-syllabic unit selection across two Chinese dialects, Mandarin and Cantonese. Evaluations are based on syllable recognition using only acoustic information; no lexical knowledge is incorporated. We use a variety of sub-syllabic acoustic models, motivated by the phonological and linguistic structural characteristics of Chinese. Our results should provide a useful reference for work in large-vocabulary Chinese speech recognition, as well as related tasks such as spoken document retrieval.
The Mandarin Chinese synthesis component of the Dresden Speech Synthesizer DreSS is based on an inventory of syllabic units. The inventory contains all Chinese syllables with their possible tones in up to three phonetic variants, to model cross-syllable coarticulation effects correctly. To improve the naturalness and fluency of the synthesized speech, the inventory was complemented with prosodic alternative units for non-accented syllables, especially for neutral-tone particles. In this paper, two strategies for generating such units are compared: extraction from specially constructed carrier sentences and extraction from a read-speech corpus of newspaper texts. The results of a listening test show the best performance for the units from carrier sentences.
A CART-based stochastic model for predicting prosodic phrase breaks from Chinese input text is presented in this work. Almost all the features used in this model are obtained automatically. A novel and efficient algorithm, the LLW algorithm, is proposed here. Experiments demonstrate a high success rate in predicting prosodic phrase breaks from input sentences with little syntactic information (81% success rate, 6.1% false rate).
This paper presents the implementation of a Min-Nan text-to-speech (TTS) system. The system is designed on the same principle as a previously proposed Mandarin TTS system. It takes 877 base-syllables as basic synthesis units and uses a recurrent neural network (RNN) based prosody synthesizer to generate proper prosodic parameters for synthesizing natural output speech. It is implemented in software and runs in real time on a PC. An informal subjective listening test confirmed that the system performed well: all synthetic speech sounded good for well-tokenized texts and fair for texts with automatic tokenization.
Motivated by the characteristics of packet-switched computer networks, this paper proposes a new variable-bitrate subband speech coding scheme that combines CELP (Code Excited Linear Predictive Coding), vector quantization, and wavelet decomposition techniques. This CELP-based subband coding scheme provides a flexible variable-bitrate speech coding method well suited to packet-switched networks. It allows the switch nodes to actively regulate or control the transmission bitrate of speech within a large, flexible range. Extensive speech coding experiments show satisfactory results.
In this paper, we first analyse current trends in spoken language and speech learning. We then investigate the problems of applying speech technology to language learning and the key techniques to be used. Based on the features of spoken Chinese, we propose rules and methods for designing and implementing a Computer Assisted Language Learning (CALL) system for learning Chinese.
A multiple regression model was used to study the phoneme and syllable duration characteristics of Mandarin Chinese. The source speech material is a phonetically balanced text corpus collected from newspapers and spoken by a professional female announcer. Since the syllable, in an Initial/Final format, was adopted as the basic synthesis unit in our Chinese TTS system, the investigation was carried out on both an Initial/Final and a syllable basis. The RMS errors of the model are 18.6, 36.9, and 43.1 ms for Initial, Final, and syllable, respectively. These results are quite close to those reported in the literature, which may use different approaches such as neural networks. An interesting finding in the multiple regression model is that the effect of the following syllable is much larger than that of the preceding syllable. This finding is examined further by focusing on two-syllable words in the utterances. Informal listening tests confirmed that this approach improves the naturalness of synthetic speech compared with our previous rule-based duration model.
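How a multiple regression duration model is fitted and scored by RMS error can be sketched as follows. This is an illustration under stated assumptions, not the model from the paper: the three binary factors are hypothetical, and the durations are synthetic, noise-free toy values, so the RMS error here comes out near zero, unlike on real speech data.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(features, durations):
    """Least-squares fit via the normal equations; coefficient 0 is the intercept."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * d for r, d in zip(rows, durations)) for i in range(k)]
    return solve(xtx, xty)


def predict(coeffs, feature):
    """Predicted duration in ms for one feature vector."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], feature))


def rms_error(coeffs, features, durations):
    """Root-mean-square difference between predicted and observed durations."""
    squared = [(predict(coeffs, f) - d) ** 2 for f, d in zip(features, durations)]
    return (sum(squared) / len(squared)) ** 0.5


# Hypothetical binary factors:
# (phrase-final, preceding syllable has a coda, following syllable starts with a stop)
FEATURES = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
# Synthetic durations in ms, generated as 200 + 40*a - 5*b + 20*c.
DURATIONS = [200.0, 240.0, 195.0, 220.0, 235.0, 260.0, 215.0]

coeffs = fit(FEATURES, DURATIONS)
print("coefficients (ms):", [round(c, 3) for c in coeffs])
print("RMS error (ms):", round(rms_error(coeffs, FEATURES, DURATIONS), 6))
```

Comparing the magnitudes of the fitted coefficients is one simple way to compare the influence of factors such as the preceding and following syllable.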
In this paper, a method with sentence-wide optimization is proposed to generate the pitch contour of a Mandarin sentence. The model is called the sentence pitch-contour HMM (SPC-HMM) because it uses VQ (vector quantization) and an HMM (hidden Markov model). To construct an SPC-HMM, the pitch contours of the syllables from each training sentence are first normalized in both time and pitch height. The pitch-height normalization method is newly developed here and is effective. After normalization, the pitch contour of each training syllable is vector quantized. Then, the quantization codes and lexical tones of adjacent syllables are combined to define the observation symbol sequences for HMM training. In the synthesis phase, given a sentence and its relevant text-analysis information, the most probable observation sequence is generated by finding the sentence-wide largest-probability path with a dynamic-programming based algorithm. We conducted perception tests. The speech synthesized using the sentence pitch contour generated by our method was found to be slightly better than that uttered by an ordinary speaker. In addition, the comprehensibility of the synthesized speech is also improved.
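The sentence-wide search for a largest-probability path by dynamic programming can be sketched in a simplified form. This is not the SPC-HMM itself: the sketch assumes a hypothetical first-order model in which each syllable receives one pitch-contour code, scored by an assumed probability of the code given the lexical tone and an assumed code-to-code transition probability. The two codes and all probability values are made up for illustration.

```python
import math


def best_code_sequence(tones, code_given_tone, code_transition):
    """Return the code sequence with the largest total log-probability."""
    codes = sorted(code_transition)
    # score[c]: best log-probability of any path ending in code c so far.
    score = {c: math.log(code_given_tone[tones[0]][c]) for c in codes}
    backpointers = []
    for tone in tones[1:]:
        new_score, pointers = {}, {}
        for c in codes:
            prev = max(codes, key=lambda p: score[p] + math.log(code_transition[p][c]))
            new_score[c] = (
                score[prev]
                + math.log(code_transition[prev][c])
                + math.log(code_given_tone[tone][c])
            )
            pointers[c] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final code to recover the whole path.
    last = max(codes, key=lambda c: score[c])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical codes "H" (high contour) and "L" (low contour); toy probabilities.
CODE_GIVEN_TONE = {
    1: {"H": 0.9, "L": 0.1},
    3: {"H": 0.2, "L": 0.8},
}
CODE_TRANSITION = {
    "H": {"H": 0.6, "L": 0.4},
    "L": {"H": 0.3, "L": 0.7},
}

path, log_prob = best_code_sequence([1, 3, 3], CODE_GIVEN_TONE, CODE_TRANSITION)
print(path, math.exp(log_prob))
```

Because the best predecessor is stored for every code at every syllable, the choice for each syllable depends on the whole sentence and not only on its immediate neighbours.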
Tone is an indispensable component of tonal languages such as Chinese and other Asian languages. This paper presents a comprehensive study of the importance of lexical tones in two Chinese dialects, Mandarin and Cantonese, in large-vocabulary continuous speech recognition (LVCSR) tasks. The improvement in recognition after incorporating tone information is examined for different tone accuracies. It is shown that in the best scenario, when searching a syllable lattice for the character sequence with perfect tone information, an improvement in accuracy of 11.28% and 11.09% is achievable for Mandarin and Cantonese, respectively. There is also an improvement of around 8.5% when searching a perfect syllable sequence for characters using perfect tone information.
This paper presents a corpus-based approach to Cantonese text-to-speech synthesis. We make use of a large corpus of recordings of radio broadcast news. An acoustic inventory is built from speech segments extracted from this corpus. The extracted units are non-uniform in their linguistic lengths. More precisely, they include lexical words and monosyllables with tones. The acoustic units are labeled with a set of linguistic attributes that mainly describe their phonetic and prosodic context. Speech synthesis is performed by simple concatenation of the best-matching units available, without any kind of signal modification. The results of a subjective listening test on a preliminary implementation of the proposed method are reported.
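Selecting best-matching units from an inventory of words and tonal monosyllables can be sketched as follows. This is a rough illustration, not the paper's selection procedure: the inventory layout, the unit identifiers, the single `position` attribute, and the rule "use a whole-word unit if one exists, otherwise fall back to its syllables" are all assumptions, and the same context is applied to every syllable of a word for simplicity.

```python
def context_match(unit_context, target_context):
    """Count how many target attributes the candidate unit matches."""
    return sum(1 for key, value in target_context.items() if unit_context.get(key) == value)


def select_units(target, inventory):
    """Pick one inventory unit per word, or per syllable when the word is missing.

    target: list of (word_label, [syllable_label, ...], context_dict)
    inventory: {label: [{"id": ..., "ctx": {...}}, ...]}
    Raises KeyError if neither the word nor one of its syllables is available.
    """
    chosen = []
    for word, syllables, context in target:
        labels = [word] if word in inventory else syllables
        for label in labels:
            candidates = inventory[label]
            best = max(candidates, key=lambda unit: context_match(unit["ctx"], context))
            chosen.append(best["id"])
    return chosen


# Hypothetical inventory with one word unit and some tonal monosyllables.
INVENTORY = {
    "hoeng1gong2": [
        {"id": "w1", "ctx": {"position": "initial"}},
        {"id": "w2", "ctx": {"position": "final"}},
    ],
    "san1": [
        {"id": "s1", "ctx": {"position": "medial"}},
        {"id": "s2", "ctx": {"position": "final"}},
    ],
    "man4": [
        {"id": "s3", "ctx": {"position": "final"}},
    ],
}

TARGET = [
    ("hoeng1gong2", ["hoeng1", "gong2"], {"position": "initial"}),
    ("san1man4", ["san1", "man4"], {"position": "final"}),
]

selected = select_units(TARGET, INVENTORY)
print(selected)
```

The selected unit identifiers would then be concatenated in order, with no signal modification.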
Prosodic pattern generation and prediction are important for synthesizing natural-sounding speech from Chinese input text. In this paper, typical pitch models are first clustered from a large database of natural speech. We then propose several methods for linguistic feature selection, including a rough set method and a Bayesian relief network, whose output can be used directly to predict pitch, energy, and duration patterns. The two methods are compared and, to overcome the disadvantages of each, their results are combined and the most important features are coded into the Bayesian relief network first. After learning, experiments show that the F0 model predicted from the selected features matches the original one for most pitches.