International Symposium on Chinese Spoken Language Processing (ISCSLP 2000)
Fragrant Hill Hotel, Beijing
This paper presents an effective and robust speech endpoint detection method based on the 1/f process technique, suitable for robust continuous speech recognition systems in variable noisy environments. The Gaussian 1/f process, a mathematical model from fractal theory for statistically self-similar random processes, is selected to model both speech and background noise. An optimal Bayesian two-class classifier is then developed to discriminate between real noisy speech and background noise using the wavelet coefficients, which have Karhunen-Loeve-type properties for 1/f processes. Finally, for robustness, a few templates are built for speech, and the parameters of the background noise are adapted dynamically at runtime to handle variation in both speech and noise. In our experiments, 10 minutes of speech with different types of noise was tested using this new endpoint detector, and a detection accuracy of over 90% was achieved.
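The two-class decision described above can be illustrated with a minimal sketch. It assumes, as a simplification not stated in the abstract, that the wavelet coefficients of each class are zero-mean, independent Gaussians whose variance depends only on the scale. The function names, the variance values, and the equal-prior default are hypothetical and are not taken from the paper.

```python
import math

def log_likelihood(coeffs_by_scale, var_by_scale):
    """Log-likelihood of wavelet coefficients under a zero-mean Gaussian
    model with one variance per scale (assumed independent coefficients)."""
    ll = 0.0
    for coeffs, var in zip(coeffs_by_scale, var_by_scale):
        for x in coeffs:
            ll += -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)
    return ll

def is_speech(coeffs_by_scale, speech_var, noise_var, prior_speech=0.5):
    """Bayes decision: choose speech when the log-likelihood ratio exceeds
    the log prior ratio (hypothetical helper, minimum-error rule)."""
    llr = (log_likelihood(coeffs_by_scale, speech_var)
           - log_likelihood(coeffs_by_scale, noise_var))
    threshold = math.log((1.0 - prior_speech) / prior_speech)
    return llr > threshold

# Illustrative per-scale variances (made-up numbers, two scales).
speech_var = [4.0, 2.0]
noise_var = [0.5, 0.25]

loud_frame = [[2.0, -2.5], [1.5, -1.0, 1.2, -1.4]]
quiet_frame = [[0.1, -0.2], [0.05, 0.0, -0.1, 0.1]]
```

In a detector of this kind, adapting to changing noise would amount to re-estimating `noise_var` from frames recently classified as background.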
This paper presents an online unsupervised learning algorithm to flexibly adapt speaker-independent (SI) hidden Markov models (HMMs) to a new speaker. We apply the quasi-Bayes (QB) estimate to incrementally obtain the word sequence and the adaptation parameters for adjusting the HMMs each time a block of unlabeled data is enrolled. Accordingly, the nonstationary statistics of varying speakers can be successively tracked using the newest enrollment data. To improve the QB estimate, we employ adaptive initial hyperparameters in the beginning session of online learning. These hyperparameters are estimated from a cluster of training speakers closest to the test speaker. Additionally, we develop a selection process to choose reliable parameters from a list of candidates for unsupervised learning, and explore a set of reliability assessment criteria. The experiments confirm the effectiveness of the proposed method and show that using adaptive initial hyperparameters in online learning and multiple assessments in parameter selection can improve speaker adaptation performance.
In this paper, a bottom-up integration structure to model tone influence at various levels is proposed. At the acoustic level, pitch is extracted as a continuous acoustic variable. At the phonetic level, we treat the main vowel with different tones as different phonemes. In the triphone building phase, we evaluate a set of questions about tone for each decision tree node. At the word level, a set of tone change rules is used to build the transcription for training data and the word lattice for decoding. At the sentence level, some sentence-ending words with light tone are added to the system vocabulary. Integration at these five levels experimentally reduces the word error rate from 9.9 to 7.8 on a Chinese continuous speech dictation task.
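As a rough illustration of the phonetic-level idea (the main vowel with different tones treated as different phonemes), the sketch below expands a numbered-tone pinyin syllable into phone labels. The phone inventory, the initial list, and the main-vowel rule (borrowed here from the usual pinyin tone-mark placement convention) are assumptions made for the example and are not the paper's actual phone set.

```python
# Hypothetical list of pinyin initials; two-letter initials come first so
# that "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]
VOWELS = set("aeiouv")  # "v" stands in for u-umlaut

def main_vowel_index(final):
    """Pick the main vowel using the tone-mark placement convention:
    'a' or 'e' if present, the 'o' in 'ou', otherwise the last vowel."""
    for v in "ae":
        if v in final:
            return final.index(v)
    if "ou" in final:
        return final.index("o")
    for i in range(len(final) - 1, -1, -1):
        if final[i] in VOWELS:
            return i
    raise ValueError(f"no vowel in final {final!r}")

def toned_phones(syllable):
    """Split e.g. 'zhuang1' into phones, attaching the tone to the main vowel."""
    base, tone = syllable[:-1], syllable[-1]
    if not tone.isdigit():
        raise ValueError("expected a trailing tone digit, e.g. 'ma3'")
    initial = next((i for i in INITIALS if base.startswith(i)), "")
    final = base[len(initial):]
    k = main_vowel_index(final)
    phones = [initial] if initial else []
    if final[:k]:
        phones.append(final[:k])       # medial, e.g. 'u' in 'uang'
    phones.append(final[k] + tone)     # tone-dependent main vowel
    if final[k + 1:]:
        phones.append(final[k + 1:])   # coda, e.g. 'ng'
    return phones
```

With this labeling, `ma1` and `ma3` share the initial `m` but use the distinct vowel phonemes `a1` and `a3`, so tone questions can later be asked about the vowel label during decision-tree clustering.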
In this paper we present a drawback of conventional approaches to n-gram estimation in Chinese natural language processing: the optimization of the n-gram parameters is independent of their discriminative capability. To address this problem, we propose a discriminative estimation criterion under which the n-gram parameters can be optimized. We implement this approach for the conversion of Chinese pinyin to Chinese characters, and conduct experiments on the tagged text corpus from Peking University. Experimental results show that the conversion rate can be raised remarkably, by up to 41.4%.
Syllable-to-word decoding plays a very important role in Chinese large vocabulary continuous speech recognition (LVCSR). However, the lack of word boundaries and other characteristics of the Chinese language hinder the development of high-quality language models and decoders. In this paper, we present a multi-path search algorithm with language model optimization and an automatic lexicon augmentation method to improve the accuracy of syllable-to-word decoding. The experimental results show that our method achieves a 34.76% character accuracy improvement over the baseline.
This paper describes a novel approach to compressing large trigram language models, which uses scalar quantization to compress log probabilities and back-off coefficients, and incremental coding to compress entry pointers. Experiments show that the new approach achieves a compression ratio roughly 2.5 times that of the well-known tree-bucket format while keeping perplexity and access speed almost unchanged. The high compression ratio enables our method to be used in various SLM-based applications, such as Pinyin input methods and dictation on handheld devices with little available memory.
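The two compression ideas named above can be sketched in a few lines. This is an illustration under assumptions, not the paper's format: the quantizer here is a uniform one with 256 levels (one byte per value), and "incremental coding" is interpreted as storing differences between consecutive, non-decreasing entry pointers. All function names are hypothetical.

```python
from bisect import bisect_left

def build_codebook(values, levels=256):
    """Uniform scalar quantizer: evenly spaced levels between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [lo]
    step = (hi - lo) / (levels - 1)
    return [lo + i * step for i in range(levels)]

def quantize(value, codebook):
    """Return the index of the codebook entry nearest to value."""
    i = bisect_left(codebook, value)
    if i == 0:
        return 0
    if i == len(codebook):
        return len(codebook) - 1
    # Choose the closer of the two neighbouring levels.
    return i if codebook[i] - value < value - codebook[i - 1] else i - 1

def delta_encode(pointers):
    """Store each pointer as its difference from the previous one."""
    deltas, prev = [], 0
    for p in pointers:
        deltas.append(p - prev)
        prev = p
    return deltas

def delta_decode(deltas):
    """Recover the original pointers by accumulating the differences."""
    pointers, total = [], 0
    for d in deltas:
        total += d
        pointers.append(total)
    return pointers

# Made-up log probabilities and entry pointers for illustration only.
log_probs = [-9.0, -4.2, -3.1, -2.75, -1.3, -0.1]
codebook = build_codebook(log_probs)
codes = [quantize(v, codebook) for v in log_probs]   # one byte each
pointers = [0, 3, 7, 7, 12]
deltas = delta_encode(pointers)                      # small integers
```

Each log probability is replaced by a one-byte index into the shared codebook, and the pointer differences are small enough to fit in fewer bits than the absolute offsets, which is where the space saving in such a scheme would come from.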