ISCA Archive: International Symposium on Chinese Spoken Language Processing (ISCSLP 2000), Fragrant Hill Hotel, Beijing
ABSTRACT
In this paper, we present a detailed analysis of the errors that may occur in a continuous speech recognition system and define two sets of judgment rules for performing the error analysis. Using these rules, we can efficiently identify the most important factors that influence the performance of our speech recognition system and determine how to improve it. The experimental results show that the rules are able to identify the types of errors in our system. The results are also consistent with conclusions drawn from other experiments.
ABSTRACT
In this paper, we present our approach to lexical tone recognition in continuous Chinese speech. The Mixed Gaussian Continuous Probability Model (MGCPM) [1] is used for tone modeling, and a quadratic curve is used to approximate the fundamental frequency (F0) contour; its three coefficients are calculated and taken as the features of the tone models. Tone variation (tone variety) in continuous Chinese speech is an issue that tone modeling must face. There are two kinds of tone variation: a change from the canonical tone to a non-canonical one that preserves the pitch trend, and a change from one tone to a different tone. To reduce the negative influence of tone variation, an iterative method is proposed to identify the syllables that exhibit tone variation and remove them from the training data, and the Tone Variety Matrix (TVM) is then introduced to improve the performance of the tone models. Experiments were carried out on the continuous Chinese speech database known as the "863" database. The top-1 and top-2 accuracies of the baseline MGCPM are 67% and 90%, while those of the MGCPM incorporating the TVM are 70% and 92%.
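The F0 feature extraction described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the three coefficients are computed, so the least-squares criterion, the normalization of time to [0, 1], and the names `solve_linear` and `fit_quadratic` are all assumptions made for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_quadratic(f0):
    """Fit f0 ~ a*t^2 + b*t + c in the least-squares sense.

    Assumption: time is normalized to [0, 1] over the syllable so that
    coefficients are comparable across syllables of different length.
    Returns [a, b, c].
    """
    n = len(f0)
    if n < 3:
        raise ValueError("need at least three F0 samples")
    t = [i / (n - 1) for i in range(n)]
    s = [sum(ti ** k for ti in t) for k in range(5)]  # s[k] = sum of t^k
    A = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    rhs = [sum(y * ti ** 2 for ti, y in zip(t, f0)),
           sum(y * ti for ti, y in zip(t, f0)),
           sum(f0)]
    return solve_linear(A, rhs)
```

Under these assumptions, the returned triple `[a, b, c]` would play the role of the three-dimensional feature vector for one syllable; the actual feature definition in the paper may differ.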
ABSTRACT
In this paper, we present a noise cancellation method based on frequency band thresholds (FBT) in the wavelet transform domain. Noise cancellation can improve the articulation of speech. Although the edge information of the speech signal is very important to a recognition system, most traditional noise cancellation methods based on spectrum analysis smooth out the edges of the original speech. We seek a noise cancellation method that preserves this edge information. Because the wavelet transform is known to perform very well at edge detection, we use it for noise cancellation. Noise cancellation methods based on the wavelet transform are described in [1] and [2]. The method given in [1] is not real-time and is therefore difficult to use in a practical system. The method in [2] has excellent real-time behavior, but its aural quality is poor: it uses a single threshold (ST) and ignores the differences between frequency bands. The FBT method presented in this paper has two characteristics: (1) the thresholds depend on the frequency bands; (2) the thresholds are self-adjusting. Using two evaluation criteria, signal-to-noise ratio (an objective criterion) and the articulation of the speech (a subjective criterion), we carried out comparison experiments between FBT and ST. Although the signal-to-noise ratio of FBT is lower than that of ST, FBT produces less waveform distortion, and its speech articulation is markedly better than that of ST. We analyze the causes of these results in detail and compare the two methods on the same speech recognition system. We conclude that FBT is superior to ST.
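The contrast between a single threshold and band-dependent, self-adjusting thresholds can be sketched with generic wavelet shrinkage. This is not the FBT rule from the paper, which the abstract does not specify: the Haar wavelet, soft thresholding, and the per-band threshold (a median-based noise estimate combined with the universal threshold) are stand-ins chosen for illustration, and all function names are hypothetical.

```python
import math
from statistics import median


def haar_decompose(x, levels):
    """Return (approximation, details); details[0] is the finest band."""
    approx, details = list(x), []
    for _ in range(levels):
        if len(approx) % 2:
            raise ValueError("length must be divisible by 2**levels")
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(p - q) / math.sqrt(2) for p, q in pairs])
        approx = [(p + q) / math.sqrt(2) for p, q in pairs]
    return approx, details


def haar_reconstruct(approx, details):
    """Invert haar_decompose."""
    a = list(approx)
    for d in reversed(details):
        x = []
        for ai, di in zip(a, d):
            x.append((ai + di) / math.sqrt(2))
            x.append((ai - di) / math.sqrt(2))
        a = x
    return a


def soft_threshold(coeffs, thr):
    """Shrink each coefficient toward zero by thr."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]


def band_threshold(d):
    """Assumed self-adjusting rule: noise level estimated from this band only."""
    sigma = median(abs(c) for c in d) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(len(d)))


def denoise(x, levels, single_threshold=None):
    """Per-band thresholds by default; one shared threshold if given (ST-like)."""
    approx, details = haar_decompose(x, levels)
    cleaned = []
    for d in details:
        thr = band_threshold(d) if single_threshold is None else single_threshold
        cleaned.append(soft_threshold(d, thr))
    return haar_reconstruct(approx, cleaned)
```

Calling `denoise(x, levels)` lets each band pick its own threshold, whereas `denoise(x, levels, single_threshold=t)` applies the same `t` everywhere, which mirrors the ST baseline only in spirit.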
ABSTRACT
Traditional phonetics indicates that lip shape has an important effect on the articulation of consonants and vowels [1]. Audio-Visual Speech Processing (AVSP) can improve the naturalness of synthetic speech and the recognition rate of speech recognition systems. In computer-synthesized faces in particular, lip movements play a crucial role. The present research aims to establish a system of lip-shape contrasts for the phoneme system of Standard Chinese, which we refer to as a viseme system. A small-scale visual speech database was first created, and the viseme system of Standard Chinese was then derived from the database through a series of statistical methods.
ABSTRACT
With the progress of research on speech recognition and speech synthesis, speech applications on the Internet are becoming more and more widespread. Speech technology is used on the Internet in voice browsers for English and other languages, voice mail, speech-interactive web pages, and so on. This paper presents the design and implementation of a speech-interactive HTTP web page using agent technology: a speech interface is added to the functions of a normal HTTP web page, allowing people to access the Internet by speech. The speech-interactive web page can be accessed not only with a normal GUI browser but also with PDAs, mobile phones, and other devices that have no keyboard. Speech-interactive web pages can publish information of interest to people, such as weather forecasts, stock quotes, and traffic information, which users can then obtain by speech.
ABSTRACT
In this paper, an auto-attendant system using a finite state grammar (FSG) based on a continuous speech recognition (CSR) model is introduced. By adding two virtual garbage models, one to match the leading extraneous speech before the key name and the other to match the trailing extraneous speech after the key name, we obtain a more flexible and robust auto-attendant system. The experimental results show that, in our auto-attendant system (about 240 names), the recognition rate of the keyword spotting system is almost the same as that of the FSG on the name-only test set and on sentence test set 1, which consists of sentences that the FSG can recognize. On sentence test set 2, which consists of sentences not defined in the FSG, the keyword spotting system markedly outperforms the FSG system. Without affecting recognition accuracy on the name-only test set and sentence test set 1, task-dependent keyword models reduce the error rate on sentence test set 2 by an additional 20% compared with task-independent keyword models.
ABSTRACT
Digit string recognition is required in many applications that need to recognize numbers such as telephone numbers, credit card numbers, and dates. To design a high-performance recognizer, duration information is explored in this study. In a Mandarin connected-digit recognizer, insertion and deletion errors account for more than two thirds of the total recognition errors because there are two mono-phonemic digits and a heavily rhotacized vowel. To use duration information more efficiently, we propose a method that models context-dependent word duration information and incorporates it directly into the decoding algorithm. Experimental results show that this method reduces the word error rate by as much as 32.1%.
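One way to picture how a word duration model could enter a decoder is as an extra log-probability term added to each word hypothesis. The sketch below is only an illustration under assumed choices: a Gaussian duration model keyed by (left context, word), a fallback model for unseen contexts, and a scaling weight. None of these details, nor the function names or the example digit labels, come from the abstract.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def duration_score(word, left_context, n_frames, models, fallback):
    """Log-likelihood of a word lasting n_frames in the given left context.

    models maps (left_context, word) -> (mean, variance) in frames;
    fallback is the (mean, variance) used for unseen contexts.
    """
    mean, var = models.get((left_context, word), fallback)
    return log_gaussian(n_frames, mean, var)


def combined_score(acoustic_logprob, word, left_context, n_frames,
                   models, fallback, weight=1.0):
    """Acoustic score plus a weighted duration term (weight is an assumption)."""
    return acoustic_logprob + weight * duration_score(
        word, left_context, n_frames, models, fallback)
```

With such a term, a hypothesis that inserts an implausibly short digit is penalized relative to one with a typical duration, which is the kind of insertion and deletion error the abstract targets.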
ABSTRACT
In this paper, a divergence-based training algorithm is
proposed for model separation, where the relative divergence between models is derived
from Kullback-Leibler (KL) information. We attempt to improve the discriminative power of
existing models when environment-matched training data is not available. The algorithm can
be applied to improve model discrimination after a model-based compensation technique has
been performed for robust speech recognition. Traditionally, model training is data-driven,
using criteria such as maximum likelihood (ML) estimation or discriminative training.
Compared to ML training, the minimum classification error (MCE) objective in
discriminative training leads to a significant gain in accuracy. We attempt to improve
model discrimination based on an approximate measure of classification error, the relative
divergence. We found that the smaller the relative divergence is, the more discriminative
the two models are. In the proposed algorithm, we obtain the
discriminant function for model training directly from the relative divergence. Thus, the model
parameters can be adjusted according to a minimum relative divergence criterion. Experimental results
demonstrate that the divergence-based model separation method can achieve better
recognition performance.
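As background for the KL information mentioned in this abstract, the sketch below computes the standard KL divergence between two univariate Gaussians and its symmetric sum. It does not reproduce the paper's relative divergence, whose definition is not given in the abstract; the univariate Gaussian setting and the function names are assumptions for illustration.

```python
import math


def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for univariate Gaussians p = N(mu_p, var_p), q = N(mu_q, var_q)."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def symmetric_divergence(mu_p, var_p, mu_q, var_q):
    """Sum of the two directed KL divergences between p and q."""
    return (kl_gaussian(mu_p, var_p, mu_q, var_q)
            + kl_gaussian(mu_q, var_q, mu_p, var_p))
```

For two unit-variance Gaussians whose means differ by 1, `kl_gaussian` gives 0.5 in each direction, so the symmetric sum is 1.0.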
ABSTRACT
The advent of the World Wide Web has increased the
importance of information retrieval. Retrieval strategies assign a measure of similarity
between a
query and a document. The usual notion is that the more often terms are found in both
the document and the query, the more relevant the document is deemed to be to the
query [1]. However, retrieving relevant information from extremely large document
collections is not easy. This paper describes a new approach to adaptive information
retrieval based on fuzzy sets. A system applying this approach can retrieve relevant
documents from the document collection according to the topic of a user's query.
ABSTRACT
In this paper, we put forward a time-domain female-male
voice conversion algorithm. The method focuses on the two acoustic features that are
thought to be most important to speech individuality: pitch frequency and formant
frequencies. To change the pitch frequency, we remove or add low-amplitude parts of
the speech signal within one pitch period. To change the formants, we rely on the relationship
between zero-crossing rate and formants and on a semi-waveform vector database
that former students built while developing a speech waveform encoding algorithm:
we use dynamic time warping (DTW) to find a semi-waveform vector in the database to substitute for the
original semi-waveform. Experiments show that this algorithm
is feasible. The average pitch frequency ratio of female speech to male speech is about
1.5, and the average formant frequency ratio of female to male is about 1.2. We also
found that the converted male voice is better than the converted female voice.
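Two quantities from this abstract lend themselves to a short sketch: the zero-crossing rate of a frame, and the change in pitch-period length implied by a female-to-male pitch ratio of about 1.5. The function names, the sign convention for zero samples, and the rounding to whole samples are assumptions; the sketch does not implement the paper's waveform editing or the DTW lookup.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    if len(frame) < 2:
        return 0.0
    signs = [s >= 0 for s in frame]
    crossings = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    return crossings / (len(frame) - 1)


def converted_period_length(n_samples, pitch_ratio=1.5, to_male=True):
    """Target pitch-period length in samples after conversion.

    Lowering the pitch by pitch_ratio (female to male) lengthens the period
    by the same factor; raising it (male to female) shortens the period.
    """
    scaled = n_samples * pitch_ratio if to_male else n_samples / pitch_ratio
    return round(scaled)
```

For instance, a 40-sample female pitch period would map to a 60-sample male period at a ratio of 1.5, so 20 samples would need to be added within that period.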
ABSTRACT
In this paper, we present an on-line auto attendant system, CCL eAttendant, which has been employed on the CCL/ITRI telephone network since January 2000. This system is composed of speech recognition, text-to-speech, computer-telephony integration, and HTML data importer modules. It is based on WinTel architecture and is built on a Pentium-III PC with MS-Windows NT and a Dialogic D/41Esc telephony board. CCL eAttendant enables people to find CCL employees' extension numbers and forward calls by speech.