ISCA Archive

International Symposium on Chinese Spoken Language Processing (ISCSLP 2000)

Fragrant Hill Hotel, Beijing
October 13-15, 2000

Session Poster A2


Acoustic Level Error Analysis in Continuous Speech Recognition

Authors: Chunhua LUO, Mingxing XU, Fang ZHENG
Affiliation: Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems,
Department of Computer Science & Technology, Tsinghua University, Beijing
Mailto: lijing@sp.cs.tsinghua.edu.cn

ABSTRACT

In this paper, we present a detailed analysis of the errors that may occur in a continuous speech recognition system, and define two sets of judging rules to perform the error analysis. Using these rules, we can efficiently find the most important factors that influence the performance of our speech recognition system and determine how to improve it. The experimental results show that the judging rules are able to identify the types of errors in our system, and they are also consistent with conclusions drawn from other experiments.

Page 203


Tone Recognition of Chinese Continuous Speech

Authors: Guoliang ZHANG, Fang ZHENG, Wenhu WU
Affiliation: Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems,
Department of Computer Science & Technology, Tsinghua University, Beijing
Mailto: liang@sp.cs.tsinghua.edu.cn

ABSTRACT

In this paper our approach to lexical tone recognition of Chinese continuous speech is presented. The Mixed Gaussian Continuous Probability Model (MGCPM) [1] is used for tone modeling, and a quadratic curve is adopted to approximate the fundamental frequency (F0) contour; its three coefficients are calculated and taken as the features of the tone models. Tone variation in continuous Chinese speech is an issue that must be faced in tone modeling. There are two kinds of tone variation: a change from the canonical tone to a non-canonical one that preserves the pitch trend, and a change from one tone to a different tone. In order to reduce the negative influence of tone variation, an iterative method is proposed to identify the syllables whose tones vary and remove them from the training data, and the Tone Variety Matrix (TVM) is then introduced to improve the performance of the tone models. Experiments have been carried out on the continuous Chinese speech database named the "863" database. The top-1 and top-2 accuracies of the baseline MGCPM are 67% and 90%, while those of the MGCPM incorporating the TVM are 70% and 92%.
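
As a rough illustration of the feature extraction described above, a quadratic curve can be fitted to an F0 contour and its three coefficients taken as tone features; the time normalization and function names below are assumptions for illustration, not details taken from the paper.

    import numpy as np

    def tone_features(f0_contour):
        """Fit a quadratic curve to an F0 contour and return its three
        coefficients as tone features (illustrative sketch only)."""
        f0 = np.asarray(f0_contour, dtype=float)
        t = np.linspace(0.0, 1.0, len(f0))        # normalized time axis (assumption)
        a, b, c = np.polyfit(t, f0, deg=2)        # F0(t) ~= a*t^2 + b*t + c
        return a, b, c

    # Example: a falling-then-rising contour (Tone-3-like shape)
    print(tone_features([220, 200, 185, 190, 210]))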

Page 207


A Noise Cancellation Method Based on Wavelet Transform

Authors: Dali YANG, Mingxing XU, Wenhu WU, Fang ZHENG
Affiliation: Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems,
Department of Computer Science & Technology, Tsinghua University, Beijing
Mailto: ydl@sp.cs.tsinghua.edu.cn

ABSTRACT

In this paper, we present a frequency band threshold (FBT) noise cancellation method based on the wavelet transform. Noise cancellation can improve the articulation of speech. Although the edge information of speech is very important for a recognition system, most traditional noise cancellation methods based on spectrum analysis smooth away these edges of the original speech. We aim for a noise cancellation method that preserves this edge information. Since edge detection based on the wavelet transform performs very well, we adopt the wavelet transform for noise cancellation. Wavelet-based noise cancellation methods have been reported in [1][2]. The method given in [1] is not real-time and is therefore difficult to use in a practical system. Although the real-time property of the method in [2] is good, its aural performance is deficient: it uses a single threshold (ST) and ignores the differences among frequency bands. The FBT method presented in this paper has two characteristics: (1) the thresholds depend on the frequency bands; (2) the thresholds are self-adjusting. Based on two judgement criteria, signal-to-noise ratio (an objective criterion) and articulation of the speech (a subjective criterion), we conducted comparison experiments between FBT and ST. Although FBT's signal-to-noise ratio is inferior to ST's, FBT's waveform distortion is smaller and its articulation is remarkably superior. We analyze the causes of these phenomena in detail and also compare the two methods on the same speech recognition system. The conclusion is that FBT is superior to ST.
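
The following is a minimal sketch of frequency-band-dependent wavelet thresholding in the spirit of FBT; the wavelet, decomposition depth, and the MAD-based self-adjusting threshold rule below are assumptions, since the abstract does not specify them.

    import numpy as np
    import pywt

    def fbt_denoise(signal, wavelet="db4", level=4):
        """Wavelet denoising with a separate, data-driven threshold per
        frequency band (sketch; not the authors' exact FBT rule)."""
        coeffs = pywt.wavedec(signal, wavelet, level=level)
        out = [coeffs[0]]                                   # keep approximation band
        for band in coeffs[1:]:                             # detail bands = frequency bands
            sigma = np.median(np.abs(band)) / 0.6745        # per-band noise estimate (MAD)
            thr = sigma * np.sqrt(2 * np.log(len(band)))    # band-dependent threshold
            out.append(pywt.threshold(band, thr, mode="soft"))
        return pywt.waverec(out, wavelet)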

Page 211


Primary Research on The Viseme System in Standard Chinese

Authors: Anhong WANG, Huaiqiao BAO, Jiayou CHEN
Affiliation: Dept. of Chinese Language & Literature, Peking University, Beijing
Speech Laboratory, Institute of Nationality, CASS, Beijing
Mailto: Wang.ah@fm365.com
hqbao@fm365.com
chenjy@nation.cass.net.cn

ABSTRACT

The study of traditional phonetics indicates that the shape of the lips has an important effect on the articulation of consonants and vowels [1]. AVSP (Audio-Visual Speech Processing) can improve the naturalness of synthesized speech and the recognition rate of speech recognition systems. Especially in computer-synthesized faces, the movements of the lip shape play a crucial role. The present research aims to establish a system of lip-shape variation corresponding to the phoneme system of Standard Chinese; the new term "viseme system" is given to it. A small-scale visual speech database was first created, and the viseme system of Standard Chinese was then derived from this database through a series of statistical methods.

Page 215


Speech Interactive Web Page Design and Implementation Based on Agent

Authors: Lin DONG, Biqin LIN, Bao-Zong YUAN
Affiliation: Institute of Information Science, North Jiaotong University, Beijing
Mailto: dong_lin@126.com

ABSTRACT

As research on speech recognition and speech synthesis progresses, speech applications on the Internet are becoming more and more widely used. Speech technology has been applied on the Internet in forms such as voice browsers for English and other languages, voice mail, and speech-interactive web pages. In this paper, the design and implementation of a speech-interactive HTTP web page using Agent technology is presented: a speech interface is added to the functions of a normal HTTP web page, making it possible for people to access the Internet by speech. The speech-interactive web page can be accessed not only through a normal GUI browser, but also through PDAs, mobile phones, and other devices that have no keyboard. A speech-interactive web page can release information such as weather forecasts, stock quotes, and traffic reports, helping users acquire information by speech.

Page 219


Keyword Spotting in Auto-Attendant System

Authors: Qing GUO, Yonghong YAN, Zhiwei LIN, Baosheng YUAN, Qingwei ZHAO, Jian LIU
Affiliation: Intel China Research Center
Mailto: baosheng.yuan@intel.com

ABSTRACT

In this paper, an auto-attendant system using a finite state grammar (FSG) based on a continuous speech recognition (CSR) model is introduced. In addition, by using two virtual garbage models, one matching the leading extraneous speech before the key name and the other matching the trailing extraneous speech following it, we obtain a more flexible and robust auto-attendant system. The experimental results show that, in our auto-attendant system (about 240 names), on the name-only test set and on sentence test set 1, which consists of sentences the FSG can recognize, the recognition rate of the keyword spotting system is almost the same as that of the FSG. On sentence test set 2, which consists of sentences undefined in the FSG, the keyword spotting system outperforms the FSG system remarkably. Without affecting the recognition accuracy on the name-only test set and sentence test set 1, task-dependent keyword models cut off an additional 20% of the error rate on sentence test set 2 compared with task-independent keyword models.
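
A toy sketch of the recognition network implied above, with optional garbage models absorbing the leading and trailing extraneous speech around a key name; the state names, directory entries, and transition encoding are purely illustrative, not the authors' grammar.

    # Each state maps to a list of (label consumed, next state) arcs.
    # "GARBAGE" stands for a filler acoustic model; None is an epsilon arc.
    NAMES = ["Zhang San", "Li Si"]          # illustrative entries only

    grammar = {
        "START": [("GARBAGE", "PRE")] + [(n, "POST") for n in NAMES],
        "PRE":   [(n, "POST") for n in NAMES],
        "POST":  [("GARBAGE", "END"), (None, "END")],   # trailing garbage is optional
        "END":   [],
    }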

Page 223


Duration Modeling in Mandarin Connected Digit Recognition

Authors: Gang PENG, Bo ZHANG, William S-Y. WANG
Affiliation: Department of Electronic Engineering, City University of Hong Kong, Hong Kong
Mailto: gpeng@ee.cityu.edu.hk

ABSTRACT

Digit string recognition is required in many applications that need to recognize numbers such as telephone numbers, credit card numbers, dates, etc. In order to design a high-performance recognizer, duration information is explored in this study. In a Mandarin connected digit recognizer, insertion and deletion errors amount to more than two thirds of the total recognition errors because there exist two mono-phonemic digits and a heavily rhotacized vowel. In order to use duration information more efficiently, we propose a method to model context-dependent word duration information and then incorporate it directly into the decoding algorithm. Experimental results show that this method reduces the word error rate by as much as 32.1%.
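
One common way to fold word-duration information directly into decoding, along the lines described above, is to add a weighted duration log-probability to each word-end hypothesis; the Gaussian duration model, the weight, and the names below are assumptions for illustration only.

    import math

    def duration_penalty(duration_frames, mean, std, weight=1.0):
        """Log-score contribution of a context-dependent word duration model,
        here assumed Gaussian (sketch only)."""
        log_p = (-0.5 * ((duration_frames - mean) / std) ** 2
                 - math.log(std * math.sqrt(2 * math.pi)))
        return weight * log_p

    # During decoding, a word-end hypothesis score would then be updated as:
    #   total_score = acoustic_score + lm_score + duration_penalty(d, mean_cd, std_cd)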

Page 227


A Divergence-based Model Separation

Authors: Chao-Shih HUANG, Hsiao-Chuan WANG
Affiliation: Philips Research East Asia, Taipei
Department of Electrical Engineering, National Tsing Hua University, Hsinchu
Mailto: joseph.huang@philips.com
hcwang@ee.nthu.edu.tw

ABSTRACT

In this paper, a divergence-based training algorithm is proposed for model separation, where the relative divergence between models is derived from Kullback-Leibler (KL) information. We attempt to improve the discriminative power of existing models when environment-matched training data is not available. The method can be applied to improve model discrimination after a model-based compensation technique has been performed for robust speech recognition. Traditionally, model training is data driven, as in maximum likelihood (ML) estimation or discriminative training. Compared with ML training, the minimum classification error (MCE) objective used in discriminative training leads to a significant gain in accuracy. We attempt to improve model discrimination based on an approximate classification error analysis, the relative divergence. We found that the smaller the relative divergence, the greater the discriminative power of the two models. In the proposed algorithm, we directly obtain the discriminant function for model training from the relative divergence, so the model parameters can be adjusted based on the minimum relative divergence. Experimental results demonstrate that the divergence-based model separation method achieves better recognition performance.
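
For reference, the KL information on which the relative divergence is built has a closed form for Gaussian densities; the sketch below gives the standard formula for diagonal-covariance Gaussians and is not the paper's exact relative-divergence definition.

    import numpy as np

    def kl_diag_gaussian(mu0, var0, mu1, var1):
        """KL( N(mu0, var0) || N(mu1, var1) ) for diagonal covariances."""
        mu0, var0 = np.asarray(mu0, float), np.asarray(var0, float)
        mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
        return 0.5 * np.sum(np.log(var1 / var0)
                            + (var0 + (mu0 - mu1) ** 2) / var1
                            - 1.0)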

Page 231


An Adaptive Information Retrieval System Based on Fuzzy Set

Authors: Shan GAO, Bo XU, TaiYi HUANG, ChengQing ZHONG
Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese
Academy of Sciences, Beijing
Mailto: Sgao@nlpr.ia.ac.cn

ABSTRACT

The advent of the World Wide Web has increased the importance of Information Retrieval. Retrieval strategies assign a measure of similarity between a query and a document, usually based on the notion that the more often terms are found in both the document and the query, the more "relevant" the document is deemed to be to the query [1]. However, retrieving relevant information from extremely large document collections is not easy. This paper describes a new approach to adaptive information retrieval based on fuzzy sets. A system applying this approach can retrieve relevant documents from the document collection according to the topic of a user's query.
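
A minimal sketch of a fuzzy-set retrieval score along these lines; the term membership function and the max aggregation below are assumptions, since the abstract does not give the exact fuzzy model.

    def membership(term, doc_tokens):
        """Fuzzy degree to which a document 'contains' a term (assumed tf-based)."""
        tf = doc_tokens.count(term)
        return tf / (tf + 1.0)          # saturating membership in [0, 1)

    def fuzzy_score(query_terms, doc_tokens):
        """Query treated as a fuzzy union (max) of its terms."""
        return max((membership(t, doc_tokens) for t in query_terms), default=0.0)

    doc = "the stock market report covers stock prices".split()
    print(fuzzy_score(["stock", "weather"], doc))   # about 0.67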

Page 235


A Time-domain Female-male Voice Conversion Algorithm

Authors: Li LIU, Tiecheng YU
Affiliation: Speech Processing Laboratory, Institute of Acoustics, Chinese Academy of Sciences, Beijing
Mailto: ll@speech1.ioa.ac.cn

ABSTRACT

In this paper, we put forward a time-domain female-to-male voice conversion algorithm. The method focuses mainly on two acoustic features that are thought to be the most important to speech individuality: pitch frequency and formant frequencies. To change the pitch frequency, we remove or add low-amplitude parts of the speech signal within one pitch period. To change the formants, exploiting the relationship between zero-crossing rate and formants, and based on a semi-waveform vector database previously built while implementing a speech waveform encoding algorithm, we use DTW to find a semi-waveform vector in the database to substitute for the original semi-waveform. Experiments show that this algorithm is feasible. The average pitch frequency ratio of female speech to male speech is about 1.5, and the average formant frequency ratio of female to male is about 1.2. We also found that the converted male voice is better than the converted female voice.
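
A rough sketch of the time-domain pitch-raising step described above, in which the lowest-amplitude samples of each pitch period are discarded so that the period shortens and F0 rises; the period segmentation, the target ratio, and all names below are illustrative assumptions.

    import numpy as np

    def raise_pitch_period(period, ratio=1.5):
        """Shorten one pitch period by dropping its lowest-amplitude samples,
        raising F0 by roughly `ratio` (sketch of the time-domain idea only)."""
        period = np.asarray(period, dtype=float)
        keep = int(round(len(period) / ratio))            # number of samples to keep
        order = np.argsort(np.abs(period))                # indices sorted by amplitude
        keep_idx = np.sort(order[len(period) - keep:])    # keep the largest, in time order
        return period[keep_idx]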

Page 239


CCL eAttendant - An On-line Auto Attendant System

Authors: Szu-Chen JOU, Shih-Chieh CHIEN, Woei-Chyang SHIEH, Jau-Hung CHEN, Sen-Chia CHANG
Affiliation: Advanced Technology Center (ATC), Computers and Communication Research Laboratories (CCL)
Industrial Technology Research Institute (ITRI), HsinChu 310
Mailto: chang@itri.org.tw

ABSTRACT

In this paper, we present an on-line auto attendant system, CCL eAttendant, which has been employed on the CCL/ITRI telephone network since January 2000. This system is composed of speech recognition, text-to-speech, computer-telephony integration, and HTML data importer modules. It is based on WinTel architecture and is built on a Pentium-III PC with MS-Windows NT and a Dialogic D/41Esc telephony board. CCL eAttendant enables people to find CCL employees' extension numbers and forward calls by speech.

Page 243