ISCA Archive

International Symposium on Chinese Spoken Language Processing (ISCSLP 2000)

Fragrant Hill Hotel, Beijing
October 13-15, 2000

Session Poster B1


Word-class Stochastic Model in A Spoken Language Dialogue System

Authors: Pengju YAN, Fang ZHENG, Mingxing XU, Yinfei HUANG
Affiliation: Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems,
Department of Computer Science & Technology, Tsinghua University, Beijing
Mailto: yanpj@sp.cs.tsinghua.edu.cn
fzheng@sp.cs.tsinghua.edu.cn
xumx@sp.cs.tsinghua.edu.cn
huangyf@sp.cs.tsinghua.edu.cn

ABSTRACT

Spoken Language Understanding (SLU) is a key component of spoken dialogue systems. One popular SLU method uses a continuous speech recognizer in which Part-Of-Speech (POS) tagging is employed to determine the underlying word-class sequences. We present a Word-Class Stochastic Model (WCSM) that describes the temporal word/word-class sequences and fits into the standard paradigm of Hidden Markov Models (HMMs). The model is trained on a general-purpose, large-vocabulary, labeled corpus, which makes it comparatively easy to construct. We apply the model to a prototype dialogue system named EasyNav, where the use of domain-specific knowledge, i.e., semantically meaningful keywords, helps to increase the speed and accuracy of the POS tagging process.
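
As an illustration of HMM-based word-class tagging, the following is a minimal Viterbi decoding sketch. It is not the WCSM itself: the function name, the two toy word classes, and all probabilities are assumptions made up for the example.

```python
import math

def viterbi(words, classes, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely word-class sequence for `words` under an HMM."""
    def lp(p):
        # Log-probability with a small floor so unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[c] = (log-probability of the best path ending in class c, that path)
    best = {
        c: (lp(start_p.get(c, 0.0)) + lp(emit_p[c].get(words[0], 0.0)), [c])
        for c in classes
    }
    for word in words[1:]:
        new_best = {}
        for c in classes:
            # Pick the best predecessor class for c.
            score, path = max(
                ((best[p][0] + lp(trans_p[p].get(c, 0.0)), best[p][1]) for p in classes),
                key=lambda item: item[0],
            )
            new_best[c] = (score + lp(emit_p[c].get(word, 0.0)), path + [c])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Hypothetical toy model with two word classes.
CLASSES = ["VERB", "PLACE"]
START_P = {"VERB": 0.7, "PLACE": 0.3}
TRANS_P = {
    "VERB": {"VERB": 0.2, "PLACE": 0.8},
    "PLACE": {"VERB": 0.5, "PLACE": 0.5},
}
EMIT_P = {
    "VERB": {"go": 0.9, "library": 0.1},
    "PLACE": {"go": 0.1, "library": 0.9},
}

tags = viterbi(["go", "library"], CLASSES, START_P, TRANS_P, EMIT_P)
```

In a real system the start, transition, and emission probabilities would be estimated from the labeled corpus rather than written by hand.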

Page 141


EasyCmd: Navigation by Voice Commands

Authors: Yinfei HUANG, Fang ZHENG, Wenhu WU
Affiliation: Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems,
Department of Computer Science & Technology, Tsinghua University, Beijing
Mailto: hifi@sp.cs.tsinghua.edu.cn

ABSTRACT

In this paper we present a system named EasyCmd that provides voice navigation on the desktop of Microsoft Windows 9x. The speech recognition engine for EasyCmd is very similar to that of a dictation machine. The Statistical Knowledge Based Frame Synchronous Search algorithm (SKBFSS) and Word Search Tree (WST) technologies are applied for acoustic decoding. Recognition Score Gap (RSG) is used for rejection. We also describe the techniques of monitoring the system, collecting vocabulary, and simulating system operations, which are essential to enhancing the desktop with voice commands.
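
The abstract does not define how the Recognition Score Gap is computed. The sketch below assumes one plausible reading: the gap is the difference between the best and second-best hypothesis scores, and a result is rejected when that gap is too small. The function name, the score values, and the threshold are hypothetical.

```python
def accept_by_score_gap(scores, min_gap):
    """Return the best command, or None if it is rejected.

    scores: dict mapping command -> recognition score (higher is better).
    min_gap: assumed rejection threshold on (best score - second-best score).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        # Nothing to compare against, so accept the only hypothesis.
        return ranked[0][0]
    (best_cmd, best_score), (_, second_score) = ranked[0], ranked[1]
    return best_cmd if best_score - second_score >= min_gap else None

# Hypothetical log-likelihood scores for two utterances.
clear_case = accept_by_score_gap({"open": -120.0, "close": -150.0}, min_gap=10.0)
confusable_case = accept_by_score_gap({"open": -120.0, "close": -123.0}, min_gap=10.0)
```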

Page 145


Experiments and Analysis for Speaker Dependent Mandarin Syllable Recognition

Authors: Shuqing LI, Lei HE, Ditang FANG
Affiliation: Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems,
Department of Computer Science & Technology, Tsinghua University, Beijing
Mailto: fangdt@tsinghua.edu.cn

ABSTRACT

It is well known that the word accuracy of a speaker independent (SI) continuous speech recognition system is not good enough for many real-world applications because of the many interference factors in the speech signal: pronunciation variation across speakers, different kinds of environmental noise, and so on. Analyzing how each interference factor acts, and then eliminating its effect as far as possible through inverse processing, may therefore significantly improve recognizer performance. In this paper, we carry out a series of experiments to find out how much room inverse processing research leaves for improving the performance of a practical SI continuous speech recognizer. These experiments are conducted under ideal conditions, in which all such effects are avoided as far as possible. After presenting the experimental results with the corresponding analysis, we give some suggestions for future research.

Page 149


A Comparison Between Synthetic Speech and Natural Speech of Chinese

Authors: Shinan LU, Lin HE, Ge YU, Yongkang FENG, Juan LIU
Affiliation: Institute of Acoustics, Academia Sinica, Beijing
City University of Hong Kong, Hong Kong
Mailto: lusn@info.unet.net.cn
96420180@plink.cityu.edu.hk

ABSTRACT

In this paper, 15 synthetic Chinese sentences produced by four typical Chinese TTS systems are analyzed and compared with natural speech. The results reveal remarkable differences between natural and synthetic speech, including in temporal organization and intonation, which are the essential causes of the degraded naturalness of synthetic speech. The parser and prosody design are therefore emphasized in developing a new Chinese TTS system.

Page 153


A Robust Method Based on Likelihood Estimation for Speech Signal Detection

Authors: Shaoyan CHEN, Bo XU, Taiyi HUANG, Yintao YANG
Affiliation: Center of Space Science & Applied Research
Chinese Academy of Sciences, Beijing
National Laboratory of Pattern Recognition, Institute of Automation
Chinese Academy of Sciences, Beijing
Mailto: lijing@sp.cs.tsinghua.edu.cn

ABSTRACT

Speech signal detection has a variety of applications in speech communication, and many methods have been proposed for it. Most of these methods achieve very high detection accuracy for a reasonable false alarm probability in clean speech environments, but they become less reliable in noisy environments. Accurate detection of the speech signal remains very difficult in the presence of noise and interference. In this paper, we propose a method that uses the likelihood estimated from a noise model to detect the speech signal. We address how to train the noise model, how to use the likelihood to detect the speech signal, and how to use an on-line adaptation procedure to adapt the model parameters to a new noisy environment. We also present experimental results that demonstrate some of the properties and advantages of the method.
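
The three steps named in this abstract (train a noise model, threshold its likelihood, adapt on-line) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: the single-Gaussian model over frame log-energy, the exponential adaptation rule, the class and function names, and the threshold value are all chosen for the example.

```python
import math

class NoiseModel:
    """Single Gaussian over a scalar frame feature (assumed: log-energy)."""

    def __init__(self, noise_frames):
        # Train on frames known to contain only noise.
        n = len(noise_frames)
        self.mean = sum(noise_frames) / n
        self.var = max(sum((x - self.mean) ** 2 for x in noise_frames) / n, 1e-6)

    def log_likelihood(self, x):
        return -0.5 * (math.log(2 * math.pi * self.var) + (x - self.mean) ** 2 / self.var)

    def adapt(self, x, rate=0.05):
        # Exponentially weighted update of mean and variance with a new noise frame.
        delta = x - self.mean
        self.mean += rate * delta
        self.var = max((1 - rate) * (self.var + rate * delta * delta), 1e-6)

def detect_speech(frames, model, threshold):
    """Flag a frame as speech when the noise model explains it poorly."""
    flags = []
    for x in frames:
        is_speech = model.log_likelihood(x) < threshold
        if not is_speech:
            # Only frames judged to be noise are used to track the environment.
            model.adapt(x)
        flags.append(is_speech)
    return flags

# Hypothetical feature values: quiet frames near 1.0, one loud frame at 5.0.
model = NoiseModel([1.0, 1.2, 0.8, 1.1, 0.9])
flags = detect_speech([1.05, 5.0, 0.95], model, threshold=-10.0)
```

Adapting only on frames classified as noise keeps speech from contaminating the noise statistics, at the cost of slower tracking when the noise level jumps abruptly.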

Page 159


A New Approach for Incremental Speaker Adaptation

Authors: Yu WANG, Xiaoyan ZHU
Affiliation: State Key Laboratory of Intelligent Technology and Systems,
Department of Computer Science & Technology, Tsinghua University, Beijing
Mailto: zxy-dcs@tsinghua.edu.cn

ABSTRACT

This paper presents an approach for fast, incremental speaker adaptation based on the MAP algorithm with a simplified MLLR module, which is used to minimize the mismatches caused by different speaking environments and inherent speaker characteristics before MAP processing. The most important advantage of the new approach is that it not only adapts quickly from a few short utterances but is also more accurate, even in a noisy environment. Experimental results show that the new approach reduces the word error rate by 20.3% in a quiet environment and by 27.6% in a noisy environment.

Page 163


Develop Telephony Speech Recognition Systems for Real-world Application

Authors: Xiangdong ZHANG, Baosheng YUAN, Ying JIA, Lingyun TUO, Yonghong YAN
Affiliation: Intel China Research Center, Beijing
Mailto: Edward.zhang@intel.com

ABSTRACT

This paper introduces our initial effort in building a Mandarin acoustic model for a Chinese stock information retrieval system based on Intel's LVCSR framework [1][2]. To build a robust and accurate system, a number of experiments were conducted to find the optimal parameters at various levels, such as front-end features and phonetic transcription. We conducted comparison experiments to find the optimal bandwidth configuration for the telephony acoustic model in general. To build accurate task-specific models, we introduce a hybrid context-dependent modeling approach in which task-dependent and task-independent training data are treated differently. The experimental results on two task-specific applications show that the proposed modeling produces a significant WER reduction. The telephone corpora were collected at ICRC to improve robustness against both noise and channel effects.

Page 167


Noise-Robust Speech Recognition Based on Reliable Bands

Authors: Bo ZHANG, Gang PENG, William S-Y. WANG
Affiliation: DepartmentofElectronicEngineering,CityUniversityofHongKong,HongKong
Mailto: zhangbo@ee.cityu.edu.hk

ABSTRACT

Under noisy conditions, due to the redundancy of the speech signal, there are some spectral bands (Reliable Bands) whose local SNRs are high enough to be used effectively by a recognizer. Based on this, a novel, phonetically motivated Reliable Bands Guided similarity measure (RBG measure) is proposed. It has the following features. Firstly, for the reference spectrum, frequency bands that have larger absolute energy or sharper spectral peaks are marked as reliable bands. They are given more weight than the other bands in the definition of the RBG measure. Secondly, within each reliable band, the similarity between the formant positions and formant shapes of the test spectrum and the reference spectrum is explicitly modeled. Lastly, the measure can automatically emphasize spectral bands whose amplitudes change abruptly, which normally contain more reliable dynamic features of the speech signal. Both the RBG measure and the PMC method are tested on a speaker-independent, continuous Mandarin digit string recognition task under 15 noisy conditions. Noises are drawn from the NOISEX92 database. On average, the RBG measure scores 4.22% lower in word accuracy than the PMC method above 0 dB. However, it outperforms the PMC method by 8.82% at 0 dB. More importantly, the RBG measure does not rely on accurate end-point detection or accurate modeling of the background noise, which are difficult tasks in themselves. To further improve the performance of the RBG measure, we discuss the possibility of integrating findings from the Computational Auditory Scene Analysis (CASA) field into the current system. First, we review the theory of Auditory Scene Analysis, which was originally established by Bregman in 1990. We then discuss some computational models that have been proposed for separating an input sound mixture into different sound streams. Finally, we consider the possibility of integrating such models into the RBG measure.

Page 171


A Multilingual Spoken Dialog System

Authors: Yunbiao XU, Masahiro ARAKI, Yasuhisa NIIMI
Affiliation: Department of Electronics & Information Science,
Kyoto Institute of Technology, Matsugasaki, Sakyo-ku, Kyoto
Mailto: yunbiao@vox.dj.kit.ac.jp
araki@dj.kit.ac.jp
niimi@dj.kit.ac.jp

ABSTRACT

This paper briefly introduces MSDSKIT-1 (Multilingual Spoken Dialogue System Version 1.0, developed by Kyoto Institute of Technology), which currently integrates Japanese and Chinese. It is an upgraded version of SDSKIT-3 (a spoken dialogue system in Japanese). The system can provide services such as sightseeing introductions, traffic guidance, and hotel reservation. A user can also plan an itinerary under the guidance of the system. We regard a spoken dialogue system as an integrated system with a language-dependent speech interface and a language-independent dialogue controller. When designing a multilingual spoken dialogue system, the linguistic characteristics of each particular language must be considered carefully for the language-dependent interface, for example, the syntactic structural features for the language parser. Considerable effort has gone into upgrading SDSKIT-3 into a multilingual system (called MSDSKIT-1). This paper presents that effort in two aspects: (1) the Chinese speech recognizer and (2) the Chinese language parser.

Page 175


Selection of Different Language Model Using Dialogue State

Authors: Yong WANG, Jiang HAN, Jian LIU
Affiliation: Intel China Research Center
Mailto: yong.wang@intel.com
jiang.han@intel.com
jian.liu@intel.com

ABSTRACT

Domain-specific dialogue systems are an important and commercially practical application of speech recognition, and reducing the search space is very helpful for improving accuracy and shortening search time in speech recognition. Adequate use of dialogue-state-dependent language models in dialogue systems can reduce the search space greatly if a reasonable prediction of the dialogue states is feasible, and makes a dialogue system more robust in real practice. This paper presents a novel method of selecting among different rule-based sub-language-models based on dialogue states to reduce the search space: an adequate rule-based sub-language-model is selected at each conversation step according to the context. Experiments show that the method is simple and effective in improving accuracy and recognition speed, and that it will be very useful in small and medium task domains.
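
The idea of switching sub-language-models by dialogue state can be illustrated with a minimal sketch. It assumes each sub-model is just a set of allowed words and that the recognizer returns scored word hypotheses; the state names, vocabularies, and function name are hypothetical and much simpler than a real rule-based grammar.

```python
# Hypothetical sub-language-models: one small allowed vocabulary per dialogue state.
SUB_MODELS = {
    "ask_city": {"beijing", "shanghai", "guangzhou"},
    "ask_date": {"today", "tomorrow"},
    "confirm": {"yes", "no"},
}

def recognize_in_state(hypotheses, state):
    """Pick the best-scoring hypothesis that the active sub-model allows.

    hypotheses: list of (word, score) pairs, higher score is better.
    state: current dialogue state, used to select the sub-model.
    """
    allowed = SUB_MODELS[state]
    valid = [(word, score) for word, score in hypotheses if word in allowed]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[1])[0]

# The same acoustic hypotheses resolve differently depending on the state.
hyps = [("no", 0.6), ("tomorrow", 0.7)]
in_confirm = recognize_in_state(hyps, "confirm")
in_ask_date = recognize_in_state(hyps, "ask_date")
```

Here the restriction is applied after recognition for brevity; the search-space reduction described in the abstract would come from constraining the decoder itself with the selected sub-model.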

Page 179


The Analysis of Corpus Oriented Spoken Chinese Dialog Understanding

Authors: Yun ZHOU, Taiyi HUANG, Bing ZHAO
Affiliation: National Laboratory of Pattern Recognition,
Institute of Automation, Chinese Academy of Sciences, Beijing
Mailto: zhouyun@nlpr.ia.ac.cn
huang@nlpr.ia.ac.cn
bzhao@nlpr.ia.ac.cn

ABSTRACT

Issues such as dialog structure, dialog act analysis, and turn segmentation (that is, segmenting a turn into several sentences or utterance units) have not yet been successfully resolved, especially in spoken Chinese dialog. Our corpus consists of 94 telephone-recorded Chinese human-human dialogues (more than 3,000 turns) in the domain of room reservation. In this paper, we give some results of our analysis of the corpus. We concentrate on four important phenomena in spoken Chinese: sentence hyperbaton, sentence fragments, speech repairs, and cue phrases. We think these four phenomena are essential for turn segmentation as well as other problems.

Page 183


On Integrating Tonal Information Into Chinese Speech Recognition

Authors: Xia WANG, Yuan DONG, Juha Iso-Sipilä, Olli Viikki
Affiliation: Nokia (China) Research & Development Center, Beijing
Nokia Research Center, Speech and Audio Systems Laboratory, Tampere
Mailto: xia.s.wang@nokia.com
yuan.dong@nokia.com
juha.iso-sipila@nokia.com
olli.viikki@nokia.com

ABSTRACT

Reliable pitch detection is important in Chinese speech recognition since Chinese is a tonal language. In this paper, several approaches to integrating pitch information are investigated. In a noise-free environment, conventional pitch estimators work quite well. In adverse conditions, however, the robustness of pitch detection algorithms is still a challenging problem. Our experimental results show that using pitch information yields a performance improvement in a clean environment. However, a substantial degradation in recognition accuracy is observed in adverse conditions due to the noise sensitivity of pitch estimators. Our experimental results indicate that a front-end that extracts higher-order cepstral coefficients provides the best results when testing recognition performance in Chinese.

Page 187


Keyword Spotting By Searching The Syllable Lattices

Authors: Chia-Hsien LIN, Hsiao-Chuan WANG
Affiliation: Department of Electrical Engineering, National Tsing Hua University, Hsinchu
Mailto: hcwang@ee.nthu.edu.tw

ABSTRACT

This paper presents a keyword spotting method based on searching a syllable lattice structure. The Mandarin syllables are represented by initial-final models. Using one-stage dynamic programming, an utterance is converted into a sequence of top-N candidate syllables, which yields a syllable lattice structure for the input utterance. A vocabulary of predefined keywords is represented as a set of syllable sequences. By searching for the syllable sequences of the keywords in the syllable lattice structure, we can spot the keywords in the utterance. A ranking and scoring algorithm is proposed for searching the keywords. Utterance verification for non-keyword rejection is also implicitly included in the proposed algorithm.
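
A simplified sketch of searching keyword syllable sequences in a top-N lattice follows. It is not the ranking and scoring algorithm of the paper, which the abstract does not specify: the lattice is assumed to be a list of ranked candidate lists (one per syllable slot, best first), the score is assumed to be the average candidate rank, and the toneless pinyin data, the function name, and the threshold are hypothetical.

```python
def spot_keywords(lattice, keywords, max_avg_rank=2.0):
    """Find keywords whose syllable sequence appears in consecutive lattice slots.

    lattice: list of candidate lists, one per syllable slot, best candidate first.
    keywords: dict mapping keyword name -> list of syllables.
    Returns (name, start_slot, average_rank) tuples, best (lowest rank) first.
    """
    hits = []
    for name, syllables in keywords.items():
        for start in range(len(lattice) - len(syllables) + 1):
            ranks = []
            for offset, syllable in enumerate(syllables):
                slot = lattice[start + offset]
                if syllable not in slot:
                    break  # keyword cannot start at this slot
                ranks.append(slot.index(syllable) + 1)
            else:
                avg_rank = sum(ranks) / len(ranks)
                # A poor average rank is treated as a rejection.
                if avg_rank <= max_avg_rank:
                    hits.append((name, start, avg_rank))
    return sorted(hits, key=lambda hit: hit[2])

# Hypothetical top-3 lattice for a five-syllable utterance.
LATTICE = [
    ["wo", "wu", "e"],
    ["yao", "you", "jiao"],
    ["qu", "ju", "qi"],
    ["bei", "pei", "fei"],
    ["jin", "jing", "qing"],
]
KEYWORDS = {"beijing": ["bei", "jing"], "shanghai": ["shang", "hai"]}

hits = spot_keywords(LATTICE, KEYWORDS)
```

The rank threshold plays the role of a crude verification step: a keyword that only matches through low-ranked candidates is discarded rather than reported.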

Page 191


RCD Sub-syllable HMM Modeling By Decision Tree Clustering Using MAT-2000 Database

Authors: Yih-Ru WANG, Ke-Shu CHEN
Affiliation: Dept. of Communication Engineering, NCTU, Hsinchu
ATC/CCL Industrial Technology Research Institute, Chutung, Hsinchu
Mailto: yrwang@cc.nctu.edu.tw

ABSTRACT

In this paper, the decision tree clustering method, using two different similarity measures as model splitting criteria, is applied to continuous Mandarin speech recognition for training right-context-dependent (RCD) sub-syllable HMM models. Instead of using phone-like units, we adopt initial and final sub-syllables as the basic recognition units. A large telephone-speech database, MAT-2000, is used to test the training method. A recognition rate of 67.3% was obtained for a 500-sentence test set. Compared with the case of using context-independent (CI) models, a recognition rate improvement of 3.3% was achieved.

Page 195


Initial Experiments On Recognition of Internet-Accessible Compressed Mandarin Speech

Authors: Wei-ping HSIEH, Berlin CHEN, Kuan-ting CHEN, Hsin-ming WANG
Affiliation: Institute of Information Science, Academia Sinica, Taipei
Mailto: swp@iis.sinica.edu.tw
berlin@iis.sinica.edu.tw
kenneth@iis.sinica.edu.tw
whm@iis.sinica.edu.tw

ABSTRACT

Massive quantities of spoken audio are becoming available on the web. For example, many radio and television stations are now broadcasting Internet-accessible content. Automatic recognition of spoken audio that has been degraded by the compression schemes that enable the delivery of streaming audio over the Internet could be of great interest for indexing and retrieval purposes. Considering the characteristics and monosyllabic structure of the Chinese language, a syllable-based framework for retrieving Mandarin broadcast news has been investigated at Academia Sinica, Taipei. This paper reports on our initial experiments on recognition of Internet-accessible Mandarin broadcast news in two data types - RealAudio and TrueSpeech.

Page 199