Session Poster A1


Intra-syllable Dependent Phonetic Modeling For Chinese Speech Recognition

Authors: Jiyong ZHANG, Fang ZHENG, Mingxing XU, Shuqing LI
Affiliation: Center of Speech Technology, State Key Laboratory
of Intelligent Technology and Systems, Department of Computer
Science & Technology, Tsinghua University, Beijing
Mailto: zjy@sp.cs.tsinghua.edu.cn

ABSTRACT

A novel acoustic modeling method for Chinese speech recognition based on an Intra-Syllable Dependent Phone (ISDP) set is proposed. The ISDP set extends the traditional phone set using intra-syllable information from Chinese phonetic knowledge. The acoustic models based on the ISDP set (ISDPMs) have the following features. First, they are suitable when only a rather small amount of training data is available. Second, the scheme integrates tri-phone modeling and syllable modeling. Mixed Gaussian densities are used to describe the feature space of each ISDP, and the Viterbi algorithm is adopted for decoding. In addition, an ISDP-syllable search tree is designed to reduce the decoding complexity. Our experimental results show that ISDP modeling is more flexible and faster than syllable modeling while causing little reduction in performance.
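
The abstract does not describe how the ISDP-syllable search tree is built. As a rough illustration of why a tree over sub-syllable units can reduce decoding work, the sketch below builds a generic prefix tree in which syllables that begin with the same units share nodes. The lexicon, the unit labels, and the function names are hypothetical and are not taken from the paper; the real ISDP set is context dependent and is not reproduced here.

```python
# Generic prefix-tree sketch; NOT the authors' ISDP-syllable search tree.
# The unit labels below are made-up placeholders, not the actual ISDP set.

def build_search_tree(lexicon):
    """Build a prefix tree (nested dicts) from {syllable: [unit, ...]}."""
    root = {}
    for syllable, units in lexicon.items():
        node = root
        for unit in units:
            # Reuse the child node if another syllable already created it.
            node = node.setdefault(unit, {})
        # Mark the syllables that end at this node.
        node.setdefault("#syllables", []).append(syllable)
    return root


def count_nodes(node):
    """Count unit nodes in the tree, ignoring the syllable markers."""
    return sum(
        1 + count_nodes(child)
        for key, child in node.items()
        if key != "#syllables"
    )


# Hypothetical toy lexicon: each syllable maps to a sequence of units.
lexicon = {
    "ba": ["b", "a"],
    "ban": ["b", "a", "n"],
    "bang": ["b", "a", "ng"],
    "ma": ["m", "a"],
}

tree = build_search_tree(lexicon)
flat_units = sum(len(units) for units in lexicon.values())
shared_units = count_nodes(tree)
print(flat_units, shared_units)
```

In this toy lexicon the three syllables starting with "b", "a" share their first two nodes, so a decoder walking the tree evaluates fewer unit models than one that scores every syllable separately.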

Page 73


Modeling of Three Types of Auditory Nerve and Its Application in Speech Recognition

Authors: Zhimin LIU, Xihong WU, Bin ZHEN, Huisheng CHI
Affiliation: Speech Group, National Key Laboratory on Machine Perception, Peking Univ., Beijing
Center for Information Science, Peking Univ., Beijing
Mailto: lzm@pubms.pku.edu.cn
wxh@cis.pku.edu.cn
zb@cis.pku.edu.cn
chi@pku.edu.cn

ABSTRACT

This paper describes a novel auditory nerve model that simulates the three types of auditory nerves existing in the auditory system. The model is motivated by the absence of any simulation of the different auditory nerve types in current auditory models. Building on previous work, three sub-models replace the single auditory nerve discharge model that prevails in common peripheral auditory models. Three auditory features were extracted and applied in speech recognition experiments. The results show that the three models have different noise-resistance properties and that the model with a large dynamic range is exceptionally robust in speech recognition.

Page 77


A Hierarchic Processing Model In Chinese TTS

Authors: Bo YIN, Ren-Hua WANG
Affiliation: Department of Electronic Engineering & Information Science,
University of Science & Technology of China, Hefei
Mailto: rhw@ustc.edu.cn

ABSTRACT

This paper proposes a hierarchical text processing model for Chinese TTS systems and defines a corresponding hierarchical labeling system. The implementation of the hierarchical processing is described in detail, with particular attention to the processing strategy at the sub-phrase layer.

Page 81


A Promising Syllable Decomposition Method for Tone Languages' Speech Recognition

Authors: Haiping LI, Liqin SHEN, GuoKang FU, C.J. CHEN
Affiliation: IBM China Research Lab, Beijing
IBM Thomas J. Watson Research Center, Yorktown Heights, N.Y.
Mailto: lihp@cn.ibm.com
shenlq@cn.ibm.com
fugk@cn.ibm.com
juchen@us.ibm.com

ABSTRACT

This paper proposes a new syllable decomposition method that uses the tone information of the main vowel in a syllable to distinguish the tone of the whole syllable. Compared with the conventional scheme, in which a syllable is decomposed into an initial and a final and the tone information is carried by the final, the new scheme reduces the number of phonemes in the phone set of a recognition system. It successfully handles syllabic languages, especially those with complicated tonal phonology such as Cantonese, and can also be generalized to other tonal languages. Experiments on both Cantonese and Mandarin comparing systems built with the two schemes show that the new scheme achieves slightly better accuracy than the old one while dramatically reducing the number of phonemes in the recognition system. The method is therefore promising for use in practical speech recognition systems and products.

Page 85


Modeling And Decision Tree Based Prediction of Pitch Contour In IBM Mandarin Speech Synthesis System

Authors: Xiaochuan NIU, Liqin SHEN, Weibin ZHU, Qin SHI
Affiliation: IBM China Research Laboratory, Beijing
Mailto: niuxc@cn.ibm.com
shenlq@cn.ibm.com
zhuweib@cn.ibm.com
shiqin@cn.ibm.com

ABSTRACT

In this paper, a method of pitch contour modelling based on the hidden Markov model (HMM) states of an acoustic unit is presented. A pair of vectors is computed from the alignment of the speech data with the acoustic unit’s HMM states. The pitch contour feature of the acoustic unit is represented by the vector pair so that the variants of the acoustic unit’s pitch contour can be measured and compared. Using this model, pitch contour decision trees are constructed for phones in Mandarin from a single speaker’s continuous read-speech database. The trees are used in the Mandarin speech synthesis system, which is trained on the same database, to predict the pitch contour of a given phone according to its phone context. The naturalness of the synthesized Mandarin speech is greatly improved.

Page 89


Some Prosodic Properties of MAT Speech Database

Authors: Yueh-chin CHANG, Wan-ling CHANG, Guang-Hui SYU, Hsiao-Chuan WANG
Affiliation: Institute of Linguistics, National Tsing Hua University, Hsinchu
Department of Electrical Engineering, National Tsing Hua University, Hsinchu
Mailto: hcwng@ee.nthu.edu.tw

ABSTRACT

MAT (Mandarin speech data across Taiwan) is a telephone speech data collection project conducted in Taiwan from 1995 to 1998. Over 7000 speakers provided speech data through the public telephone system. Its outcome is a series of MAT databases. The plentiful speech data in the MAT databases are valuable material for studying the properties of Mandarin as spoken in Taiwan. Some particular properties would be of interest to linguists and also useful for identifying the accent of Taiwanese speakers. In this paper, several prosodic features in the MAT databases are investigated: the stress patterns of disyllabic words, the intensity and duration of syllable finals, and the pitch patterns of lexical tones.

Page 93


Sub-Syllabic Acoustic Modeling Across Chinese Dialects

Authors: Wai-Kit LO, Helen M. MENG, P.C. CHING
Affiliation: Digital Signal Processing Laboratory, Dept. of Electronic Engineering,
Human-Computer Communications Laboratory, Dept. of Systems Engineering & Engineering Management,
The Chinese University of Hong Kong, Hong Kong
Mailto: wklo@ieee.org
hmmeng@se.cuhk.edu.hk
pcching@ee.cuhk.edu.hk

ABSTRACT

This paper presents a series of experiments on sub-syllabic unit selection across two Chinese dialects, Mandarin and Cantonese. Evaluations are based on syllable recognition using only acoustic information; no lexical knowledge is incorporated. We use a variety of sub-syllabic acoustic models, motivated by the phonological and linguistic structural characteristics of Chinese. Our results should provide a useful reference for work in large-vocabulary Chinese speech recognition, as well as related tasks, e.g. spoken document retrieval.

Page 97


Prosodic Alternative Units in a Mandarin Chinese Speech Synthesizer

Authors: Hongwei DING, Joerg HELBIG
Affiliation: Laboratory of Acoustics and Speech Communication, Dresden University of Technology, Dresden
Mailto: ding@eakss2.et.tu-dresden.de
helbig@eakss2.et.tu-dresden.de

ABSTRACT

The Mandarin Chinese synthesis component of the Dresden Speech Synthesizer DreSS is based on an inventory of syllabic units. The inventory contains all Chinese syllables with their possible tones in up to three phonetic variations for correct modeling of cross-syllable coarticulation effects. In order to improve the naturalness and fluency of the synthesized speech, the inventory was complemented with prosodic alternative units for non-accented syllables, especially for neutral tone particles. In this paper, two strategies for generating such units are compared: extraction from specially constructed carrier sentences and extraction from a read-speech corpus of newspaper texts. The results of a listening test show the best performance for the units from carrier sentences.

Page 101


A CART-Based Hierarchical Stochastic Model for Prosodic Phrasing in Chinese

Authors: Xipeng SHEN, Bo XU
Affiliation: National Laboratory of Pattern Recognition,
Institute of Automation, Chinese Academy of Sciences, Beijing
Mailto: xpshen@nlpr.ia.ac.cn
xubo@nlpr.ia.ac.cn

ABSTRACT

This work presents a CART-based stochastic model for predicting prosodic phrase breaks from Chinese input text. Almost all the features used in this model are obtained automatically. A novel and efficient algorithm, the LLW algorithm, is proposed. Experiments demonstrate a high success rate in predicting prosodic phrase breaks from input sentences with little syntactic information (81% success rate, 6.1% false rate).

Page 105


A Min-Nan Text-to-Speech System

Authors: Sin-Horng CHEN, Chen-Chung HO
Affiliation: Department of Communication Engineering, Chiao Tung University, Hsinchu
Mailto: schen@cc.nctu.edu.tw

ABSTRACT

This paper presents the implementation of a Min-Nan text-to-speech (TTS) system. The system follows the same design principle as a previously proposed Mandarin TTS system. It takes 877 base-syllables as basic synthesis units and uses a recurrent neural network (RNN) based prosody synthesizer to generate proper prosodic parameters for synthesizing natural output speech. It is implemented in software and runs in real time on a PC. An informal subjective listening test confirmed that the system performed well: all synthetic speech sounded good for well-tokenized texts and fair for texts with automatic tokenization.

Page 109


A Subband Speech Coding Scheme Based On Code Excited Linear Predictive Coding

Authors: Dongjian YUE, Peiqi CHAI
Affiliation: AI Laboratory, Department of Computer Science and Technology, Tongji University, Shanghai
Mailto: yuedjk@online.sh.cn

ABSTRACT

Considering the characteristics of packet-switched computer networks, this paper proposes a new variable-bitrate subband speech coding scheme that combines CELP (Code Excited Linear Predictive Coding), vector quantization, and wavelet decomposition techniques. This CELP-based subband scheme provides a flexible variable-bitrate speech coding method and is well suited to packet-switched networks. It allows the switching nodes to actively regulate or control the transmission bitrate of speech within a wide range. Numerous speech coding experiments show satisfactory results.

Page 113


The Features of Chinese Computer-aided Language Learning System

Authors: Dongjian YUE, Peiqi CHAI
Affiliation: AI Laboratory, Department of Computer Science and Technology, Tongji University, Shanghai
Mailto: yuedjk@online.sh.cn

ABSTRACT

In this paper, we first analyse current trends in spoken language and speech learning. We then investigate the problems of applying speech technology to language learning and the key techniques to be used. Based on the features of spoken Chinese, we propose rules and methods for designing and implementing a Computer Assisted Language Learning (CALL) system for learning Chinese.

Page 117


A Study of Phoneme and Syllable Duration Characteristics of Mandarin Chinese

Authors: Weizhong ZHU, Kenji MATSUI
Affiliation: Advanced Technology Research Laboratories, Matsushita Electric Industrial Co., Ltd., Kyoto
Mailto: zhu@crl.mei.co.jp
matsui@crl.mei.co.jp

ABSTRACT

A multiple regression model was used to study the phoneme and syllable duration characteristics of Mandarin Chinese. The source speech material is a phonetically balanced text corpus collected from newspapers and spoken by a professional female announcer. Since the syllable, in an Initial/Final format, was adopted as the basic synthesis unit in our Chinese TTS system, the investigations were carried out on both an Initial/Final and a syllable basis. RMS error values of the model are 18.6, 36.9 and 43.1 ms for Initial, Final, and syllable, respectively. The results are quite close to those reported in the literature, which may use different approaches, such as neural networks. An interesting finding in the multiple regression model is that the factor of the following syllable is much larger than that of the preceding syllable. This evidence is further discussed by focusing on two-syllable words in the utterances. From our informal listening tests, we confirmed that this approach improves the naturalness of synthetic speech compared with our previous rule-based duration model.
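
The RMS error figures quoted in this abstract measure how far predicted durations fall from observed ones, in milliseconds. A minimal sketch of that metric follows; the duration values are made up for illustration and are not data from the paper, and the function name is hypothetical.

```python
import math


def rms_error(observed_ms, predicted_ms):
    """Root-mean-square error between observed and predicted durations (ms)."""
    if len(observed_ms) != len(predicted_ms) or not observed_ms:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed_ms, predicted_ms)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up durations, not values from the paper.
observed = [100.0, 120.0, 90.0, 110.0]
predicted = [110.0, 110.0, 100.0, 100.0]
error = rms_error(observed, predicted)
print(error)
```

Because the metric squares each deviation before averaging, a few large misses raise it more than many small ones.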

Page 121


A Sentence-Pitch-Contour Generation Method Using VQ/HMM for Mandarin Text-to-speech

Authors: Hung-Yan GU, Chung-Chieh YANG
Affiliation: Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei
Mailto: root@guhy.ee.ntust.edu.tw

ABSTRACT

In this paper, a method with sentence-wide optimization is proposed to generate the pitch-contour of a Mandarin sentence. The developed model is called the sentence pitch-contour HMM (SPC-HMM) because it uses VQ (vector quantization) and HMM (hidden Markov model). To construct an SPC-HMM, the pitch-contours of the syllables from each training sentence are first normalized in both time and pitch-height. The method for pitch-height normalization is effective and newly developed here. After normalization, the pitch-contour of each training syllable is vector quantized. Then, the quantization codes and lexical tones of adjacent syllables are combined to define the observation symbol sequences for HMM training. In the synthesis phase, given a sentence and its relevant text-analysis information, the most probable observation sequence is generated by finding the sentence-wide largest-probability path with a dynamic-programming based algorithm. We conducted perception tests and found that the speech synthesized using the sentence pitch-contour generated by our method is slightly better than that uttered by an ordinary speaker. In addition, the comprehensibility of the synthesized speech is also improved.
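
The abstract gives no details of the dynamic-programming search. The sketch below shows only the general idea of choosing a sentence-wide best path through a lattice instead of picking the best candidate for each syllable independently. It is a generic best-path search, not the SPC-HMM itself; the labels `c1` and `c2` stand for hypothetical codebook entries, and all scores are made-up log-probability-like values.

```python
# Generic lattice best-path search; NOT the authors' SPC-HMM algorithm.
# Labels and scores are hypothetical placeholders.

def best_path(node_scores, trans_score):
    """Find the highest-scoring label sequence through a lattice.

    node_scores: list of {label: score} dicts, one per position.
    trans_score: function (prev_label, label) -> score.
    Returns (total_score, [label, ...]).
    """
    # best[label] = (score of the best path ending in label, that path)
    best = {lab: (score, [lab]) for lab, score in node_scores[0].items()}
    for column in node_scores[1:]:
        new_best = {}
        for lab, score in column.items():
            # Pick the predecessor that maximizes path score plus transition.
            prev = max(best, key=lambda p: best[p][0] + trans_score(p, lab))
            total = best[prev][0] + trans_score(prev, lab) + score
            new_best[lab] = (total, best[prev][1] + [lab])
        best = new_best
    final = max(best, key=lambda lab: best[lab][0])
    return best[final]


# Three positions (e.g. syllables), two candidate labels each.
node_scores = [
    {"c1": -1.0, "c2": -2.0},
    {"c1": -2.0, "c2": -1.0},
    {"c1": -1.0, "c2": -3.0},
]


def trans_score(prev_label, label):
    # Made-up rule: staying on the same label is cheaper than switching.
    return -0.5 if prev_label == label else -2.0


total, path = best_path(node_scores, trans_score)
print(total, path)
```

With these toy scores, choosing the locally best label at each position would switch labels twice and pay the transition penalty both times, whereas the sentence-wide search keeps a single label throughout and ends with a higher total score.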

Page 125


A Study on the Contribution of Lexical Tones in Chinese LVCSR

Authors: Wai LAU, Y.W. WONG, W.K. LO, Tan LEE, P.C. CHING
Affiliation: Department of Electronic Engineering, The Chinese University of Hong Kong
Mailto: wlau@ee.cuhk.edu.hk
ywwong@ee.cuhk.edu.hk
wklo@ee.cuhk.edu.hk
tanlee@ee.cuhk.edu.hk
pcching@ee.cuhk.edu.hk

ABSTRACT

Tone is an indispensable component of tonal languages such as Chinese and other Asian languages. This paper presents a comprehensive study of the importance of lexical tones in two Chinese dialects, Mandarin and Cantonese, in large-vocabulary continuous speech recognition (LVCSR) tasks. The improvement in recognition after incorporating tone information is examined at different tone accuracies. It is shown that in the best scenario, when searching a syllable lattice for the character sequence with perfect tone information, an improvement in accuracy of 11.28% and 11.09% is achievable for Mandarin and Cantonese respectively. There is also an improvement of around 8.5% when searching a perfect syllable sequence for characters using perfect tone information.

Page 129


Corpus-based Cantonese Speech Synthesis With Non-uniform Units

Authors: K.M. LAW, K.Y. KWAN, Tan LEE
Affiliation: Department of Electronic Engineering, The Chinese University of Hong Kong
Mailto: kmlaw@ee.cuhk.edu.hk
kykwan@ee.cuhk.edu.hk
tanlee@ee.cuhk.edu.hk

ABSTRACT

This paper presents a corpus-based approach to Cantonese text-to-speech synthesis. We make use of a large corpus of recordings of radio broadcast news. An acoustic inventory is built from speech segments extracted from this corpus. The extracted units are non-uniform in their linguistic length: they include lexical words and monosyllables with tones. The acoustic units are labeled with a set of linguistic attributes that mainly describe their phonetic and prosodic context. Speech synthesis is performed by simple concatenation of the best-matching units available, without any kind of signal modification. The results of a subjective listening test on a preliminary implementation of the proposed method are reported.

Page 133


Linguistic Features Selection in Fundamental Frequency Patterns

Authors: Yiqiang CHEN, Wen GAO, Tingshao ZHU
Affiliation: Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Dept. of Computing Science, University of Alberta, Edmonton
Mailto: yqchen@ict.ac.cn
wgao@ict.ac.cn
tszhu@cs.ualberta.ca

ABSTRACT

Prosodic pattern generation and prediction are important for synthesizing natural-sounding speech from input Chinese text. In this paper, typical pitch models are first clustered from a large database of real speech. We then propose several methods for linguistic feature selection, including a rough set method and a Bayesian belief network, whose output can be used directly to predict pitch, energy, and duration patterns. The two methods are compared and, to overcome the disadvantages of each, their results are combined and the most important features are encoded into the Bayesian belief network. After learning, experiments show that F0 model prediction based on the selected features matches the original for most pitches.

Page 137