ISCA Archive: International Symposium on Chinese Spoken Language Processing (ISCSLP 2000), Fragrant Hill Hotel, Beijing
ABSTRACT
Spoken Language Understanding (SLU) is a key component of
spoken dialogue systems. One popular SLU method uses a continuous speech recognizer
in which Part-Of-Speech (POS) tagging is employed to determine the underlying word-class
sequences. We present here a Word-Class Stochastic Model (WCSM) to describe the temporal
word/word-class sequences, which fits into the standard paradigm of Hidden Markov
Models (HMMs). The model is trained on a general-purpose,
large-vocabulary, labeled corpus, which makes the model comparatively easy to
construct. We apply the model to a prototype dialogue system named EasyNav, where the use of
domain-specific knowledge, i.e., semantically meaningful keywords, helps to increase the speed
and accuracy of the POS tagging process.
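As a rough illustration of how word-class tagging fits the HMM paradigm, the sketch below decodes the most likely class sequence with the Viterbi algorithm. The class names, the toy probabilities, and the function name `viterbi` are all assumptions made for this example; they are not taken from the WCSM or from EasyNav.

```python
import math

def viterbi(words, classes, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely word-class sequence for `words` under an HMM."""
    def lp(p):
        # Log probability; unseen events get a small floor instead of zero.
        return math.log(max(p, floor))

    # best[c] = (log score of the best path ending in class c, that path)
    best = {c: (lp(start_p.get(c, 0)) + lp(emit_p[c].get(words[0], 0)), [c])
            for c in classes}
    for w in words[1:]:
        new_best = {}
        for c in classes:
            # Pick the best predecessor class for c.
            score, path = max(
                (best[prev][0] + lp(trans_p[prev].get(c, 0)), best[prev][1])
                for prev in classes)
            new_best[c] = (score + lp(emit_p[c].get(w, 0)), path + [c])
        best = new_best
    return max(best.values())[1]

# Hypothetical toy model with three word classes.
CLASSES = ["VERB", "PLACE", "OTHER"]
START_P = {"VERB": 0.6, "PLACE": 0.1, "OTHER": 0.3}
TRANS_P = {
    "VERB": {"VERB": 0.1, "PLACE": 0.6, "OTHER": 0.3},
    "PLACE": {"VERB": 0.3, "PLACE": 0.2, "OTHER": 0.5},
    "OTHER": {"VERB": 0.4, "PLACE": 0.4, "OTHER": 0.2},
}
EMIT_P = {
    "VERB": {"go": 0.7, "to": 0.05},
    "PLACE": {"library": 0.8},
    "OTHER": {"to": 0.9, "go": 0.05},
}

tags = viterbi(["go", "to", "library"], CLASSES, START_P, TRANS_P, EMIT_P)
print(tags)
```

In a real system the transition and emission tables would be estimated from a labeled corpus rather than written by hand.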
ABSTRACT
In this paper we present a system named EasyCmd that
provides voice navigation on the desktop of the Microsoft Windows 9x system. The speech recognition
engine for EasyCmd is very similar to that of a dictation machine. The Statistical Knowledge
Based Frame Synchronous Search (SKBFSS) algorithm and Word Search Tree (WST) technologies
are applied for acoustic decoding. The Recognition Score Gap (RSG) is used for rejection. We
also describe the techniques for
monitoring the system, collecting vocabulary and simulating system operations, which are
essential to enhance the desktop with voice commands.
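The abstract does not define the Recognition Score Gap. One common reading of such a measure is the difference between the best and second-best hypothesis scores, with a small gap treated as unreliable. The sketch below assumes that reading; the function name `accept_by_score_gap`, the threshold, and the example scores are hypothetical.

```python
def accept_by_score_gap(scored_hyps, min_gap):
    """Return the best hypothesis text, or None if it should be rejected.

    scored_hyps: list of (text, score) pairs, higher score = better.
    min_gap: assumed minimum gap between the top two scores for acceptance.
    """
    ranked = sorted(scored_hyps, key=lambda h: h[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        # No competitor to compare against; accept by convention here.
        return ranked[0][0]
    gap = ranked[0][1] - ranked[1][1]
    return ranked[0][0] if gap >= min_gap else None

# A clear winner is accepted; a near tie is rejected.
print(accept_by_score_gap([("open file", -120.0), ("open mail", -150.0)], 10.0))
print(accept_by_score_gap([("open file", -120.0), ("open mail", -123.0)], 10.0))
```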
ABSTRACT
It is well known that the word accuracy of a speaker-independent (SI) continuous speech recognition system is not good enough for many real-world applications because of the many interference factors in the speech signal, such as pronunciation variation across speakers and different kinds of environmental noise. Thus, analyzing how each interference factor acts and then eliminating its effect as far as possible via inverse processing may significantly improve the performance of the recognizer. In this paper, we conduct a series of experiments to find out how much room inverse processing research offers for improving the performance of an applied SI continuous speech recognizer. These experiments are arranged under ideal conditions, in which all kinds of interference effects are avoided as far as possible. After presenting the experimental results with the corresponding analysis, we give some suggestions for future research.
ABSTRACT
In this paper, 15 synthetic Chinese sentences provided by four typical Chinese TTS systems are analyzed and compared with natural speech. The results reveal remarkable differences between natural and synthetic speech, including in temporal organization and intonation, which are the essential cause of the degraded naturalness of synthetic speech. Therefore, the parser and prosody design are emphasized in developing a new Chinese TTS system.
ABSTRACT
Speech signal detection has a variety of applications in speech communication. Many methods have been proposed for that purpose. Most of these methods can achieve very high detection accuracy for a reasonable given false alarm probability in a clean speech environment. However, these methods become less reliable in noisy environments. The accurate detection of speech signals remains very difficult in the presence of noise and interference. In this paper, we propose a method that uses the likelihood estimated from a noise model to detect the speech signal. We address the problems of how to train a noise model, how to use the likelihood to detect the speech signal and how to use an on-line adaptation procedure to adapt the model parameters to a new noisy environment. We also present experimental results to demonstrate some of the properties and advantages of the method.
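The idea of detecting speech through the likelihood of a noise model can be sketched as follows. This is a minimal illustration, not the method of the paper: it assumes a single Gaussian over one scalar feature per frame (for example log energy), a fixed log-likelihood threshold, and a simple exponential update for on-line adaptation. The class name and all parameter values are hypothetical.

```python
import math

class NoiseModelDetector:
    """Flag a frame as speech when it is unlikely under a Gaussian noise model."""

    def __init__(self, noise_frames, threshold=-6.0, rate=0.05):
        # Train the noise model on frames known to contain only noise.
        n = len(noise_frames)
        self.mean = sum(noise_frames) / n
        self.var = max(sum((x - self.mean) ** 2 for x in noise_frames) / n, 1e-3)
        self.threshold = threshold  # assumed log-likelihood threshold
        self.rate = rate            # assumed adaptation rate

    def log_likelihood(self, x):
        return -0.5 * (math.log(2 * math.pi * self.var)
                       + (x - self.mean) ** 2 / self.var)

    def is_speech(self, x):
        if self.log_likelihood(x) < self.threshold:
            return True  # poorly explained by the noise model
        # Frame looks like noise: adapt the model toward the current environment.
        d = x - self.mean
        self.mean += self.rate * d
        self.var = max((1 - self.rate) * self.var + self.rate * d * d, 1e-3)
        return False

detector = NoiseModelDetector([1.0, 1.2, 0.8, 1.1, 0.9])
print([detector.is_speech(x) for x in [1.05, 0.95, 5.0, 4.8, 1.0]])
```

Updating the model only on frames classified as noise keeps speech from leaking into the noise statistics, at the cost of slower recovery if the noise level jumps abruptly.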
ABSTRACT
This paper presents an approach for fast, incremental speaker adaptation based on the MAP algorithm with a simplified MLLR module, which is used before MAP processing to minimize the mismatches caused by different speaking environments and inherent speaker characteristics. The most important advantage of the new approach is that it not only adapts quickly with a few short utterances but is also more accurate, even in a noisy environment. Experimental results show that the new approach reduces the word error rate by 20.3% in a quiet environment and by 27.6% in a noisy environment.
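For readers unfamiliar with MAP adaptation, the textbook update for a single Gaussian mean is a weighted average of the prior mean and the adaptation data. The sketch below shows only that standard formula for a scalar mean; it does not include the simplified MLLR module, whose details the abstract does not give. The function name and the prior weight `tau` are assumptions.

```python
def map_update_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean from adaptation frames.

    tau is the weight of the prior, expressed as an equivalent frame count.
    Returns (new_mean, new_tau) so that the result can serve as the prior
    for the next utterance, which makes the update incremental.
    """
    n = len(frames)
    new_mean = (tau * prior_mean + sum(frames)) / (tau + n)
    return new_mean, tau + n

# Two short "utterances" processed one after the other.
mean, tau = map_update_mean(0.0, [2.0, 2.0], tau=2.0)
mean, tau = map_update_mean(mean, [3.0], tau=tau)
print(mean, tau)
```

With little data the estimate stays close to the prior, and it moves toward the speaker's data as more frames arrive, which is what makes MAP suitable for incremental adaptation.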
ABSTRACT
This paper introduces our initial effort in building
a Mandarin acoustic model for a Chinese stock information retrieval system based on Intel's
LVCSR framework [1][2]. To build a robust and accurate system, a number of experiments
were conducted to find the optimal parameters at various levels such as front-end features,
phonetic transcription, etc. We conducted comparison experiments to find the optimal
bandwidth configuration
for the telephony acoustic model in general. To build an accurate task-specific model,
we introduce a hybrid context-dependent modeling approach in which the task-dependent training data
and the task-independent data are treated differently. The experimental
results on two task-specific applications show that the proposed modeling can produce
a significant WER reduction. The telephone corpora were collected at ICRC to improve the
robustness against both noise and channel effects.
ABSTRACT
Under noisy conditions, due to the redundancy of the speech signal, there are some spectral bands (Reliable Bands) whose local SNRs are high enough to be used effectively by a recognizer. Based on this, a novel, phonetically motivated Reliable Bands Guided similarity measure (RBG measure) is proposed. It has the following features. Firstly, for the reference spectrum, frequency bands which have larger absolute energy or sharper spectral peaks are marked as reliable bands. They are given more weight than the other bands in the definition of the RBG measure. Secondly, within each reliable band, the similarity between the formant positions and formant shapes of the test spectrum and the reference spectrum is explicitly modeled. Lastly, the measure can automatically emphasize spectral bands whose amplitudes change abruptly, which normally contain more reliable dynamic features of the speech signal. Both the RBG measure and the PMC method are tested on a speaker-independent, continuous Mandarin digit string recognition task, under 15 noisy conditions. Noises are drawn from the NOISEX92 database. On average, the RBG measure scores 4.22% lower in word accuracy than the PMC method above 0 dB. However, it outperforms the PMC method by 8.82% at 0 dB. More importantly, the RBG measure does not rely on accurate end-point detection and accurate modeling of the background noise, which are difficult tasks in themselves. To further improve the performance of the RBG measure, we discuss the possibility of integrating the findings of the Computational Auditory Scene Analysis (CASA) field into the current system. First, we review the theory of Auditory Scene Analysis, which was originally established by Bregman in 1990. We then discuss some computational models which were proposed for separating an input sound mixture into different sound streams. Finally, we consider the possibility of integrating such models into the RBG measure.
ABSTRACT
This paper briefly introduces MSDSKIT-1 (Multilingual
Spoken Dialogue System Version 1.0, developed by Kyoto Institute of Technology), which
currently integrates Japanese and Chinese. It is an upgraded version of SDSKIT-3 (Spoken
Dialogue System in Japanese). This system can provide services such as sightseeing
introduction, traffic guidance and hotel reservation. A user can also plan an itinerary
under the guidance of the system. We regard a spoken dialogue system as an integrated
system with a language-dependent speech interface and a language-independent dialogue
controller. When designing a multilingual spoken dialogue system, we must carefully consider
the linguistic characteristics of the particular language for the language-dependent
interface, for example, the syntactic structural features for the language parser.
Considerable effort has been made to extend SDSKIT-3 into a multilingual system (called
MSDSKIT-1). This paper presents this effort in two
aspects: (1) the Chinese speech recognizer and (2) the Chinese language parser.
ABSTRACT
Domain-specific dialogue systems are an important and commercially practical application of speech recognition technology, and reducing the search space is very helpful for improving accuracy and reducing search time in speech recognition. Adequate use of dialogue-state-dependent language models in dialogue systems can reduce the search space greatly if a reasonable prediction of the dialogue states is feasible, and makes a dialogue system more robust in real practice. This paper presents a novel method that reduces the search space by selecting among rule-based sub-language models based on dialogue states: at each conversation step, an adequate rule-based sub-language model is selected according to the context. Experiments show that the method is simple and effective in improving accuracy and recognition speed, and that it will be very useful in small and medium task domains.
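The effect of selecting a sub-language model by dialogue state can be illustrated with a very small sketch. Here each "sub-language model" is just a set of allowed phrases, and the selection is applied to an n-best list after recognition, whereas a real system would constrain the search itself. The state names, phrases, and the function `pick_hypothesis` are hypothetical and not taken from the paper.

```python
# Hypothetical rule-based sub-language models, one per dialogue state.
SUB_MODELS = {
    "ask_city": {"beijing", "shanghai", "guangzhou"},
    "ask_date": {"today", "tomorrow", "monday"},
    "confirm": {"yes", "no"},
}

def pick_hypothesis(state, hypotheses):
    """Return the best-scoring hypothesis allowed in the current state.

    hypotheses: list of (text, score) pairs, higher score = better.
    Returns None when no hypothesis fits the state's sub-language model.
    """
    allowed = SUB_MODELS[state]
    in_grammar = [(text, score) for text, score in hypotheses if text in allowed]
    if not in_grammar:
        return None
    return max(in_grammar, key=lambda h: h[1])[0]

nbest = [("no", -5.0), ("today", -3.0), ("yes", -4.0)]
print(pick_hypothesis("confirm", nbest))   # only "yes"/"no" compete here
print(pick_hypothesis("ask_date", nbest))  # only date words compete here
```

The same acoustic evidence yields different results in different states because each state admits only a small part of the overall vocabulary.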
ABSTRACT
Issues such as dialog structure, dialog act analysis and turn segmentation (that is, segmenting a turn into several sentences or utterance units) have not yet been successfully resolved, especially in spoken Chinese dialog. Our corpus consists of 94 telephone-recorded Chinese human-human dialogues (more than 3,000 turns) in the domain of room reservation. In this paper, we give some results of our analysis of the corpus. We concentrate on four important phenomena in spoken Chinese: sentence hyperbaton, sentence fragments, speech repairs and cue phrases. We think these four phenomena are essential for turn segmentation as well as for other problems.
ABSTRACT
Reliable pitch detection is important in Chinese speech recognition since Chinese is a tonal language. In this paper, several approaches to integrating pitch information are investigated. In a noise-free environment, conventional pitch estimators work quite well. In adverse conditions, however, the robustness of pitch detection algorithms is still a challenging problem. Our experimental results show that using pitch information improves performance in a clean environment. However, a substantial degradation in recognition accuracy is observed in adverse conditions due to the noise sensitivity of pitch estimators. Our experimental results indicate that a front end extracting higher-order cepstral coefficients provides the best results when testing recognition performance in Chinese.
ABSTRACT
This paper presents a keyword spotting method based on searching a syllable lattice structure. The Mandarin syllables are represented by initial-final models. By one-stage dynamic programming, an utterance is converted into a sequence of top-N candidate syllables. This yields a syllable lattice structure for the input utterance. A vocabulary of predefined keywords is represented as a set of syllable sequences. By searching for the syllable sequences of the keywords in the syllable lattice structure, we can spot the keywords in the utterance. A ranking and scoring algorithm is proposed for searching the keywords. Utterance verification for non-keyword rejection is also implicitly performed by the proposed algorithm.
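The lattice search described above can be sketched in a few lines. The example assumes the lattice is a list of time slots, each holding top-N `(syllable, score)` candidates, and it scores a match by the average candidate score with a threshold for rejection. The paper's actual ranking and scoring algorithm is not given in the abstract, so this scoring rule, the function name, and the data are stand-ins.

```python
def spot_keywords(lattice, keywords, min_avg_score):
    """Find keywords whose syllable sequence appears in consecutive lattice slots.

    lattice: list of slots, each a list of (syllable, score) candidates.
    keywords: dict mapping keyword name to its list of syllables.
    Returns (name, start_slot, avg_score) tuples, best first.
    """
    hits = []
    for name, syllables in keywords.items():
        for start in range(len(lattice) - len(syllables) + 1):
            total = 0.0
            for offset, syl in enumerate(syllables):
                slot = dict(lattice[start + offset])
                if syl not in slot:
                    break  # this start position cannot match
                total += slot[syl]
            else:
                avg = total / len(syllables)
                if avg >= min_avg_score:  # low scores are rejected
                    hits.append((name, start, avg))
    return sorted(hits, key=lambda h: h[2], reverse=True)

# Hypothetical top-2 syllable lattice for a four-syllable utterance.
LATTICE = [
    [("wo", 0.9), ("wu", 0.4)],
    [("yao", 0.8), ("you", 0.5)],
    [("bei", 0.7), ("pei", 0.6)],
    [("jing", 0.9), ("qing", 0.3)],
]
KEYWORDS = {"beijing": ["bei", "jing"], "nanjing": ["nan", "jing"]}

print(spot_keywords(LATTICE, KEYWORDS, min_avg_score=0.5))
```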
ABSTRACT
In this paper, the decision tree clustering method, using two different similarity measures as model splitting criteria, is applied to continuous Mandarin speech recognition for training right-context-dependent (RCD) sub-syllable HMM models. Instead of using phone-like units, we adopt initial and final sub-syllables as the basic recognition units. A large telephone-speech database, MAT-2000, is used to test the training method. A recognition rate of 67.3% was obtained for a 500-sentence test set. Compared with using context-independent (CI) models, a recognition rate improvement of 3.3% was achieved.
ABSTRACT
Massive quantities of spoken audio are becoming available on the web. For example, many radio and television stations now broadcast Internet-accessible content. Automatic recognition of spoken audio that has been degraded by the compression schemes that enable the delivery of streaming audio over the Internet could be of great interest for indexing and retrieval purposes. Considering the characteristics and monosyllabic structure of the Chinese language, a syllable-based framework for retrieving Mandarin broadcast news has been investigated at Academia Sinica, Taipei. This paper reports on our initial experiments on the recognition of Internet-accessible Mandarin broadcast news in two data types: RealAudio and TrueSpeech.