This paper presents empirical results of a corpus-based study that characterizes linguistic features of spontaneous Mandarin. Such results have been difficult to obtain because suitable speech material was lacking. From a linguistic point of view, word and syllable frequencies should provide important cues to spontaneous speech production, and the phonetic forms of frequent words and syllables in actual production need special investigation. An examination of syllable structure shows that the distribution of onset consonants, nuclei and coda consonants in the syllables often used in spontaneous Mandarin is similar across speakers. A segmental analysis further indicates how likely each segment is to be produced in spoken Mandarin.

1. INTRODUCTION

Conventionally, linguistic studies rely mainly on fieldwork to document language use, often with a research focus on pronunciation, lexicon and sentence grammar. The database construction methodology developed in corpus linguistics has recently made new approaches to analyzing spoken language possible. This matters especially for the study of spontaneous speech, which has been difficult to investigate and model with traditional research methods because of limitations of data size and database management. This paper uses a corpus of spontaneous speech to examine Mandarin as spoken in Taiwan.

What we report here is a new attempt to obtain linguistic characteristics of spontaneous Mandarin. The results are preliminary, but they have great potential to be developed into a deep and systematic understanding of spoken language production. The frequency of actively used words and syllables in spoken language provides useful cues for correct lexical selection when the available acoustic information is not clear enough to select words in the lexicon.
A lexicon for speech recognition systems, probably similar to the mental lexicon of a speaker, needs to store different phonetic forms of words, for instance forms that result from reduction, assimilation and contraction [4]. It is not realistic to consider all phonetic variations of all words listed in a standard dictionary, so frequent words are the ones we need to take into account first. Word and syllable frequencies as well as segmental analysis can be of great use, and this information can be obtained systematically from spoken corpora.

As for the notation used in this paper, lexical tones in Taiwan Mandarin have four marked realizations, (1) high level tone, (2) rising tone, (3) contour tone and (4) falling tone, plus the unmarked neutral tone (5). Different Chinese dialects have different numbers of lexical tones associated with different melodic values [2]. Throughout this paper, we use Pinyin to transcribe Mandarin words.

2. DATA AND GENERAL STATISTICS

The Mandarin Conversational Dialogue Corpus (MCDC) was collected at the Institute of Linguistics, Academia Sinica, from 2000 to 2001 [3]. It consists of eight transcribed conversations between strangers. The recorded speech data has a total length of approximately eight hours (the corpus will soon be released for public use). Because the Mandarin writing system does not use spaces to separate individual words, the transcripts first have to be segmented into words. To ensure that the segmentation results are consistent, we use the automatic word segmentation and tagging system developed by the Chinese Knowledge Information Processing Group at Academia Sinica [1] to segment the transcripts into words and to tag the segmented words syntactically. General statistics are listed in Table 1.
Table 1: General statistics of MCDC

Speaker  Sex  Age  Syllables  Word types  Word tokens  Syllable/word ratio
S-01     F    29       4,789         921        3,334  1.44
S-02     M    25       9,262       1,445        6,913  1.34
S-03     F    37       8,522       1,140        5,853  1.46
S-04     M    35       6,202         965        4,234  1.46
S-05     F    16       9,273       1,093        6,339  1.46
S-06     F    17       6,659         874        4,497  1.48
S-07     M    40       8,887       1,283        6,946  1.28
S-08     F    46       7,360       1,140        5,497  1.34
S-09     F    30       2,687         572        1,967  1.37
S-10     F    35      13,534       1,577        9,103  1.49
S-11     M    35       7,140       1,104        4,399
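The per-speaker counts in Table 1 could be derived from a segmented transcript roughly as sketched below. This is a minimal illustration, not the procedure used to build MCDC: the function name `general_statistics` and the input format (each word token given as a tuple of Pinyin syllables with tone digits) are assumptions made for the example, and the sample data is invented.

```python
from collections import Counter


def general_statistics(word_tokens):
    """Compute Table-1-style counts for one speaker.

    word_tokens: list of word tokens, each a tuple of Pinyin syllables
    with tone digits, e.g. ("zhi1", "dao4"). This input format is an
    assumption of the sketch, not the actual corpus format.
    """
    # Frequency of each distinct word (word types with their token counts).
    word_freq = Counter(word_tokens)
    # Frequency of each syllable, counted over all word tokens.
    syllable_freq = Counter(syl for word in word_tokens for syl in word)

    n_tokens = len(word_tokens)
    n_syllables = sum(syllable_freq.values())
    # Average number of syllables per word token; undefined for empty input.
    ratio = round(n_syllables / n_tokens, 2) if n_tokens else None

    return {
        "syllables": n_syllables,
        "word_types": len(word_freq),
        "word_tokens": n_tokens,
        "syllable_word_ratio": ratio,
        "word_frequency": word_freq,
        "syllable_frequency": syllable_freq,
    }


# Invented toy input: "wo3 / bu4 / zhi1 dao4 / wo3 / zhi1 dao4".
sample = [("wo3",), ("bu4",), ("zhi1", "dao4"), ("wo3",), ("zhi1", "dao4")]
stats = general_statistics(sample)
print(stats["syllables"], stats["word_types"],
      stats["word_tokens"], stats["syllable_word_ratio"])
```

The syllable/word ratio is simply the syllable count divided by the number of word tokens. For speaker S-01, for example, 4,789 syllables over 3,334 word tokens gives about 1.44, which agrees with the value listed in Table 1.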
Cite as: Tseng, S. (2004) Spontaneous Mandarin Production: Results of a Corpus-Based Study. Proc. International Symposium on Chinese Spoken Language Processing, 29-32
@inproceedings{tseng04_iscslp,
  author={ShuChuan Tseng},
  title={{Spontaneous Mandarin Production: Results of a Corpus-Based Study}},
  year=2004,
  booktitle={Proc. International Symposium on Chinese Spoken Language Processing},
  pages={29--32}
}