Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Broadcast News Transcription in Mandarin

Langzhou Chen, Lori Lamel, Gilles Adda, Jean-Luc Gauvain

Spoken Language Processing Group, LIMSI-CNRS, Orsay, France

This paper describes our work on developing a Mandarin broadcast news transcription system. The main focus is porting the LIMSI American English broadcast news transcription system to Mandarin Chinese. The system consists of an audio partitioner and an HMM-based continuous speech recognizer. The acoustic models were trained on about 24 hours of data from the 1997 Hub4 Mandarin corpus, available via the LDC. In addition to the transcripts, the language models were trained on the Mandarin Chinese News Corpus, which contains about 186 million characters. We investigate recognition performance as a function of lexicon size, with and without tone in the lexicon, and with a topic-dependent language model. The character error rate on the DARPA 1997 test set is 18.1% using a lexicon with 3 tone levels and a topic-based language model.
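
For readers unfamiliar with the metric quoted above, a character error rate is conventionally the character-level edit distance (substitutions, deletions, insertions) between the reference and the recognizer output, divided by the number of reference characters. The following is a minimal sketch under that assumption; it is not the scoring tool used in the paper, and the function names and the example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters.

    Whitespace is dropped first (an assumption of this sketch), since
    Mandarin text has no inherent word boundaries.
    """
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion over 6 characters.
    print(character_error_rate("今天天气很好", "今天天汽好"))
```

Scoring at the character level avoids any dependence on a particular word segmentation, which is why it is a common choice for Mandarin transcription.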

Bibliographic reference. Chen, Langzhou / Lamel, Lori / Adda, Gilles / Gauvain, Jean-Luc (2000): "Broadcast news transcription in Mandarin", in ICSLP-2000, vol. 2, 1015-1018.