ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition

April 13-16, 2003
Tokyo Institute of Technology, Tokyo, Japan

MATBN 2002: A Mandarin Chinese Broadcast News Corpus

Hsin-min Wang

Institute of Information Science, Academia Sinica, Taipei, Taiwan

The MATBN 2002 Mandarin Chinese broadcast news corpus contains 40 hours of broadcast news from the Public Television Service Foundation (Taiwan) with corresponding transcripts. The primary motivation for this collection is to provide training and test data for evaluating continuous speech recognition in the broadcast domain. We expect to collect and process 220 hours of Mandarin Chinese broadcast news speech over three years. At the end of the first year, the 40-hour corpus has been completed on schedule and is expected to be released in early 2003. According to our plan, we expect to release the interim 120-hour corpus in late 2003 and the final 220-hour corpus in late 2004.


Bibliographic reference. Wang, Hsin-min (2003): "MATBN 2002: A Mandarin Chinese broadcast news corpus", in SSPR-2003, paper TAP3.