7th International Conference on Spoken Language Processing

September 16-20, 2002
Denver, Colorado, USA

On Developing New Text and Audio Corpora and Speech Recognition Tools for the Turkish Language

Özgül Salor (1), Bryan Pellom (2), Tolga Çiloglu (1), Kadri Hacioglu (2), Mübeccel Demirekler (1)

(1) Middle East Technical University, Turkey; (2) University of Colorado at Boulder, USA

This paper describes recent work towards the development of new corpora and tools for Turkish speech research. This effort represents an ongoing collaboration between the Center for Spoken Language Research (CSLR) at the University of Colorado and the Department of Electrical Engineering at the Middle East Technical University (METU). A new text corpus built from Turkish newspaper text is described. In addition, a 193-speaker audio corpus and a pronunciation lexicon for the Turkish language have been developed. We then describe our initial work towards porting Sonic, the CSLR speech recognition system, to the Turkish language. Results are shown for phonetic alignment and phoneme recognition accuracy using the newly constructed corpus and speech tools. For the Turkish audio corpus, 91.2% of the automatically labeled phoneme boundaries are placed within 20 msec of the hand-labeled locations. Finally, a phoneme recognition error rate of 29.3% is demonstrated.
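As an illustration of the boundary-agreement figure quoted in the abstract, the following is a minimal sketch of how such a percentage could be computed. It is not taken from the paper: the function name `boundary_agreement`, the one-to-one pairing of automatic and hand-labeled boundaries, and the inclusive 20 msec threshold are all assumptions made for this example.

```python
def boundary_agreement(auto_ms, hand_ms, tolerance_ms=20):
    """Return the fraction of automatic boundaries that fall within
    tolerance_ms of the corresponding hand-labeled boundary.

    Assumption (not from the paper): both lists hold boundary times in
    milliseconds for the same phoneme sequence, so entry i of one list
    is paired with entry i of the other.
    """
    if len(auto_ms) != len(hand_ms):
        raise ValueError("boundary lists must have the same length")
    if not auto_ms:
        return 0.0
    # Count pairs whose absolute deviation is within the tolerance
    # (the threshold is treated as inclusive here, which is an assumption).
    hits = sum(1 for a, h in zip(auto_ms, hand_ms) if abs(a - h) <= tolerance_ms)
    return hits / len(auto_ms)


# Hypothetical boundary times in milliseconds, for illustration only.
auto = [100, 250, 410, 600]
hand = [105, 245, 440, 600]
agreement = boundary_agreement(auto, hand)
```

With these made-up values, three of the four boundaries deviate by 20 msec or less, so `agreement` is 0.75. The figure reported in the abstract would correspond to a value of 0.912 under this kind of measure.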

Bibliographic reference. Salor, Özgül / Pellom, Bryan / Çiloglu, Tolga / Hacioglu, Kadri / Demirekler, Mübeccel (2002): "On developing new text and audio corpora and speech recognition tools for the Turkish language", In ICSLP-2002, 349-352.