An Amharic speech corpus for large vocabulary continuous speech recognition

Solomon Teferra Abate, Wolfgang Menzel, Bairu Tafila

Amharic is the official language of Ethiopia. It belongs to the Semitic language family and is characterized by a fairly homogeneous phonology that distinguishes 234 distinct consonant-vowel (CV) syllables.

Since no Amharic speech corpus of any kind existed, we developed a read-speech corpus using a phonetically rich and balanced text database. To prepare the text database, we used the archive of the EthioZena website, which consists of selected articles from well-known newspapers and magazines published in Amharic. The archive was cleaned semi-automatically.
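
As an illustration of what selecting a "phonetically rich" text can involve, the following is a minimal sketch that assumes a greedy coverage heuristic over CV syllables. It is not the selection procedure used for this corpus, which is not described here. The function name `select_rich_sentences`, the sentence ids, and the toy syllabified sentences are hypothetical, and syllabification is assumed to have been done beforehand. The sketch addresses coverage only, not balance of syllable frequencies.

```python
def select_rich_sentences(sentences, max_sentences):
    """Greedily pick sentences that add the most not-yet-covered syllables.

    `sentences` maps a sentence id to its list of syllables
    (hypothetical input; syllabification is assumed to be done already).
    Returns the selected ids in pick order and the set of covered syllables.
    """
    covered = set()
    selected = []
    remaining = dict(sentences)
    while remaining and len(selected) < max_sentences:
        # Choose the sentence contributing the most new syllable types;
        # iterating over sorted ids makes ties resolve to the smallest id.
        best_id = max(
            sorted(remaining),
            key=lambda sid: len(set(remaining[sid]) - covered),
        )
        gain = set(remaining.pop(best_id)) - covered
        if not gain:
            break  # no remaining sentence adds coverage
        covered |= gain
        selected.append(best_id)
    return selected, covered


# Toy example with made-up romanized syllables (not real corpus data).
toy = {
    "s1": ["se", "la", "mi"],
    "s2": ["se", "la"],
    "s3": ["be", "te", "se"],
}
selected, covered = select_rich_sentences(toy, max_sentences=2)
print(selected, sorted(covered))
```

With 234 distinct CV syllables as the target inventory, such a loop could in principle be run until every syllable type is covered or a sentence budget is reached; a balanced selection would additionally need to track syllable counts rather than types alone.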

Like other standard speech corpora, such as WSJCAM0, the Amharic speech corpus contains a training set, a speaker adaptation set, and test sets (development and evaluation test sets, each for 5,000- and 20,000-word vocabularies). The speech was recorded in Ethiopia in an office environment and segmented semi-automatically. The corpus is now being used for experiments with syllable- and phone-based LVCSR for Amharic.

