Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech

Omnia Abdo, Sherif Abdou, Mervat Fashal


The present research aims to build an audio-visual corpus of Modern Standard Arabic (MSA). The corpus is annotated both phonetically and visually and is dedicated to emotional speech processing studies. Building the corpus consists of five main stages: speaker selection, sentence selection, recording, annotation, and evaluation. 500 sentences were carefully selected based on their phonemic distribution. The speaker was instructed to read the same 500 sentences with six emotions (happiness, sadness, fear, anger, inquiry, and neutral). A sample of 50 sentences was selected for annotation. The corpus was evaluated through three subjective modules: audio-only, visual-only, and audio-visual.
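The abstract does not detail how the 500 phonemically balanced sentences were chosen. A common approach for this kind of selection is a greedy set-cover over phoneme (or diphone) coverage; the sketch below is a hypothetical illustration of that technique, not the authors' actual procedure.

```python
from collections import Counter

def greedy_select(sentences_phonemes, k):
    """Greedily pick k sentences, each time choosing the sentence that
    adds the most not-yet-covered phonemes.

    sentences_phonemes: list of phoneme lists, one per candidate sentence.
    Returns the indices of the chosen sentences.
    """
    covered = Counter()          # phoneme -> times covered so far
    remaining = dict(enumerate(sentences_phonemes))
    chosen = []
    for _ in range(min(k, len(remaining))):
        best_i, best_gain = None, -1
        for i, phones in remaining.items():
            # Gain = number of distinct phonemes this sentence would newly cover.
            gain = sum(1 for p in set(phones) if covered[p] == 0)
            if gain > best_gain:
                best_i, best_gain = i, gain
        chosen.append(best_i)
        covered.update(remaining.pop(best_i))
    return chosen
```

In practice one would iterate over diphone or triphone units rather than single phonemes, and weight units by their frequency in a large text corpus so that rare units are not over-represented.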

The corpus evaluation showed that the happiness, anger, and inquiry emotions were better recognized visually (94%, 96%, and 96%) than audibly (63.6%, 74%, and 74%), with audio-visual evaluation scores of 96%, 89.6%, and 80.8%. Sadness and fear, on the other hand, were better recognized audibly (76.8% and 97.6%) than visually (58% and 78.8%), with audio-visual evaluation scores of 65.6% and 90%.
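The reported recognition rates can be tabulated to make the modality comparison explicit; the figures below are taken directly from the abstract, and the snippet simply derives which single modality (audio vs. visual) scored higher per emotion.

```python
# Subjective recognition rates (%) reported in the corpus evaluation.
scores = {
    "happiness": {"audio": 63.6, "visual": 94.0, "audio-visual": 96.0},
    "anger":     {"audio": 74.0, "visual": 96.0, "audio-visual": 89.6},
    "inquiry":   {"audio": 74.0, "visual": 96.0, "audio-visual": 80.8},
    "sadness":   {"audio": 76.8, "visual": 58.0, "audio-visual": 65.6},
    "fear":      {"audio": 97.6, "visual": 78.8, "audio-visual": 90.0},
}

# For each emotion, pick the stronger single modality.
best_modality = {
    emotion: max(("audio", "visual"), key=lambda m: rates[m])
    for emotion, rates in scores.items()
}
```

This reproduces the paper's observation: happiness, anger, and inquiry are predominantly visual, while sadness and fear are predominantly auditory.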


DOI: 10.21437/Interspeech.2017-1357

Cite as: Abdo, O., Abdou, S., Fashal, M. (2017) Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech. Proc. Interspeech 2017, 3767-3771, DOI: 10.21437/Interspeech.2017-1357.


@inproceedings{Abdo2017,
  author={Omnia Abdo and Sherif Abdou and Mervat Fashal},
  title={Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3767--3771},
  doi={10.21437/Interspeech.2017-1357},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1357}
}