The present research aims to build an MSA audio-visual corpus. The
corpus is annotated both phonetically and visually and dedicated to
emotional speech processing studies. The building of the corpus consists
of five main stages: speaker selection, sentence selection, recording,
annotation and evaluation. 500 sentences were carefully selected based
on their phonemic distribution. The speaker was instructed to read
the same 500 sentences with six emotions (Happiness, Sadness,
Fear, Anger, Inquiry and Neutral). A sample of 50
sentences was selected for annotation. The corpus was evaluated with
three subjective modules: audio-only, visual-only and audio-visual.
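The abstract does not detail how the 500 sentences were chosen beyond "phonemic distribution". A common approach to phonemically balanced selection is greedy coverage, sketched below; the function name and data layout are illustrative assumptions, not the authors' actual procedure:

```python
def greedy_select(sentences, phonemes, k):
    """Pick k sentences that greedily maximize phoneme coverage.

    sentences: list of sentence ids
    phonemes:  dict mapping sentence id -> set of phonemes it contains
    k:         number of sentences to select
    """
    covered = set()   # phonemes covered so far
    chosen = []
    remaining = list(sentences)
    for _ in range(min(k, len(remaining))):
        # Choose the sentence contributing the most unseen phonemes.
        best = max(remaining, key=lambda s: len(phonemes[s] - covered))
        chosen.append(best)
        covered |= phonemes[best]
        remaining.remove(best)
    return chosen, covered
```

In practice a selection pass like this is usually followed by a check that the resulting phoneme frequencies match the target language distribution, not just raw coverage.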
The evaluation showed that happiness, anger and inquiry were recognized
better visually (94%, 96% and 96%) than audibly (63.6%, 74% and 74%),
with audio-visual scores of 96%, 89.6% and 80.8%. Sadness and fear, on
the other hand, were recognized better audibly (76.8% and 97.6%) than
visually (58% and 78.8%), with audio-visual scores of 65.6% and 90%.
Cite as: Abdo, O., Abdou, S., Fashal, M. (2017) Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech. Proc. Interspeech 2017, 3767-3771, doi: 10.21437/Interspeech.2017-1357