CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube

K. Abidi, M.A. Menacer, Kamel Smaïli


This paper addresses the comparability of comments extracted from YouTube. The comments are in spoken Algerian, which may be local Arabic, Modern Standard Arabic, or French. This diversity of expression raises many data-processing problems. In this article, several alignment methods are proposed and tested. The method that aligns best is a Word2Vec-based approach applied iteratively. This repeated application of Word2Vec significantly improves the comparability results: a dictionary-based approach yields a Recall of 4, whereas our approach achieves a Recall of 33 at rank 1. Using this approach, we built CALYOU, a comparable corpus of spoken Algerian, from YouTube.
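
To illustrate the Recall-at-rank measure quoted in the abstract, here is a minimal Python sketch. It assumes that each source comment has exactly one gold-standard match and that an alignment method returns a ranked list of candidates per comment. The function name `recall_at_k`, the comment identifiers, and the toy data are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
def recall_at_k(ranked_candidates, gold, k=1):
    """Fraction of source comments whose gold match appears in the top-k candidates.

    ranked_candidates: dict mapping a source id to its candidates, best first.
    gold: dict mapping a source id to its single correct target id (assumption).
    """
    if not gold:
        return 0.0
    hits = sum(
        1
        for source, target in gold.items()
        if target in ranked_candidates.get(source, [])[:k]
    )
    return hits / len(gold)


# Toy data, invented for illustration only.
ranked = {
    "c1": ["d1", "d4"],
    "c2": ["d3", "d2"],
    "c3": ["d5", "d3"],
}
gold = {"c1": "d1", "c2": "d2", "c3": "d3"}

print(recall_at_k(ranked, gold, k=1))  # only c1 is matched at rank 1
print(recall_at_k(ranked, gold, k=2))  # all three are matched within rank 2
```

Under this definition, a score reported "at rank 1" counts a comment as correctly aligned only when its best-ranked candidate is the gold match.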


DOI: 10.21437/Interspeech.2017-1305

Cite as: Abidi, K., Menacer, M., Smaïli, K. (2017) CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube. Proc. Interspeech 2017, 3742-3746, DOI: 10.21437/Interspeech.2017-1305.


@inproceedings{Abidi2017,
  author={K. Abidi and M.A. Menacer and Kamel Smaïli},
  title={CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3742--3746},
  doi={10.21437/Interspeech.2017-1305},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1305}
}