Third Workshop on Spoken Language Technologies for Under-resourced Languages

Cape Town, South Africa
May 7-9, 2012

A Study of a Non-Resourced Language: An Algerian Dialect

K. Meftouh (1), N. Bouchemal (1), K. Smaïli (2)

(1) UBMA, Badji Mokhtar University, Informatic Department, Annaba, Algeria
(2) LORIA, Campus scientifique, Vandoeuvre Lès Nancy, France

The objective of this paper is to present an under-resourced language related to Arabic. In several countries across the Arab world, no one speaks Modern Standard Arabic in everyday life. People speak a language that is derived from Arabic but can differ considerably from Modern Standard Arabic, which is reserved for official broadcast news, official speeches and similar settings. Studying a dialect is more difficult than studying other natural languages because the dialect is not written. This paper presents a linguistic study of an Algerian Arabic dialect, namely the dialect of Annaba (AD). To our knowledge, this is the first study of an Algerian dialect. The paper also presents the methodology used to build a parallel corpus of Modern Standard Arabic and the Arabic dialect, with the aim of developing machine translation for this language pair. This preliminary work is intended to draw the attention of the scientific community to this difficult and challenging problem. Realistic machine translation for Arabic should focus principally on dialects. This is our medium-term objective.

Index Terms: Standard Arabic, Algerian Arabic dialect, parallel corpus, dialect of Annaba, Levenshtein distance, machine translation system.

