ISCA Tutorial and Research Workshop on Experimental Linguistics (ExLing 2008)

Athens, Greece
August 25-27, 2008

A New Arabic Stemming Algorithm

Eiman Tamah AlShammari, Jessica Lin

Department of Computer Science, George Mason University, USA

Text processing is a vital step in information retrieval, text mining, and natural language processing. It includes several stages, such as normalization, stop word removal, and stemming. Stemming is the process of reducing a word to its root. Because languages differ in structure and rules, the task of stemming is language-dependent. This research introduces a new stemming algorithm for the Arabic language. Arabic is one of the most complex languages, both spoken and written. However, it is also one of the most common languages in the world. It is the base from which many other languages are derived. Despite the wide usage of the language, technology development for Arabic has been limited. The main reason lies in the formation rules of Arabic, as the Arabic language exhibits a very complicated morphological structure. Existing Arabic stemmers suffer from high stemming error rates. They blindly stem all words and perform poorly, especially with compound words, proper nouns, and foreign Arabized words. The main cause of this problem is the stemmer's lack of knowledge of the word's lexical category (e.g., noun, verb, preposition). This paper presents a new stemming algorithm that relies on both Arabic morphology and Arabic syntax. The automated addition of syntactic knowledge reduces both stemming error and stemming cost. Additionally, the new algorithm automatically creates its own list of proper nouns and compound words from the processed corpus.
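The preprocessing stages named above (normalization, stop word removal, stemming) can be sketched as follows. This is a minimal illustration and not the algorithm proposed in this paper: the stop word list, the prefix list, the minimum remaining length of three letters, and the function names normalize, light_stem, and preprocess are all assumptions made for the example. The protected set stands in for a list of proper nouns and compound words that should be left unstemmed.

```python
import re

# Hypothetical, illustrative resources; none of these lists come from the paper.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # short vowel marks and tatweel
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda or hamza
STOP_WORDS = {"في", "من", "على"}                     # a tiny sample stop word list
PREFIXES = ("ال", "و")                               # definite article, conjunction


def normalize(word):
    """Remove diacritics and tatweel, and map alef variants to bare alef."""
    word = DIACRITICS.sub("", word)
    return ALEF_VARIANTS.sub("\u0627", word)


def light_stem(word):
    """Strip at most one known prefix, keeping at least three letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return word[len(prefix):]
    return word


def preprocess(tokens, protected=frozenset()):
    """Normalize, drop stop words, and stem every word not in `protected`."""
    result = []
    for token in tokens:
        word = normalize(token)
        if word in STOP_WORDS:
            continue
        # Words in the protected set (e.g. proper nouns) are kept unstemmed.
        result.append(word if word in protected else light_stem(word))
    return result
```

Calling preprocess(["الكتاب", "في", "المكتبة"]) drops the stop word and strips the definite article from the other two tokens. Passing protected={"الكويت"} leaves that proper noun intact, whereas a stemmer that blindly strips prefixes would truncate it.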

