SLTU-2008 - First International Workshop on Spoken Languages Technologies for Under-Resourced Languages

Hanoi, Vietnam
May 5-7, 2008

Automatic Acquisition of Lexical Semantic Information Using Medium to Small Corpora

Mathias Rossignol (1), Pascale Sebillot (2)

(1) International Research Center MICA, Vietnam; (2) IRISA, France

Since many speech and text processing techniques can be ported from one language to another with a limited amount of work, the most daunting task for NLP and SP practitioners becomes building the resources those tools need to operate. In particular, the constitution of “high-level” resources, such as advanced corpus annotations or linguistically motivated lexicons, can be extremely work-intensive. We present in this paper a system to assist the creation of semantic lexicons using small to medium-sized corpora, thanks to the combination of semantic class constitution and topic detection, and the development of specific statistical data analysis techniques for relatively small datasets. By reducing the amount of data needed for semi-automatic semantic lexicon acquisition, traditionally applied to corpora of 100 million words or more, we make this help for lexical resource acquisition applicable to the case of under-resourced languages.

Index Terms: semantic classes, small corpora, statistical data analysis, topic detection

Bibliographic reference.  Rossignol, Mathias / Sebillot, Pascale (2008): "Automatic acquisition of lexical semantic information using medium to small corpora", In SLTU-2008, 92-97.