SLTU-2008 - First International Workshop on Spoken Languages Technologies for Under-Resourced Languages
Since many speech and text processing techniques can be ported from one language to another with a limited amount of work, the most daunting task for NLP and SP practitioners becomes building the resources those tools need to operate. In particular, the constitution of high-level resources, such as advanced corpus annotations or linguistically motivated lexicons, can be extremely work-intensive. We present in this paper a system to assist the creation of semantic lexicons using small to medium-sized corpora, thanks to the combination of semantic class constitution and topic detection, and the development of specific statistical data analysis techniques for relatively small datasets. By reducing the amount of data needed for semi-automatic semantic lexicon acquisition, traditionally applied to corpora of 100 million words or more, we make this aid to lexical resource acquisition applicable to under-resourced languages.
Index Terms. Semantic classes, small corpora, statistical data analysis, topic detection
Bibliographic reference. Rossignol, Mathias / Sebillot, Pascale (2008): "Automatic acquisition of lexical semantic information using medium to small corpora", In SLTU-2008, 92-97.