Symposium on Machine Learning in Speech and Language Processing (MLSLP)
Bellevue, WA, USA
June 27, 2011
A Non-Parametric Bayesian Approach to Inflectional Morphology
Jason Eisner, Markus Dreyer
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
We learn how the words of a language are inflected, given a plain text corpus plus
a small supervised set of known paradigms. The approach is principled, simply
performing empirical Bayesian inference under a straightforward generative model
that explicitly describes the generation of:
- The grammar and subregularities of the language (via many finite-state
transducers coordinated in a Markov Random Field).
- The infinite inventory of types and their inflectional paradigms
(via a Dirichlet Process Mixture Model based on the above grammar).
- The corpus of tokens (by sampling inflected words from the above
inventory).
Our inference algorithm cleanly integrates several techniques that handle the
different levels of the model: classical dynamic programming operations on the
finite-state transducers, loopy belief propagation in the Markov Random Field,
and MCMC and MCEM for the non-parametric Dirichlet Process Mixture Model.
We will build up the various components of the model in
turn, showing experimental results along the way for several intermediate tasks
such as lemmatization, transliteration, and inflection. Finally, we show that
modeling paradigms jointly with the Markov Random Field, and learning from
unannotated text corpora via the non-parametric model, significantly improves
the quality of predicted word inflections.
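As a rough illustration of the third level of the generative model (tokens sampled from an infinite inventory of types), the following minimal sketch draws a corpus from a Chinese restaurant process, one standard way to sample from a Dirichlet process. It is not the authors' implementation: the function names, the concentration value, and the toy base distribution `toy_base_sampler` (a stand-in for the finite-state grammar) are all assumptions made for the example, and the real model generates whole inflectional paradigms rather than isolated strings.

```python
import random
from collections import Counter


def sample_corpus(n_tokens, alpha, base_sampler, rng):
    """Draw n_tokens from a Chinese restaurant process with concentration alpha.

    base_sampler(rng) returns a fresh draw from the base distribution.
    """
    tokens = []
    for i in range(n_tokens):
        # With probability alpha / (i + alpha), generate a new draw from the base
        # distribution. For i == 0 this probability is 1, so the list is never
        # empty when we reach the else branch.
        if rng.random() < alpha / (i + alpha):
            tokens.append(base_sampler(rng))
        else:
            # Otherwise repeat an earlier token chosen uniformly at random,
            # which selects each existing type in proportion to its count.
            tokens.append(rng.choice(tokens))
    return tokens


def toy_base_sampler(rng):
    # Hypothetical stand-in for the grammar: a random 3-letter stem plus a suffix.
    stem = "".join(rng.choice("abcdefgh") for _ in range(3))
    suffix = rng.choice(["", "s", "ed", "ing"])
    return stem + suffix


rng = random.Random(0)
corpus = sample_corpus(1000, alpha=5.0, base_sampler=toy_base_sampler, rng=rng)
counts = Counter(corpus)
print(len(counts), "distinct types;", counts.most_common(3))
```

The sketch shows the "rich get richer" behavior typical of such models: a few types account for most tokens, while new types keep appearing at a slowly decreasing rate.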
Bibliographic reference.
Eisner, Jason / Dreyer, Markus (2011):
"A non-parametric Bayesian approach to inflectional morphology",
In MLSLP-2011 (abstract).