Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Deriving a Bi-Lingual Dictionary from Raw Transcription Data

Peter Juel Henrichsen

Copenhagen Business School, Denmark

We present a bigram-based method for deriving bi-lingual dictionary entries from two corpora of spontaneous speech, as represented in transcriptions. Unlike approaches such as [1], our method does not require translated or otherwise aligned texts: the corpora representing the source and target languages may be unrelated with respect to size, vocabulary richness, frequency distribution, and activity type. Examples are given using Danish and Swedish transcription data, with some hints of English. We conclude with a discussion of the use of corpus-driven methods in language preservation and literation projects.

Bibliographic reference. Henrichsen, Peter Juel (2005): "Deriving a bi-lingual dictionary from raw transcription data", in INTERSPEECH-2005, 2229-2232.