ISCA Archive Interspeech 2005

Deriving a bi-lingual dictionary from raw transcription data

Peter Juel Henrichsen

We present a bigram-based method for deriving bi-lingual dictionary entries from two corpora of spontaneous speech (as represented in transcriptions). In contrast to, e.g., [1], our method does not require translated or otherwise aligned texts; the corpora representing the source and target languages may be unrelated with respect to size, vocabulary richness, frequency distribution, and activity type. Examples are given using Danish and Swedish transcription data (with hints of English). We conclude with a discussion of the use of corpus-driven methods in language preservation and literation projects.


doi: 10.21437/Interspeech.2005-357

Cite as: Henrichsen, P.J. (2005) Deriving a bi-lingual dictionary from raw transcription data. Proc. Interspeech 2005, 2229-2232, doi: 10.21437/Interspeech.2005-357

@inproceedings{henrichsen05_interspeech,
  author={Peter Juel Henrichsen},
  title={{Deriving a bi-lingual dictionary from raw transcription data}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={2229--2232},
  doi={10.21437/Interspeech.2005-357}
}