ISCA Archive SLTU 2014

Towards automatic speech recognition without pronunciation dictionary, transcribed speech and text resources in the target language using cross-lingual word-to-phoneme alignment

Felix Stahlberg, Tim Schlippe, Stephan Vogel, Tanja Schultz

In this paper we tackle the task of bootstrapping an Automatic Speech Recognition system without an a priori given language model, a pronunciation dictionary, or transcribed speech data for the target language Slovene – only untranscribed speech and translations to other resource-rich source languages of what was said are available. Therefore, our approach is highly relevant for under-resourced and non-written languages. First, we borrow acoustic models from a strongly related language (Croatian) and apply a Croatian phoneme recognizer to the Slovene speech. Second, we segment the recognized phoneme strings into word units using cross-lingual word-to-phoneme alignment. Third, we compensate for phoneme recognition and alignment errors in the segmented phoneme sequences and aggregate the resulting phoneme sequence segments in a pronunciation dictionary for Slovene. Orthographic representations are generated using a Croatian phoneme-to-grapheme model. Finally, we use the resulting dictionary and the Croatian acoustic models to recognize Slovene. Our best recognizer achieves a Character Error Rate of 52% on the BMED corpus.
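
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Character Error Rate is commonly computed: the character-level Levenshtein distance between reference and hypothesis, divided by the reference length. The function names `edit_distance` and `character_error_rate` are illustrative assumptions, and the abstract does not specify the paper's scoring conventions (for example, whitespace handling or text normalization), so this should not be read as the authors' exact evaluation procedure.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions each cost 1)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j symbols of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Character-level edit distance normalized by reference length.

    Assumption: characters are compared as-is, with no normalization.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions are counted against the reference length, the value can exceed 1.0 when the hypothesis is much longer than the reference.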

Index Terms: pronunciation dictionary, non-written languages, word-to-phoneme alignment, language discovery, zero-resource


Cite as: Stahlberg, F., Schlippe, T., Vogel, S., Schultz, T. (2014) Towards automatic speech recognition without pronunciation dictionary, transcribed speech and text resources in the target language using cross-lingual word-to-phoneme alignment. Proc. 4th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2014), 73-80

@inproceedings{stahlberg14_sltu,
  author={Felix Stahlberg and Tim Schlippe and Stephan Vogel and Tanja Schultz},
  title={{Towards automatic speech recognition without pronunciation dictionary, transcribed speech and text resources in the target language using cross-lingual word-to-phoneme alignment}},
  year=2014,
  booktitle={Proc. 4th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2014)},
  pages={73--80}
}