This paper presents an effective technique for training an acoustic model from large, unsynchronized audio and text chunks. Given such a speech corpus, an algorithm is defined that automatically segments each chunk into smaller fragments and aligns them with the corresponding text. These smaller fragments are better suited to standard model-training algorithms for use in automatic speech recognition systems. The proposed approach is particularly suitable for bootstrapping acoustic models without relying on specialized training material or borrowing from models trained for other, similar languages. Extensive experimentation with the CMU Sphinx 4 recognizer and the SphinxTrain model generator, in a setting designed for large-vocabulary continuous speech recognition, shows the effectiveness of the approach.
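The abstract does not detail the segmentation and synchronization algorithm. As a rough illustrative sketch only (not the authors' method), one common lightly-supervised strategy runs a recognizer over the long audio, anchors the resulting word hypothesis to the reference text at long exact matches, and cuts the text into fragments at those anchors. The helper names `split_at_anchors` and `fragments` below are hypothetical:

```python
from difflib import SequenceMatcher

def split_at_anchors(hyp_words, ref_words, min_anchor=3):
    """Find runs of words shared by the recognizer hypothesis and the
    reference text (anchors), and return cut points in the reference
    text at the middle of each sufficiently long anchor."""
    sm = SequenceMatcher(a=hyp_words, b=ref_words, autojunk=False)
    cuts = []
    for block in sm.get_matching_blocks():
        if block.size >= min_anchor:
            # cut inside the anchor, where the alignment is most reliable
            cuts.append(block.b + block.size // 2)
    return cuts

def fragments(ref_words, cuts):
    """Split the reference word list at the given cut points."""
    bounds = [0] + cuts + [len(ref_words)]
    return [ref_words[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]
```

In a real system, each cut point would also carry the time stamp of the corresponding hypothesized word, so the audio can be chopped at the same anchors; here only the text side of the alignment is sketched.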
Bibliographic reference. Alessandrini, Michele / Biagetti, Giorgio / Curzi, Alessandro / Turchetti, Claudio (2011): "Semi-automatic acoustic model generation from large unsynchronized audio and text chunks", In INTERSPEECH-2011, 1681-1684.