The Seventh ISCA Tutorial and Research Workshop on Speech Synthesis
One of the issues in using audio books to build a synthetic voice is the segmentation of large audio files. Standard forced-alignment fails to obtain phone boundaries on large audio files, primarily because of its huge memory requirements. Earlier work has attempted to resolve this problem by using a large-vocabulary speech recognition system with a restricted dictionary and language model. In this work, we propose suitable modifications to the standard forced-alignment algorithm and demonstrate the usefulness of the modified algorithm for the segmentation of large audio files. Experimental results are provided on audio files, including an artificially created large audio file, and on the 17.5-hour EMMA speech corpus. Synthetic voices are also built using these large audio files.
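To illustrate why memory becomes the bottleneck, the following is a rough back-of-envelope sketch, not the method proposed in the paper. It assumes a textbook Viterbi alignment that stores one backpointer per (frame, state) cell. The function name and all default values (10 ms frame shift, 12 phones per second, 3 HMM states per phone, 4 bytes per cell) are illustrative assumptions, not figures from this work.

```python
def backpointer_table_bytes(audio_seconds,
                            phones_per_second=12.0,   # assumed speaking rate
                            states_per_phone=3,       # assumed HMM topology
                            frame_shift_ms=10.0,      # assumed frame shift
                            bytes_per_cell=4):        # assumed int32 backpointer
    """Estimate the size of a full frames-by-states backpointer table."""
    # Number of acoustic frames grows linearly with audio duration.
    frames = int(audio_seconds * 1000 / frame_shift_ms)
    # Number of HMM states in the transcript also grows linearly with duration.
    states = int(audio_seconds * phones_per_second * states_per_phone)
    # A naive table stores one cell for every (frame, state) pair.
    return frames * states * bytes_per_cell


if __name__ == "__main__":
    for label, seconds in [("1 minute", 60), ("1 hour", 3600)]:
        gib = backpointer_table_bytes(seconds) / 2**30
        print(f"{label}: about {gib:.2f} GiB")
```

Because both the frame count and the state count scale with duration, the table grows quadratically under these assumptions: a file 60 times longer needs 3600 times the memory.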
Index Terms: Large audio file, audio books, forced-alignment, text-to-speech
Bibliographic reference. Prahallad, Kishore / Black, Alan W. (2010): "Handling large audio files in audio books for building synthetic voices", In SSW7-2010, 148-153.