This paper addresses the problem of using journalist prompts or closed captions to build corpora for training speech recognition systems. Generally, these text documents are imperfect transcripts that lack timestamps. We propose a method that combines a driven decoding algorithm with a fast-match process in order to spot text segments. The method is evaluated both on the French ESTER corpus and on a large database composed of recordings from the Radio Television Belge Francophone (RTBF) associated with real prompts. Results show very good spotting performance: we observed an F-measure of about 98% on spotting the real text islands provided by the RTBF corpus. Moreover, decoding driven by the imperfect transcript islands significantly outperforms the baseline system.
Bibliographic reference. Lecouteux, B. / Linarès, Georges / Beaugendre, Frédéric / Nocera, Pascal (2007): "Text island spotting in large speech databases", In INTERSPEECH-2007, 1318-1321.