INTERSPEECH 2007
8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Text Island Spotting in Large Speech Databases

B. Lecouteux (1), Georges Linarès (1), Frédéric Beaugendre (2), Pascal Nocera (1)

(1) LIA, France
(2) Voice-Insight, Belgium

This paper addresses the problem of using journalist prompts or closed captions to build corpora for training speech recognition systems. Such text documents are generally imperfect transcripts that lack timestamps. We propose a method that combines a driven decoding algorithm with a fast-match process in order to spot text segments. The method is evaluated both on the French ESTER ([1]) corpus and on a large database composed of recordings from the Radio Television Belge Francophone (RTBF) associated with real prompts. Results show very good spotting performance: we observed an F-measure of about 98% when spotting the real text islands provided by the RTBF corpus. Moreover, decoding driven by the imperfect transcript islands significantly outperforms the baseline system.
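For readers unfamiliar with the metric, the following minimal sketch shows how an F-measure can be computed from spotting counts. It assumes the standard balanced F-measure (the harmonic mean of precision and recall); the abstract does not state which variant the authors used, and the function name and the counts below are hypothetical, not taken from the paper.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: this is the standard definition, not necessarily the exact
    variant used in the paper.
    """
    if true_positives == 0:
        # No correct detections: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper):
# 90 text islands spotted correctly, 10 false alarms, 5 islands missed.
score = f_measure(true_positives=90, false_positives=10, false_negatives=5)
print(f"F-measure: {score:.3f}")
```

A single balanced score is convenient here because a spotter can trade false alarms against misses; the harmonic mean penalizes a system that is strong on only one of the two.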


Bibliographic reference.  Lecouteux, B. / Linarès, Georges / Beaugendre, Frédéric / Nocera, Pascal (2007): "Text island spotting in large speech databases", In INTERSPEECH-2007, 1318-1321.