Multimodal Name Recognition in Live TV Subtitling

Marek Hrúz, Aleš Pražák, Michal Bušta


In this paper, we present a method that combines a visual text reader with an automatic speech recognition system to suppress errors caused by out-of-vocabulary words, specifically names. The visual text reader outputs detected words that are mapped onto a large list of names via the Levenshtein distance. The detected names are inserted into a class-based language model on the fly, which improves recognition results. To demonstrate the effect on a real speech recognition task, we use data from sports TV broadcasts, where many names appear in both the audio and video streams. We replace manual vocabulary editing in live TV subtitling through respeaking with an automated online process. Further, we show that adding the names to the recognition vocabulary automatically, online and with forgetting, lowers the WER by 39% relative compared with adding the names of all sportsmen to the vocabulary beforehand, and by 15% relative compared with adding only the relevant names beforehand.
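The two mechanisms named in the abstract, mapping OCR-detected words onto a name list via the Levenshtein distance and maintaining an online vocabulary with forgetting, can be sketched as follows. This is a minimal illustration under our own assumptions: the function names, the edit-distance threshold, and the time-based forgetting rule are ours, not the paper's.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def match_name(ocr_word: str, name_list, max_dist: int = 2):
    """Map an OCR-detected word onto the closest entry of a large name
    list; return None when no name is within max_dist edits (the
    threshold value here is an illustrative choice)."""
    best, best_d = None, max_dist + 1
    for name in name_list:
        d = levenshtein(ocr_word.lower(), name.lower())
        if d < best_d:
            best, best_d = name, d
    return best


class DynamicNameVocabulary:
    """Toy model of the online vocabulary with forgetting: names seen in
    the video stream are added on the fly and dropped once they have not
    been observed for max_age time steps. The actual insertion into the
    class-based language model is not modeled here."""

    def __init__(self, max_age: int = 300):
        self.max_age = max_age
        self.last_seen = {}  # name -> time step of last observation

    def observe(self, name: str, t: int) -> None:
        self.last_seen[name] = t

    def active(self, t: int):
        # Forget names not observed within the last max_age steps.
        self.last_seen = {n: ts for n, ts in self.last_seen.items()
                          if t - ts <= self.max_age}
        return set(self.last_seen)
```

In this sketch, a name class in the recognizer's language model would be repopulated at each time step from `active(t)`, so the recognition vocabulary tracks the names currently visible on screen.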


 DOI: 10.21437/Interspeech.2018-1748

Cite as: Hrúz, M., Pražák, A., Bušta, M. (2018) Multimodal Name Recognition in Live TV Subtitling. Proc. Interspeech 2018, 3529-3532, DOI: 10.21437/Interspeech.2018-1748.


@inproceedings{Hrúz2018,
  author={Marek Hrúz and Aleš Pražák and Michal Bušta},
  title={Multimodal Name Recognition in Live TV Subtitling},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3529--3532},
  doi={10.21437/Interspeech.2018-1748},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1748}
}