Traditional text-to-speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. To develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices with a whole new range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music, and noise) by filtering out low-quality audio segments and creating mono-speaker clusters. In this paper we compare several architectures for combining speaker diarization with music and noise detection that improve the precision and overall quality of the segmentation.
Bibliographic reference. Gallardo-Antolín, A. / Montero, J. M. / King, S. (2014): "A comparison of open-source segmentation architectures for dealing with imperfect data from the media in speech synthesis", in INTERSPEECH-2014, 2370-2374.