15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

A Comparison of Open-Source Segmentation Architectures for Dealing with Imperfect Data from the Media in Speech Synthesis

A. Gallardo-Antolín (1), J. M. Montero (2), Simon King (3)

(1) Universidad Carlos III de Madrid, Spain
(2) Universidad Politécnica de Madrid, Spain
(3) University of Edinburgh, UK

Traditional text-to-speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. To develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices with a whole new range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music, and noise) by filtering out low-quality audio segments and creating mono-speaker clusters. In this paper we compare several architectures that combine speaker diarization with music and noise detection to improve the precision and overall quality of the segmentation.

Bibliographic reference. Gallardo-Antolín, A. / Montero, J. M. / King, Simon (2014): "A comparison of open-source segmentation architectures for dealing with imperfect data from the media in speech synthesis", in INTERSPEECH-2014, 2370-2374.