Concatenative Resynthesis Using Twin Networks

Soumi Maiti, Michael I. Mandel


Traditional noise reduction systems modify a noisy signal to make it more like the original clean signal. For speech, these methods suffer from two main problems: under-suppression of noise and over-suppression of target speech. Instead, synthesizing clean speech based on the noisy signal could produce outputs that are both noise-free and high quality. Our previous work introduced such a system using concatenative synthesis, but it required processing the clean speech at run time, which was slow and not scalable. To make such a system scalable, we propose learning a similarity metric using two separate networks, one processing the clean segments offline and the other processing the noisy segments at run time. This system incorporates a ranking loss to optimize for the retrieval of appropriate clean speech segments. We compare this model against our original system on the CHiME2-GRID corpus, evaluating ranking performance and, through subjective listening tests, the quality of the resyntheses.
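
The split between offline and run-time processing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear `embed` function stands in for each network, the dot-product similarity and the hinge form of the ranking loss are assumptions, and all names (`build_index`, `retrieve`, `ranking_loss`) are hypothetical.

```python
def embed(weights, features):
    """Linear embedding, used here as a stand-in for a trained network (assumption)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def similarity(a, b):
    """Dot-product similarity between two embeddings (assumed metric)."""
    return sum(x * y for x, y in zip(a, b))


def ranking_loss(noisy_emb, pos_emb, neg_emb, margin=1.0):
    """Hinge ranking loss (assumed form): the matching clean segment should
    outscore a non-matching one by at least `margin`."""
    gap = similarity(noisy_emb, pos_emb) - similarity(noisy_emb, neg_emb)
    return max(0.0, margin - gap)


def build_index(clean_weights, clean_segments):
    """Offline step: embed every clean segment once and keep the embeddings."""
    return [embed(clean_weights, seg) for seg in clean_segments]


def retrieve(noisy_weights, index, noisy_segment):
    """Run-time step: embed only the noisy segment, then return the position
    of the stored clean embedding with the highest similarity."""
    query = embed(noisy_weights, noisy_segment)
    scores = [similarity(query, emb) for emb in index]
    return max(range(len(index)), key=scores.__getitem__)


# Toy data: identity weights and three 3-dimensional "segments" (illustrative only).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
clean_segments = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

index = build_index(identity, clean_segments)        # computed once, offline
best = retrieve(identity, index, [0.1, 0.9, 0.2])    # computed per noisy segment
```

In this sketch the clean-side network never runs at query time; only the noisy segment is embedded and compared against the stored index, which is the property the abstract identifies as making the approach scalable.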


DOI: 10.21437/Interspeech.2017-1653

Cite as: Maiti, S., Mandel, M.I. (2017) Concatenative Resynthesis Using Twin Networks. Proc. Interspeech 2017, 3647-3651, DOI: 10.21437/Interspeech.2017-1653.


@inproceedings{Maiti2017,
  author={Soumi Maiti and Michael I. Mandel},
  title={Concatenative Resynthesis Using Twin Networks},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3647--3651},
  doi={10.21437/Interspeech.2017-1653},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1653}
}