Traditional noise reduction systems modify a noisy signal to make it more like the original clean signal. For speech, these methods suffer from two main problems: under-suppression of noise and over-suppression of the target speech. Instead, synthesizing clean speech based on the noisy signal can produce outputs that are both noise-free and high quality. Our previous work introduced such a system using concatenative synthesis, but it required processing the clean speech at run time, which was slow and not scalable. To make such a system scalable, we propose here learning a similarity metric using two separate networks: one network processing the clean segments offline and another processing the noisy segments at run time. This system incorporates a ranking loss to optimize for the retrieval of appropriate clean speech segments. We compare this model against our original system on the CHiME2-GRID corpus, measuring both ranking performance and subjective listening quality of the resyntheses.
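The twin-network retrieval idea described above can be sketched as follows. This is an illustrative PyTorch sketch, not the paper's exact architecture: the encoder layer sizes, embedding dimension, margin value, and dot-product similarity are all assumptions made for the example, and the class and function names (SegmentEncoder, ranking_loss) are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """Maps a flattened spectrogram segment to a fixed-size embedding.
    (Layer sizes here are placeholders, not the paper's configuration.)"""
    def __init__(self, in_dim, emb_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, x):
        # Unit-length embeddings so the dot product acts as a similarity.
        return F.normalize(self.net(x), dim=-1)

# Twin networks: the clean encoder runs offline over the dictionary,
# the noisy encoder runs at run time over the input mixture.
clean_enc = SegmentEncoder(in_dim=1000)
noisy_enc = SegmentEncoder(in_dim=1000)

def ranking_loss(noisy, clean_pos, clean_neg, margin=0.5):
    """Margin ranking loss: the matching clean segment should score
    higher than a non-matching one by at least `margin`."""
    e_n = noisy_enc(noisy)
    s_pos = (e_n * clean_enc(clean_pos)).sum(-1)
    s_neg = (e_n * clean_enc(clean_neg)).sum(-1)
    return F.relu(margin - s_pos + s_neg).mean()

Because the two networks are separate, all clean-dictionary embeddings can be precomputed once offline; at run time each noisy segment is embedded once and the best clean segment is retrieved with a single matrix product, which is what makes the approach scalable compared to scoring every clean-noisy pair jointly.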
Cite as: Maiti, S., Mandel, M.I. (2017) Concatenative Resynthesis Using Twin Networks. Proc. Interspeech 2017, 3647-3651, doi: 10.21437/Interspeech.2017-1653
@inproceedings{maiti17_interspeech,
  author={Soumi Maiti and Michael I. Mandel},
  title={{Concatenative Resynthesis Using Twin Networks}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3647--3651},
  doi={10.21437/Interspeech.2017-1653}
}