Speech Augmentation via Speaker-Specific Noise in Unseen Environment

Ya’nan Guo, Ziping Zhao, Yide Ma, Björn W. Schuller


Speech augmentation is a common and effective strategy to avoid overfitting and improve the robustness of an emotion recognition model. In this paper, we investigate for the first time the intrinsic attributes of a speech signal using multi-resolution analysis theory and the Hilbert-Huang spectrum, with the goal of developing a robust speech augmentation approach from raw speech data. Specifically, the speech signal is first decomposed in the dual-tree complex wavelet transform domain to obtain sub-band signals; then, the Hilbert spectrum is computed via the Hilbert-Huang Transform for each sub-band to capture the noise content in unseen environments, with the voice band restricted to 100–4000 Hz; finally, the speaker-specific noise, which varies with the individual speaker, scenario, environment, and recording equipment, is reconstructed from the two highest-frequency sub-bands and added to the raw signal. The proposed speech augmentation is demonstrated with five robust machine learning architectures on the RAVDESS database, achieving up to 9.3% higher accuracy on an emotion recognition task compared to training on raw data.
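
As a rough sketch of this pipeline, assuming the open-source dtcwt Python package for the dual-tree complex wavelet transform and SciPy's Hilbert transform, the code below illustrates the idea. The function name, the number of decomposition levels, and the simplified instantaneous-frequency masking (standing in for the paper's full Hilbert-Huang analysis) are our assumptions, not the authors' implementation.

    import numpy as np
    import dtcwt
    from scipy.signal import hilbert

    def speaker_noise_augment(signal, fs=16000, nlevels=5, fmin=100.0, fmax=4000.0):
        """Augment a raw waveform with speaker-specific noise reconstructed
        from the two finest DTCWT sub-bands (illustrative sketch only)."""
        # The 1-D DTCWT expects an even-length input; trim one sample if needed.
        signal = np.asarray(signal, dtype=float)
        if len(signal) % 2:
            signal = signal[:-1]

        transform = dtcwt.Transform1d()
        # pyramid.highpasses[0] is the finest (highest-frequency) sub-band.
        pyramid = transform.forward(signal, nlevels=nlevels)

        # Keep only the two highest-frequency sub-bands; zero the rest and the
        # lowpass residue, then invert to reconstruct the noise component.
        kept = tuple(hp if i < 2 else np.zeros_like(hp)
                     for i, hp in enumerate(pyramid.highpasses))
        noise = transform.inverse(
            dtcwt.Pyramid(np.zeros_like(pyramid.lowpass), kept)).ravel()

        # Simplified stand-in for the Hilbert-spectrum step: zero samples whose
        # instantaneous frequency falls outside the 100-4000 Hz voice band.
        phase = np.unwrap(np.angle(hilbert(noise)))
        inst_freq = np.abs(np.diff(phase, prepend=phase[0])) * fs / (2.0 * np.pi)
        noise[(inst_freq < fmin) | (inst_freq > fmax)] = 0.0

        # Augmented signal: raw speech plus its own high-frequency residue.
        return signal + noise

In practice one would call speaker_noise_augment on each training utterance to enlarge the training set; the augmented copies carry the speaker- and channel-dependent high-frequency content back into the waveform rather than injecting generic additive noise.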


DOI: 10.21437/Interspeech.2019-2712

Cite as: Guo, Y., Zhao, Z., Ma, Y., Schuller, B.W. (2019) Speech Augmentation via Speaker-Specific Noise in Unseen Environment. Proc. Interspeech 2019, 1781-1785, DOI: 10.21437/Interspeech.2019-2712.


@inproceedings{Guo2019,
  author={Ya’nan Guo and Ziping Zhao and Yide Ma and Björn W. Schuller},
  title={{Speech Augmentation via Speaker-Specific Noise in Unseen Environment}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1781--1785},
  doi={10.21437/Interspeech.2019-2712},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2712}
}