Discrimination between shouted and normal speech is crucial in audio surveillance and monitoring. Although recent methods employ deep neural networks, they still rely on traditional low-level speech features such as mel-frequency cepstral coefficients and the mel spectrum. This paper presents a deep spectral-cepstral fusion approach that learns descriptive features for target classification from high-dimensional spectrograms and cepstrograms. We compare three types of base networks: convolutional neural networks (CNNs), gated recurrent unit (GRU) networks, and their combination (CNN-GRU). Using a corpus comprising real shouts and normal speech, we present a comprehensive comparison with conventional methods to verify the effectiveness of the proposed feature learning method. Experiments conducted in various noisy environments demonstrate that the CNN-GRU based on our spectral-cepstral features achieves better classification performance than networks trained on a single feature type. This finding suggests the effectiveness of using high-dimensional sources for speech-type recognition in sound event detection.
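The following is a minimal sketch of what a two-branch CNN-GRU fusion model of this kind could look like; it is not the paper's actual configuration. All layer sizes, input dimensions, and the concatenation-based fusion point are illustrative assumptions, and the class and variable names are hypothetical.

# Minimal sketch (PyTorch) of a spectral-cepstral CNN-GRU fusion classifier.
# NOT the paper's exact architecture: layer sizes, input shapes, and the fusion
# strategy (concatenation of branch embeddings) are illustrative assumptions.
import torch
import torch.nn as nn


class CNNGRUBranch(nn.Module):
    """CNN frontend followed by a GRU over the time axis for one input representation."""

    def __init__(self, n_bins: int, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),          # pool along the frequency/quefrency axis only
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(input_size=32 * (n_bins // 4), hidden_size=hidden, batch_first=True)

    def forward(self, x):                  # x: (batch, 1, n_bins, n_frames)
        h = self.cnn(x)                    # (batch, 32, n_bins // 4, n_frames)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (batch, n_frames, features)
        _, last = self.gru(h)              # last hidden state: (1, batch, hidden)
        return last.squeeze(0)             # (batch, hidden)


class SpectralCepstralFusion(nn.Module):
    """Two CNN-GRU branches (spectrogram + cepstrogram) fused by concatenation."""

    def __init__(self, spec_bins: int = 128, cep_bins: int = 64, hidden: int = 64):
        super().__init__()
        self.spec_branch = CNNGRUBranch(spec_bins, hidden)
        self.cep_branch = CNNGRUBranch(cep_bins, hidden)
        self.classifier = nn.Linear(2 * hidden, 2)   # two classes: shouted vs. normal speech

    def forward(self, spec, cep):
        z = torch.cat([self.spec_branch(spec), self.cep_branch(cep)], dim=-1)
        return self.classifier(z)


if __name__ == "__main__":
    model = SpectralCepstralFusion()
    spec = torch.randn(4, 1, 128, 100)     # dummy spectrogram batch: 128 bins x 100 frames
    cep = torch.randn(4, 1, 64, 100)       # dummy cepstrogram batch: 64 quefrency bins
    print(model(spec, cep).shape)          # torch.Size([4, 2])

In this sketch, each high-dimensional representation is processed by its own CNN-GRU branch and the resulting utterance-level embeddings are concatenated before the final linear classifier; a single-feature baseline would simply use one branch.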
Cite as: Fukumori, T. (2021) Deep Spectral-Cepstral Fusion for Shouted and Normal Speech Classification. Proc. Interspeech 2021, 4174-4178, doi: 10.21437/Interspeech.2021-1245
@inproceedings{fukumori21_interspeech,
  author={Takahiro Fukumori},
  title={{Deep Spectral-Cepstral Fusion for Shouted and Normal Speech Classification}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={4174--4178},
  doi={10.21437/Interspeech.2021-1245}
}