Speech detection is an important first step in audio analysis of media content; its goal is to discriminate speech from non-speech. It remains a challenge owing to the variety of sound sources present in media audio. In this work, we present a novel audio feature extraction method that reflects the acoustic characteristics of media audio in the time-frequency domain. Since the mix of harmonic and percussive components varies with the type of sound source, decomposing the signal into these two components yields audio features that better distinguish speech from non-speech. For the evaluation, we use over 20 hours of drama manually annotated for speech detection, as well as 4 full-length movies, over 8 hours in total, with annotations released to the research community. Experimental results with a deep neural network show the superior performance of the proposed method on media audio.
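The abstract does not say which algorithm performs the harmonic/percussive decomposition. As an illustration only, the sketch below assumes median-filtering harmonic-percussive separation, a common technique that may differ from the authors' method. It works on a magnitude spectrogram stored as nested lists `spec[freq][time]`; the function names `median_filter_1d` and `hpss_masks`, the window width, and the soft-mask formula are all hypothetical choices made for this example.

```python
from statistics import median


def median_filter_1d(values, width):
    """Median-filter a sequence; the window shrinks at the edges."""
    half = width // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def hpss_masks(spec, width=3, eps=1e-10):
    """Split a magnitude spectrogram spec[freq][time] into two parts.

    Assumed approach (not taken from the paper): median-filtering HPSS
    with soft masks. Returns (harmonic, percussive), shaped like spec.
    """
    n_freq, n_time = len(spec), len(spec[0])

    # Harmonic energy is smooth along time: filter each frequency row.
    h_enh = [median_filter_1d(row, width) for row in spec]

    # Percussive energy is smooth along frequency: filter each time column.
    cols = [
        median_filter_1d([spec[f][t] for f in range(n_freq)], width)
        for t in range(n_time)
    ]
    p_enh = [[cols[t][f] for t in range(n_time)] for f in range(n_freq)]

    harmonic, percussive = [], []
    for f in range(n_freq):
        h_row, p_row = [], []
        for t in range(n_time):
            h2, p2 = h_enh[f][t] ** 2, p_enh[f][t] ** 2
            mask_h = h2 / (h2 + p2 + eps)  # soft mask in [0, 1)
            h_row.append(mask_h * spec[f][t])
            p_row.append((1.0 - mask_h) * spec[f][t])
        harmonic.append(h_row)
        percussive.append(p_row)
    return harmonic, percussive


# Toy input: a sustained tone (row 2) crossed by a click (column 2).
toy = [[1.0 if f == 2 or t == 2 else 0.0 for t in range(5)] for f in range(5)]
toy_harmonic, toy_percussive = hpss_masks(toy)
```

In this toy case the sustained tone ends up mostly in `toy_harmonic` and the click mostly in `toy_percussive`. Features computed separately from the two outputs could then be fed to a classifier, which is the general idea the abstract describes.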
Cite as: Jang, I., Ahn, C., Seo, J., Jang, Y. (2017) Enhanced Feature Extraction for Speech Detection in Media Audio. Proc. Interspeech 2017, 479-483, doi: 10.21437/Interspeech.2017-792
@inproceedings{jang17_interspeech,
  author={Inseon Jang and ChungHyun Ahn and Jeongil Seo and Younseon Jang},
  title={{Enhanced Feature Extraction for Speech Detection in Media Audio}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={479--483},
  doi={10.21437/Interspeech.2017-792}
}