Enhanced Feature Extraction for Speech Detection in Media Audio

Inseon Jang, ChungHyun Ahn, Jeongil Seo, Younseon Jang


Speech detection is an important first step in audio analysis of media content; its goal is to discriminate speech from non-speech. It remains challenging owing to the variety of sound sources present in media audio. In this work, we present a novel audio feature extraction method that reflects the acoustic characteristics of media audio in the time-frequency domain. Since the mix of harmonic and percussive components varies with the type of sound source, decomposing the signal into these two components yields audio features that better distinguish speech from non-speech. For the evaluation, we use over 20 hours of drama that was manually annotated for speech detection, as well as 4 full-length movies, totaling over 8 hours, with annotations released for the research community. Experimental results with a deep neural network show superior performance of the proposed method under media audio conditions.
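
The abstract does not say which decomposition algorithm is used, so the following is only an illustrative sketch of one common approach, median-filtering harmonic-percussive separation, and not necessarily the authors' method. It assumes a magnitude spectrogram is already available as a list of rows (frequency bins) by columns (time frames). The function names `median_filter_1d` and `hpss`, the kernel size, and the toy spectrogram are all hypothetical choices made for illustration.

```python
from statistics import median


def median_filter_1d(values, kernel):
    """Median-filter a sequence, shrinking the window at the edges."""
    half = kernel // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def hpss(spec, kernel=5, eps=1e-10):
    """Split a magnitude spectrogram into harmonic and percussive parts.

    spec: list of rows (frequency bins), each a list of frame magnitudes.
    Returns (H, P) with the same shape; H + P reproduces spec up to
    floating-point rounding.
    """
    n_bins, n_frames = len(spec), len(spec[0])
    # Harmonic energy is stable over time: filter along each frequency row.
    harm = [median_filter_1d(row, kernel) for row in spec]
    # Percussive energy is broad in frequency: filter along each time column.
    cols = [median_filter_1d([spec[f][t] for f in range(n_bins)], kernel)
            for t in range(n_frames)]
    H, P = [], []
    for f in range(n_bins):
        h_row, p_row = [], []
        for t in range(n_frames):
            h2 = harm[f][t] ** 2
            p2 = cols[t][f] ** 2
            # Soft (Wiener-like) mask; eps avoids division by zero.
            mask_h = h2 / (h2 + p2 + eps)
            h_row.append(spec[f][t] * mask_h)
            p_row.append(spec[f][t] * (1.0 - mask_h))
        H.append(h_row)
        P.append(p_row)
    return H, P


# Toy spectrogram (assumed data): a steady tone in bin 2 and a click at frame 6.
spec = [[0.0] * 9 for _ in range(9)]
for t in range(9):
    spec[2][t] = 1.0
for f in range(9):
    spec[f][6] = 1.0

H, P = hpss(spec)
```

In this toy case the horizontal line (the tone) ends up mostly in `H` and the vertical line (the click) mostly in `P`. Features computed separately from the two parts could then be passed to a classifier, which is the general idea the abstract describes.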


 DOI: 10.21437/Interspeech.2017-792

Cite as: Jang, I., Ahn, C., Seo, J., Jang, Y. (2017) Enhanced Feature Extraction for Speech Detection in Media Audio. Proc. Interspeech 2017, 479-483, DOI: 10.21437/Interspeech.2017-792.


@inproceedings{Jang2017,
  author={Inseon Jang and ChungHyun Ahn and Jeongil Seo and Younseon Jang},
  title={Enhanced Feature Extraction for Speech Detection in Media Audio},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={479--483},
  doi={10.21437/Interspeech.2017-792},
  url={http://dx.doi.org/10.21437/Interspeech.2017-792}
}