Using Approximated Auditory Roughness as a Pre-Filtering Feature for Human Screaming and Affective Speech AED

Di He, Zuofu Cheng, Mark Hasegawa-Johnson, Deming Chen


Detecting human screaming, shouting, and other verbal manifestations of fear and anger is of great interest to security Audio Event Detection (AED) systems. The Internet of Things (IoT) approach allows powerful, wide-coverage AED systems to be distributed across the Internet, but a good feature for pre-filtering the audio is critical to these systems. This work evaluates the potential of detecting screaming and affective speech using Auditory Roughness and proposes a very lightweight approximation method. Our approximation uses a similar number of Multiply Add Accumulate (MAA) operations to short-term energy (STE), and at least 10× fewer MAAs than MFCC. We evaluated the performance of our approximated roughness against other low-complexity features on the Mandarin Affective Speech corpus and a subset of the YouTube AudioSet for screaming, and show that our approximated roughness yields higher accuracy.
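For context, short-term energy (STE), the low-complexity baseline the abstract compares against, amounts to one multiply-accumulate per sample over each analysis frame. A minimal sketch is below; the frame length and hop size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def short_term_energy(signal, frame_len=256, hop=128):
    """Short-term energy per frame: the sum of squared samples.

    Each sample costs one multiply-accumulate, which is the kind of
    complexity baseline (MAA count) the abstract refers to.
    Frame length and hop are illustrative, not from the paper.
    """
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(float(np.sum(frame * frame)))
    return np.array(energies)

# Toy usage: a loud burst stands out against low-level noise,
# which is why STE works as a cheap pre-filtering feature.
rng = np.random.default_rng(0)
quiet = 0.01 * rng.standard_normal(1024)
loud = 0.5 * rng.standard_normal(1024)
ste = short_term_energy(np.concatenate([quiet, loud]))
```

Frames covering the loud burst have far higher energy than the quiet frames, so a simple threshold on STE can gate which audio segments are passed on for full classification.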


DOI: 10.21437/Interspeech.2017-593

Cite as: He, D., Cheng, Z., Hasegawa-Johnson, M., Chen, D. (2017) Using Approximated Auditory Roughness as a Pre-Filtering Feature for Human Screaming and Affective Speech AED. Proc. Interspeech 2017, 1914-1918, DOI: 10.21437/Interspeech.2017-593.


@inproceedings{He2017,
  author={Di He and Zuofu Cheng and Mark Hasegawa-Johnson and Deming Chen},
  title={Using Approximated Auditory Roughness as a Pre-Filtering Feature for Human Screaming and Affective Speech AED},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1914--1918},
  doi={10.21437/Interspeech.2017-593},
  url={http://dx.doi.org/10.21437/Interspeech.2017-593}
}