In order to incorporate long temporal-frequency structure for acoustic event detection, we have proposed a spectral patch based learning and representation method. The learned spectral patches were regarded as acoustic words which were further used in sparse encoding for acoustic feature representation and modeling. In our previous study, during feature encoding stage, each spectral patch was encoded independently. Considering that spectral patches taken from a time sequence should keep similar representations for neighboring patches after encoding, in this study, we propose to enhance the temporal correlation of feature representation using a temporal max-smoothing algorithm. The max-smoothing tries to pick up the maximum response in a local time window as the representative feature for detection task. We tested the new feature for automatic detection of acoustic events which were selected from lecture audio data. Experimental results showed that the temporal max-smoothing significantly improved the performance.
Bibliographic reference. Lu, Xugang / Shen, Peng / Tsao, Yu / Hori, Chiori / Kawai, Hisashi (2015): "Sparse representation with temporal max-smoothing for acoustic event detection", In INTERSPEECH-2015, 1176-1180.