Data Independent Sequence Augmentation Method for Acoustic Scene Classification

Zhang Teng, Kailai Zhang, Ji Wu


Augmenting datasets by transforming inputs, for example with vocal tract length perturbation (VTLP), is a crucial ingredient of state-of-the-art methods for speech recognition tasks. In contrast to speech, sounds recorded in realistic environments exhibit no speaker-to-speaker variation, so VTLP is not applicable to acoustic scene classification. This paper investigates a novel sequence augmentation method for long short-term memory (LSTM) acoustic modeling that addresses data sparsity in acoustic scene classification tasks. Audio sequences are randomly rearranged and concatenated during training, while at test time predictions are made on the original audio sequences. The rearrangement is designed to suit the long- and short-term dependencies captured by LSTM models. Experiments on acoustic scene classification show performance improvements from the proposed method: classification errors on the LITIS ROUEN and DCASE2016 datasets are reduced by 18.1% and 6.4% relative, respectively.
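The core idea of the abstract — rearranging and concatenating segments of an audio feature sequence at training time while leaving test inputs untouched — can be sketched as follows. This is only an illustration of the shuffle-and-concatenate principle; the paper's actual rearrangement scheme is more carefully designed for LSTM dependencies, and the function name, segment count, and frame shapes here are assumptions, not the authors' implementation.

```python
import numpy as np

def augment_sequence(frames, n_segments=4, rng=None):
    """Illustrative sketch (not the paper's exact scheme): split a
    feature-frame sequence of shape (T, D) into contiguous segments,
    shuffle the segment order, and concatenate. Applied only during
    training; evaluation uses the original, unshuffled sequence."""
    rng = np.random.default_rng() if rng is None else rng
    segments = np.array_split(frames, n_segments)   # list of (T_i, D) chunks
    order = rng.permutation(len(segments))          # random segment order
    return np.concatenate([segments[i] for i in order])

# Training-time usage: each epoch can draw a fresh rearrangement,
# effectively multiplying the number of distinct training sequences.
frames = np.random.randn(100, 40)                   # 100 frames, 40-dim features
augmented = augment_sequence(frames, n_segments=5)
```

Because only the segment order changes, the augmented sequence keeps the same frames, length, and feature statistics as the original, which is why the model can still be evaluated on unmodified test sequences.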


 DOI: 10.21437/Interspeech.2018-1250

Cite as: Teng, Z., Zhang, K., Wu, J. (2018) Data Independent Sequence Augmentation Method for Acoustic Scene Classification. Proc. Interspeech 2018, 3289-3293, DOI: 10.21437/Interspeech.2018-1250.


@inproceedings{Teng2018,
  author={Zhang Teng and Kailai Zhang and Ji Wu},
  title={Data Independent Sequence Augmentation Method for Acoustic Scene Classification},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3289--3293},
  doi={10.21437/Interspeech.2018-1250},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1250}
}