Using the Bag-of-Audio-Word Feature Representation of ASR DNN Posteriors for Paralinguistic Classification

Gábor Gosztolya


The Bag-of-Audio-Word (BoAW) representation is an utterance-level feature representation that has been applied successfully to various computational paralinguistic tasks. Here, we extend the BoAW feature extraction process with Deep Neural Networks (DNNs): first, we train a DNN acoustic model for phoneme identification on an acoustic dataset consisting of 22 hours of speech; then we evaluate this DNN on a standard paralinguistic dataset. To construct utterance-level features from the frame-level posterior vectors, we calculate their BoAW representation. We found that this approach can be used even on its own, although its results lag behind those of the standard paralinguistic approach and the optimal size of the extracted feature vectors tends to be large. Our approach, however, can be easily and efficiently combined with the standard paralinguistic one, resulting in the highest Unweighted Average Recall (UAR) score achieved so far for a general paralinguistic dataset.
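
The step from frame-level posterior vectors to one utterance-level BoAW vector can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the codebook is built or how the histogram is normalized, so random sampling of codewords, Euclidean nearest-codeword assignment, and normalization to relative frequencies are all assumptions. The function names `build_codebook` and `boaw` are hypothetical, and random Dirichlet vectors stand in for real DNN posteriors.

```python
import numpy as np

def build_codebook(frames, size, seed=0):
    """Pick `size` frames at random as codewords (an assumed strategy)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(frames), size=size, replace=False)
    return frames[idx]

def boaw(posteriors, codebook):
    """Map a (n_frames, n_classes) posterior matrix to one histogram vector."""
    # Squared Euclidean distance between every frame and every codeword.
    diff = posteriors[:, None, :] - codebook[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Each frame votes for its closest codeword.
    nearest = dist.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(codebook)).astype(float)
    # Normalize so utterances of different lengths are comparable (assumed).
    return counts / counts.sum()

# Toy stand-ins for DNN posteriors: each row is non-negative and sums to 1.
rng = np.random.default_rng(42)
train_frames = rng.dirichlet(np.ones(5), size=200)  # frames pooled over training utterances
utterance = rng.dirichlet(np.ones(5), size=37)      # one utterance with 37 frames

codebook = build_codebook(train_frames, size=8)
features = boaw(utterance, codebook)                # fixed length: one value per codeword
```

The resulting vector has one entry per codeword regardless of how many frames the utterance contains, which is what makes it usable as input to an utterance-level classifier.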


DOI: 10.21437/Interspeech.2019-1163

Cite as: Gosztolya, G. (2019) Using the Bag-of-Audio-Word Feature Representation of ASR DNN Posteriors for Paralinguistic Classification. Proc. Interspeech 2019, 3940-3944, DOI: 10.21437/Interspeech.2019-1163.


@inproceedings{Gosztolya2019,
  author={Gábor Gosztolya},
  title={{Using the Bag-of-Audio-Word Feature Representation of ASR DNN Posteriors for Paralinguistic Classification}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3940--3944},
  doi={10.21437/Interspeech.2019-1163},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1163}
}