Odyssey 2012 - The Speaker and Language Recognition Workshop

Singapore
June 25-28, 2012

Audio Context Recognition in Variable Mobile Environments from Short Segments Using Speaker and Language Recognizers

Tomi Kinnunen (1), Rahim Saeidi (2), Jussi Leppänen (3), Jukka P. Saarinen (3)

(1) School of Computing, University of Eastern Finland (UEF), Joensuu, Finland
(2) Centre for Language and Speech Technology, Radboud University Nijmegen, the Netherlands
(3) Nokia Research Center (NRC), Tampere, Finland

We consider the problem of context recognition from mobile audio data, covering ten audio contexts (such as car, bus, office and outdoors) prevalent in daily life situations. We choose mel-frequency cepstral coefficient (MFCC) parametrization and present an extensive comparison of six different classifiers: k-nearest neighbor (kNN), vector quantization (VQ), Gaussian mixture model trained with both maximum likelihood (GMM-ML) and maximum mutual information (GMM-MMI) criteria, GMM supervector support vector machine (GMM-SVM) and, finally, SVM with generalized linear discriminant sequence (GLDS-SVM). After all parameter optimizations, the GMM-MMI and VQ classifiers perform best, with 52.01 % and 50.34 % context identification rates, respectively, using 3-second data records. Our analysis further reveals that no single classifier is uniformly superior when class-, user- or phone-specific accuracies are considered.
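To illustrate the GMM-ML classification scheme mentioned in the abstract, the following is a minimal sketch, not the authors' implementation: one Gaussian mixture model is trained per context on MFCC feature vectors via maximum-likelihood (EM) estimation, and a test segment is assigned to the context whose model yields the highest average log-likelihood. Synthetic random vectors stand in for real MFCC frames; the context names, dimensions and mixture sizes are illustrative assumptions.

```python
# Hedged sketch of GMM-ML context classification (not the authors' code).
# Synthetic vectors stand in for MFCC frames; all names/sizes are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for MFCC frames from two contexts
train = {
    "car": rng.normal(0.0, 1.0, size=(500, 12)),
    "office": rng.normal(2.0, 1.0, size=(500, 12)),
}

# One GMM per context, trained with maximum likelihood (EM)
models = {name: GaussianMixture(n_components=4, random_state=0).fit(X)
          for name, X in train.items()}

def classify(segment):
    """Assign the context whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda name: models[name].score(segment))

# A short test segment of "office"-like frames (roughly a 3-second record)
test_segment = rng.normal(2.0, 1.0, size=(300, 12))
print(classify(test_segment))
```

In the paper's setting, each 3-second record contributes a sequence of MFCC frames that is scored against all ten context models in this fashion.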


Bibliographic reference. Kinnunen, Tomi / Saeidi, Rahim / Leppänen, Jussi / Saarinen, Jukka P. (2012): "Audio context recognition in variable mobile environments from short segments using speaker and language recognizers", in Odyssey 2012, 304-311.