Multimodal Fusion of Multirate Acoustic, Prosodic, and Lexical Speaker Characteristics for Native Language Identification

Prashanth Gurunath Shivakumar, Sandeep Nallan Chakravarthula, Panayiotis Georgiou


Native language identification from the acoustic signals of L2 speakers can be useful in a range of applications, such as informing automatic speech recognition (ASR), speaker recognition, and speech biometrics. In this paper we follow a multi-stream, multi-rate approach to native language identification across feature extraction, classification, and fusion. On the feature front, in addition to baseline features, we employ acoustic features such as MFCC and PLP at different time scales and under different transformations; evaluate speaker normalization both as a feature and as a transform; investigate phonemic confusability and its interplay with paralinguistic cues at both the frame-level and phone-level temporal scales; and automatically extract lexical features. On the classification side, we employ SVM, i-Vector, DNN and bottleneck features, and maximum-likelihood models. Finally, we employ fusion for system combination and analyze the complementarity of the individual systems. Our proposed system significantly outperforms the baseline system on both the development and test sets.


DOI: 10.21437/Interspeech.2016-1312

Cite as

Shivakumar, P.G., Chakravarthula, S.N., Georgiou, P. (2016) Multimodal Fusion of Multirate Acoustic, Prosodic, and Lexical Speaker Characteristics for Native Language Identification. Proc. Interspeech 2016, 2408-2412.

BibTeX
@inproceedings{Shivakumar+2016,
author={Prashanth Gurunath Shivakumar and Sandeep Nallan Chakravarthula and Panayiotis Georgiou},
title={Multimodal Fusion of Multirate Acoustic, Prosodic, and Lexical Speaker Characteristics for Native Language Identification},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1312},
url={http://dx.doi.org/10.21437/Interspeech.2016-1312},
pages={2408--2412}
}