Motivated by the speaker-specificity and stationarity of subglottal acoustics, this paper investigates the utility of subglottal cepstral coefficients (SGCCs) for speaker identification (SID) and verification (SV). SGCCs can be computed using accelerometer recordings of subglottal acoustics, but such an approach is infeasible in real-world scenarios. To estimate SGCCs from speech signals, we adopt the Bayesian minimum mean squared error (MMSE) estimator proposed in the speech-to-articulatory inversion literature. The joint distribution of SGCCs and speech MFCCs is modeled using the WashU-UCLA corpus (containing simultaneous recordings of speech and subglottal acoustics), and the resulting model is used to obtain an MMSE estimate of SGCCs from unseen (test) MFCCs. Cross-validation experiments on the WashU-UCLA corpus show that the estimation efficacy, on average, is speaker dependent. A score-level fusion of MFCC and SGCC systems outperforms the MFCC-only baseline in both SID and SV tasks. On the TIMIT database (SID), the relative reduction in identification error is 16, 40 and 51% for G.712-filtered (300–3400 Hz), narrowband (0–4000 Hz) and wideband (0–8000 Hz) speech, respectively. On the NIST 2008 database (SV), the relative reduction in equal error rate is 4 and 11% for 10 and 5 second utterances, respectively.
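The MMSE estimator described above is commonly realized with a joint Gaussian mixture model: a full-covariance GMM is trained on stacked [MFCC, SGCC] frames, and the conditional expectation of the SGCC block given an MFCC frame is a responsibility-weighted sum of per-component linear regressions. A minimal sketch of that construction is below; the function names, synthetic feature dimensions, and mixture size are illustrative and not taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(mfcc, sgcc, n_components=4, seed=0):
    """Fit a full-covariance GMM on stacked [MFCC, SGCC] frames."""
    joint = np.hstack([mfcc, sgcc])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=seed)
    gmm.fit(joint)
    return gmm

def mmse_estimate(gmm, mfcc, d_m):
    """MMSE estimate of SGCCs given MFCCs:
    E[s|m] = sum_k P(k|m) * (mu_s_k + S_sm_k S_mm_k^{-1} (m - mu_m_k)).
    d_m is the MFCC dimensionality (first block of the joint vector)."""
    mu_m = gmm.means_[:, :d_m]          # per-component MFCC means
    mu_s = gmm.means_[:, d_m:]          # per-component SGCC means
    S_mm = gmm.covariances_[:, :d_m, :d_m]   # MFCC-MFCC covariance blocks
    S_sm = gmm.covariances_[:, d_m:, :d_m]   # SGCC-MFCC cross-covariance blocks

    # Responsibilities P(k|m) from the marginal GMM over the MFCC block
    log_p = np.stack([multivariate_normal.logpdf(mfcc, mu_m[k], S_mm[k])
                      for k in range(gmm.n_components)], axis=1)
    log_p += np.log(gmm.weights_)
    resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    # Mix the per-component conditional means by the responsibilities
    est = np.zeros((mfcc.shape[0], mu_s.shape[1]))
    for k in range(gmm.n_components):
        A = S_sm[k] @ np.linalg.inv(S_mm[k])   # regression matrix for component k
        est += resp[:, [k]] * (mu_s[k] + (mfcc - mu_m[k]) @ A.T)
    return est
```

In a recognition pipeline of the kind the abstract describes, the estimated SGCCs would feed a separate classifier whose scores are then combined with the MFCC system's scores (e.g. by a weighted sum) at the fusion stage.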
Bibliographic reference. Arsikere, Harish / Gupta, Hitesh Anand / Alwan, Abeer (2014): "Speaker recognition via fusion of subglottal features and MFCCs", in INTERSPEECH-2014, 1106-1110.