Acoustic models in state-of-the-art LVCSR systems are typically trained on data from thousands of speakers and then adapted to an individual speaker using, e.g., various combinations of CMLLR, MLLR, and MAP. This adaptation step is particularly important for speakers whose accents are not well represented in the training set. The present study explores how to improve performance on South-Asian-accented (SoA-accented) speakers given thousands of hours of US-accented, hundreds of hours of SoA-accented, and tens of hours of speaker-specific training data. We employ a decision tree similarity measure to analyze how co-articulation differences across accents and speakers manifest themselves in the decision tree. Modeling these variations in addition to adapting the GMMs of an existing baseline system to a speaker improved WER for small systems (1k GMMs), but the improvement for systems with larger trees (2k, 3k GMMs) was modest. Overall, GMM adaptation/retraining yields significant performance benefits, and training a SoA-accent-specific system is particularly worthwhile when speaker adaptation data is lacking.
Bibliographic reference. Telaar, Dominic / Fuhs, Mark C. (2013): "Accent- and speaker-specific polyphone decision trees for non-native speech recognition", In INTERSPEECH-2013, 3313-3316.