We propose a multi-accent deep neural network acoustic model with an accent-specific top layer and shared bottom hidden layers. The accent-specific top layer models the patterns distinct to each accent, while the shared bottom hidden layers allow maximum knowledge sharing between the native and the accent models. The design is computationally efficient, which makes it particularly attractive for deployment in a live speech service. We apply KL-divergence (KLD) regularized model adaptation to train the accent-specific top layer. On a mobile short message dictation (SMD) task, with 1K, 10K, and 100K accent adaptation utterances, the proposed approach achieves word error rate reductions (WERR) of 18.1%, 26.0%, and 28.5% for the British accent and 16.1%, 25.4%, and 30.6% for the Indian accent, over a baseline cross-entropy (CE) model trained on 400 hours of data. In the 100K-utterance adaptation setup, comparable gains are obtained over a baseline CE model trained on 2000 hours of data. We observe smaller yet significant WER reductions on a baseline model trained with the MMI sequence-level criterion.
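As an illustration of the architecture described in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each accent's top layer starts as a copy of the native top layer, that the shared hidden layers stay frozen during adaptation, and that KLD regularization takes the commonly used form in which the training target is an interpolation between the one-hot label and the native model's posterior. The class name `MultiAccentDNN`, the function `kld_adapt_step`, the sigmoid activations, the layer sizes, and the values of `rho` and `lr` are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class MultiAccentDNN:
    """Shared bottom hidden layers plus one softmax top layer per accent."""

    def __init__(self, dims, accents, seed=0):
        # dims = [input, hidden_1, ..., hidden_k, output]
        rng = np.random.default_rng(seed)
        self.shared = [
            (rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-2], dims[1:-1])
        ]
        w = rng.normal(0.0, 0.1, (dims[-2], dims[-1]))
        b = np.zeros(dims[-1])
        # Assumption: every accent top layer is initialized from the native one.
        self.top = {name: (w.copy(), b.copy()) for name in ["native", *accents]}

    def hidden(self, x):
        # Forward pass through the shared (accent-independent) layers.
        h = x
        for w, b in self.shared:
            h = 1.0 / (1.0 + np.exp(-(h @ w + b)))
        return h

    def posteriors(self, x, accent):
        w, b = self.top[accent]
        return softmax(self.hidden(x) @ w + b)

def kld_adapt_step(model, x, labels, accent, rho=0.5, lr=0.1):
    """One gradient step on the accent top layer only; returns the pre-update loss."""
    h = model.hidden(x)                      # shared layers are not updated
    w, b = model.top[accent]
    p = softmax(h @ w + b)
    p_native = model.posteriors(x, "native")
    one_hot = np.eye(p.shape[1])[labels]
    # KLD-regularized target: interpolate labels with the native posterior.
    target = (1.0 - rho) * one_hot + rho * p_native
    loss = -np.mean(np.sum(target * np.log(p + 1e-12), axis=1))
    # Gradient of cross entropy w.r.t. logits for a soft target is (p - target).
    grad_logits = (p - target) / len(x)
    model.top[accent] = (w - lr * h.T @ grad_logits,
                         b - lr * grad_logits.sum(axis=0))
    return loss
```

In this sketch, `rho = 0` reduces to plain cross-entropy fine-tuning of the top layer, while `rho = 1` keeps the adapted layer tied to the native model's output. At run time only the small top layer differs per accent, so one forward pass through the shared layers can serve every accent.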
Bibliographic reference. Huang, Yan / Yu, Dong / Liu, Chaojun / Gong, Yifan (2014): "Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation", In INTERSPEECH-2014, 2977-2981.