It is well known that speaker-dependent acoustic models can achieve error rates up to a factor of two lower than those of well-trained speaker-independent acoustic models. For improved accuracy, many modern dictation systems therefore require the user to complete enrollment sessions that adapt the system's acoustic model. In this paper, we present an approach that uses as few as three sentences from the test speaker to select the N closest speakers (cohorts) from both the original training set and newly available training speakers, and constructs customized models from them. With this approach, our adaptation scheme can be updated online without reconfiguring anything that has already been calculated. When applied to dialectal differences, the cohort-based user-specific models constructed from 3 user sentences can obtain a lower error rate even when compared with user-adapted models based on 170 user sentences.
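The cohort selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how "closeness" between speakers is measured, so the sketch assumes one plausible choice: each training speaker is summarized by a single diagonal-covariance Gaussian over feature vectors, and speakers are ranked by the average log-likelihood of the test speaker's frames. The names `avg_log_likelihood`, `select_cohorts`, `speaker_models`, and the toy data are all hypothetical and do not come from the paper. Building the customized models from the selected cohorts is not shown.

```python
import math


def avg_log_likelihood(frames, mean, var):
    """Average per-frame log-likelihood under a diagonal-covariance Gaussian.

    ASSUMPTION: one Gaussian per speaker stands in for whatever speaker
    representation the paper actually uses.
    """
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


def select_cohorts(frames, speaker_models, n):
    """Return the ids of the n training speakers whose models best fit the frames."""
    ranked = sorted(
        speaker_models.items(),
        key=lambda item: avg_log_likelihood(frames, item[1][0], item[1][1]),
        reverse=True,  # highest likelihood first
    )
    return [speaker_id for speaker_id, _ in ranked[:n]]


# Hypothetical toy data: speaker id -> (mean vector, variance vector).
speaker_models = {
    "spk_a": ([0.0, 0.0], [1.0, 1.0]),
    "spk_b": ([5.0, 5.0], [1.0, 1.0]),
    "spk_c": ([0.5, -0.5], [1.0, 1.0]),
}

# Stand-in for features extracted from a few sentences of the test speaker.
test_frames = [[0.1, -0.2], [0.4, 0.1], [-0.3, 0.2]]

cohorts = select_cohorts(test_frames, speaker_models, n=2)

# A newly available training speaker only adds one entry to the pool;
# the statistics of the existing speakers are left untouched.
speaker_models["spk_d"] = ([0.1, 0.0], [1.0, 1.0])
cohorts_after_update = select_cohorts(test_frames, speaker_models, n=2)
```

In this sketch, adding a speaker to the pool requires no change to the stored models of the other speakers, which mirrors the abstract's point about online updates; whether the paper realizes that property the same way is not stated in the abstract.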
Cite as: Wu, J., Chang, E. (2001) Cohorts based custom models for rapid speaker and dialect adaptation. Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001), 1261-1264, doi: 10.21437/Eurospeech.2001-327
@inproceedings{wu01_eurospeech,
  author={Jian Wu and Eric Chang},
  title={{Cohorts based custom models for rapid speaker and dialect adaptation}},
  year=2001,
  booktitle={Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001)},
  pages={1261--1264},
  doi={10.21437/Eurospeech.2001-327}
}