5th International Conference on Spoken Language Processing
This paper describes experimental results on whole-word HMM-based speech recognition of connected digits in Japanese collected through the telephone network. The training data comprises 756860 digits uttered by 1963 speakers, while the testing data comprises 304212 digits uttered by 852 speakers. The best performance was a word error rate of 0.42% for known-length strings, obtained using context-dependent models. The word error rate was also measured as a function of the training data size. The results showed that at least 3302 samples per speaker and 344 speakers are necessary and sufficient for context-independent training. Error analysis was conducted on the fraction of the population that accounted for the major part of the recognition errors. The results suggested that such speakers arise not simply from speaker characteristics but from a combination of speaker characteristics and environmental conditions.
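For readers unfamiliar with the metric, the word error rate quoted above is conventionally computed as the minimum number of substitutions, deletions, and insertions needed to turn the recognized string into the reference string, divided by the number of reference words. The abstract does not state the scoring procedure actually used, so the following is only a minimal sketch of that conventional definition; the function name `word_error_rate` and the digit-token lists are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Conventional WER: Levenshtein distance over word tokens / len(reference).

    Assumption: this is the standard definition, not necessarily the exact
    scoring procedure used in the paper.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            substitution = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                 # deletion of a reference word
                curr[j - 1] + 1,             # insertion of a hypothesis word
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return prev[m] / n


# Hypothetical four-digit string with one misrecognized digit.
ref = ["3", "1", "4", "1"]
hyp = ["3", "1", "9", "1"]
print(word_error_rate(ref, hyp))
```

When the string length is known to the recognizer, the hypothesis has the same number of digits as the reference, so in practice most counted errors are substitutions; the same function applies unchanged to the unknown-length case.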
Bibliographic reference. Kawai, Hisashi / Higuchi, Norio (1998): "Recognition of connected digit speech in Japanese collected over the telephone network", In ICSLP-1998, paper 0694.