SLaTE 2015 - Workshop on Speech and Language Technology in Education
Automatic speech recognition (ASR) is more difficult for children's speech than for adults' speech. This paper explores two explanations of this phenomenon: (A) that it is due to predictable phonological effects associated with language acquisition in children, or (B) that it is due to the general increase in acoustic variability that has been observed in children's speech. Phone recognition experiments are conducted on hand-labelled data for children aged between 5 and 6. A statistical comparison of the resulting confusion matrix with that for adult speech (TIMIT) shows significant increases in phone substitution rates for children, some of which correspond to established phonological phenomena (type A errors). However, these account for only a small proportion of errors, and those associated with general acoustic variability (type B) appear to account for the majority. The study also shows significantly more deletion errors in ASR for children's speech. Overall, the results suggest that attempts to improve ASR accuracy for children's speech by accommodating phonological phenomena associated with language acquisition, for example by changing the pronunciation dictionary, are unlikely to deliver significant success in the short term, and that coping with the increased acoustic variability in children's speech should be the immediate priority.
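The abstract does not say which statistical test was used to compare the two confusion matrices. As an illustration only, the sketch below assumes a one-sided two-proportion z-test applied to a single confusion-matrix cell: it asks whether a given phone substitution occurs at a higher rate in the child data than in the adult data. The function name and all counts are hypothetical and are not taken from the paper.

```python
import math


def substitution_rate_test(child_subs, child_total, adult_subs, adult_total):
    """One-sided two-proportion z-test (an assumed test, not the paper's).

    child_subs / adult_subs: times the reference phone was recognised as
    one particular other phone (a single confusion-matrix cell).
    child_total / adult_total: occurrences of the reference phone.
    Returns (z, p_value) for the alternative "child rate > adult rate".
    """
    p_child = child_subs / child_total
    p_adult = adult_subs / adult_total

    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (child_subs + adult_subs) / (child_total + adult_total)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / child_total + 1.0 / adult_total))
    if se == 0.0:
        # Pooled rate is exactly 0 or 1: the data carry no evidence of a difference.
        return 0.0, 1.0

    z = (p_child - p_adult) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts for one substitution, for illustration only.
z, p = substitution_rate_test(child_subs=30, child_total=200,
                              adult_subs=10, adult_total=200)
print(f"z = {z:.2f}, one-sided p = {p:.5f}")
```

Running such a test over every cell of a confusion matrix means many comparisons, so a correction for multiple testing would normally be applied before calling any single increase significant.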
Bibliographic reference. Fringi, Eva / Lehman, Jill Fain / Russell, Martin (2015): "Analysis of phone errors in computer recognition of children's speech", In SLaTE-2015, 101-105.