Analysis of Mismatched Transcriptions Generated by Humans and Machines for Under-Resourced Languages

Van Hai Do, Nancy F. Chen, Boon Pang Lim, Mark Hasegawa-Johnson


When speech data with native transcriptions are scarce in an under-resourced language, automatic speech recognition (ASR) must be trained using other methods. Semi-supervised self-training first labels the speech using ASR from other languages, then re-trains the ASR using the generated labels. Mismatched crowdsourcing asks crowd-workers unfamiliar with the language to transcribe it. In this paper, self-training and mismatched crowdsourcing are compared under exactly matched conditions. Specifically, speech data of the target language are decoded by the source-language ASR systems into source-language phone/word sequences. We find that (1) human mismatched crowdsourcing and cross-lingual ASR have similar error patterns, but different specific errors; (2) these two sources of information can be usefully combined to train a better target-language ASR; and (3) the differences between the error patterns of non-native human listeners and non-native ASR are small, but when differences are observed, they provide information about the relationship between the phoneme systems of the annotator/source language (Mandarin) and the target language (Vietnamese).
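As a concrete illustration of the comparison described above, the sketch below (Python, not from the paper) aligns a hypothetical cross-lingual ASR phone decode against a hypothetical crowd-worker transliteration of the same utterance, tallies their disagreements, and keeps the phones on which they agree. All phone sequences, labels, and the agreement-based merging rule are illustrative assumptions, not the authors' method.

from collections import Counter

def align(ref, hyp):
    """Levenshtein alignment between two phone sequences.
    Returns a list of (ref_phone, hyp_phone) pairs, with None marking
    insertions and deletions."""
    n, m = len(ref), len(hyp)
    # Dynamic-programming table of edit costs.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back to recover the aligned phone pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None)); i -= 1
        else:
            pairs.append((None, hyp[j - 1])); j -= 1
    return pairs[::-1]

# Hypothetical Mandarin phone decodes of the same Vietnamese utterance:
machine = ["b", "a", "n", "g", "o"]   # cross-lingual ASR output
human   = ["b", "a", "ng", "o"]       # crowd-worker transliteration

# Error-pattern comparison: count positions where the two sources disagree.
confusions = Counter(p for p in align(machine, human) if p[0] != p[1])
print(confusions)

# Naive combination: keep phones on which both sources agree; such
# agreed labels could then seed target-language ASR training.
agreed = [a for a, b in align(machine, human) if a == b]
print(agreed)

In this toy example the two sources agree on most phones but differ where the ASR splits "ng" into "n"+"g", mirroring the paper's observation that humans and machines make similar but not identical errors.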


DOI: 10.21437/Interspeech.2016-736

Cite as:

Do, V.H., Chen, N.F., Lim, B.P., Hasegawa-Johnson, M. (2016) Analysis of Mismatched Transcriptions Generated by Humans and Machines for Under-Resourced Languages. Proc. Interspeech 2016, 3863-3867.

Bibtex
@inproceedings{Do+2016,
  author={Van Hai Do and Nancy F. Chen and Boon Pang Lim and Mark Hasegawa-Johnson},
  title={Analysis of Mismatched Transcriptions Generated by Humans and Machines for Under-Resourced Languages},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-736},
  url={http://dx.doi.org/10.21437/Interspeech.2016-736},
  pages={3863--3867}
}