ISCA Workshop on Multilingual Speech and Language Processing (MULTILING 2006)

Center for Language and Speech Technology, Stellenbosch University, Stellenbosch, South Africa
April 9-11, 2006

Is it Possible to Train a Speech Recognition System on Text Only?

Enrico Rubagotti

UCD School of Computer Science and Informatics, UCD, Dublin, Ireland

According to the speech recognition literature, one cause of recognition errors is a mismatch between training and testing conditions. One source of such a mismatch is the use of speakers with different accents in training and testing. The mismatch matters because, in both stochastic and deterministic approaches, the system is trained on pairs of acoustic signals and linguistic units. This paper describes the development of a training system that employs only graphemes, and studies the feasibility of a model that recognizes speech using the speech signal, a bigram model, four-gram frequencies, and a measure of the distance of a text from a specific language. Such a system should be independent of variations in pronunciation and usable for languages for which a corpus has not yet been developed. A model was specified for the class of shallow languages, and an experiment was carried out on a phonotypical transcription in Italian, with a 22% word error rate (WER). In this preliminary phase the input to the system was phonemes rather than the acoustic signal, in order to reduce computational complexity. The algorithm employed in the test maps phonemes to graphemes using a map that changes dynamically to minimise the distance of the output from the expected language. In conventional phoneme parsing, the phoneme-to-grapheme mapping is fixed before recognition begins; in our method, the map chosen is the one that minimises the distance between the output and the expected language.
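The idea of choosing the phoneme-to-grapheme map that minimises the distance from the expected language can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the exhaustive search over candidate maps, the add-one-smoothed character bigram model, the use of average negative log-probability as the distance, the toy Italian sample and the candidate grapheme lists are all assumptions made for illustration. The function names (train_bigram, distance, best_map) are hypothetical, and the four-gram frequencies mentioned above are not modelled.

```python
import itertools
import math
from collections import Counter


def train_bigram(text):
    """Collect character bigram statistics from a text-only sample."""
    bigrams = Counter(zip(text, text[1:]))   # counts of adjacent character pairs
    contexts = Counter(text[:-1])            # counts of each left-hand character
    alphabet = set(text)
    return bigrams, contexts, alphabet


def distance(output, model):
    """Average negative log-probability of `output` under the bigram model.

    Assumption: this stands in for the "distance from the expected
    language"; add-one smoothing handles unseen bigrams.
    """
    bigrams, contexts, alphabet = model
    vocab = len(alphabet) + 1                # +1 reserves mass for unseen symbols
    pairs = list(zip(output, output[1:]))
    total = 0.0
    for a, b in pairs:
        p = (bigrams[(a, b)] + 1) / (contexts[a] + vocab)
        total -= math.log(p)
    return total / max(1, len(pairs))


def best_map(phonemes, candidates, model):
    """Try every phoneme-to-grapheme map and keep the closest output."""
    symbols = sorted(set(phonemes))
    best = None
    for choice in itertools.product(*(candidates[s] for s in symbols)):
        mapping = dict(zip(symbols, choice))
        output = "".join(mapping[p] for p in phonemes)
        d = distance(output, model)
        if best is None or d < best[0]:
            best = (d, mapping, output)
    return best


# Toy text-only training sample (hypothetical, not from the paper).
sample = "la casa che chiama il cane e la cosa che chiude la chiesa"
model = train_bigram(sample)

# Hypothetical candidate graphemes for each phoneme symbol.
candidates = {"k": ["c", "k", "ch"], "a": ["a"], "s": ["s", "z"]}

score, mapping, output = best_map(["k", "a", "s", "a"], candidates, model)
print(output, mapping)
```

On this toy sample the search selects the spelling "casa", because the bigrams c-a, a-s and s-a are better attested in the sample than those of the competing spellings. An exhaustive search is only practical for very small inventories; a map that changes dynamically during recognition, as the abstract describes, would need an incremental search strategy instead.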


Bibliographic reference. Rubagotti, Enrico (2006): "Is it possible to train a speech recognition system on text only?", in MULTILING-2006, paper 020.