INTERSPEECH 2004 - ICSLP
To improve the performance and usability of speech recognition devices, most applications need to let users enter new words or personalize words in the system vocabulary. Voice tagging is a simple example of using speaker-dependent spoken samples to generate baseform transcriptions of spoken words. More sophisticated techniques can use both spoken samples and text versions of the new words to generate baseform transcriptions. In this paper, we propose a maximum context tree (MCT) based approach to the problem and compare it with the common decision tree based method and the Pronunciation by Analogy (PbA) approach. The new approach gives exact baseform transcriptions for in-vocabulary words and performs better than the decision tree. It also performs significantly better than the PbA approach while using less memory. MCT uses word segment probabilities rather than the frequency counts used in PbA, and it uses the full context of the focus letter to overcome some deficiencies of the PbA approach.
Bibliographic reference. Ma, Changxue (2004): "Automatic phonetic base form generation based on maximum context tree", In INTERSPEECH-2004, 457-460.