First International Conference on Spoken Language Processing (ICSLP 90)
This paper addresses the problem of contexts that are missing from the training data set, which often arises in the context-dependent approach to speech modeling for speech recognition, and proposes a solution based on a phoneme environment clustering (PEC) algorithm. Context-dependent (phoneme-environment-dependent) phoneme models are expected to be very helpful in both speech recognition and speech synthesis. However, they introduce a difficult problem: the variety of contexts is often too large to be covered by the training data and, even when coverage is possible, the number of samples for each context may be too small to train its HMM phonetic model. PEC is one possible solution to this problem. It is a general framework for handling phoneme contexts (or, more generally, phoneme environments) that allows an arbitrary number of clusters of phoneme variants, or "acoustic allophones", chosen to minimize a total distortion measure. Because it is based on successive binary division of an abstract space, it inherently has the ability to interpolate unknown contexts. The paper describes how contextual 'holes' and 'gaps' are interpolated. This ability is demonstrated through pattern prediction experiments and evaluation with a separability measure. The scheme is tested in an HMM-based phoneme recognition experiment on 5,240 words, half of which were used for training and the other half for testing. The results show that the recognition error rate is reduced from 9.4% with 25 phoneme models to 3.1% with PEC using 256 clusters.
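The idea of successive binary division with interpolation of unseen contexts can be illustrated with a toy sketch. This is not the authors' implementation: the squared-error distortion, the exhaustive search over value subsets, the rule that unseen values follow the "no" branch, and all names (`Node`, `best_split`, `grow`, `predict`, `train`) are assumptions made for illustration only.

```python
from itertools import combinations

def centroid(samples):
    """Mean feature vector of a list of (context, features) samples."""
    dim = len(samples[0][1])
    return [sum(x[d] for _, x in samples) / len(samples) for d in range(dim)]

def distortion(samples):
    """Sum of squared distances to the centroid (assumed distortion measure)."""
    if not samples:
        return 0.0
    c = centroid(samples)
    return sum((x[d] - c[d]) ** 2 for _, x in samples for d in range(len(c)))

def best_split(samples):
    """Best binary division of one cluster: (gain, factor, subset, inside, outside)."""
    best = (1e-12, None, None, None, None)  # require a strictly positive gain
    base = distortion(samples)
    for f in range(len(samples[0][0])):      # each environment factor
        values = sorted({ctx[f] for ctx, _ in samples})
        for k in range(1, len(values)):      # exhaustive search: toy sizes only
            for subset in combinations(values, k):
                inside = [s for s in samples if s[0][f] in subset]
                outside = [s for s in samples if s[0][f] not in subset]
                gain = base - distortion(inside) - distortion(outside)
                if gain > best[0]:
                    best = (gain, f, set(subset), inside, outside)
    return best

class Node:
    """A region of the environment space; leaves are the clusters."""
    def __init__(self, samples):
        self.samples = samples
        self.factor = self.subset = self.yes = self.no = None

def grow(samples, n_clusters):
    """Repeatedly divide the leaf whose split reduces total distortion most."""
    root = Node(samples)
    leaves = [root]
    while len(leaves) < n_clusters:
        split, leaf = max(((best_split(l.samples), l) for l in leaves),
                          key=lambda pair: pair[0][0])
        gain, factor, subset, inside, outside = split
        if factor is None:                   # no division reduces distortion
            break
        leaf.factor, leaf.subset = factor, subset
        leaf.yes, leaf.no = Node(inside), Node(outside)
        leaves.remove(leaf)
        leaves += [leaf.yes, leaf.no]
    return root

def predict(root, context):
    """Centroid of the region containing `context`, even if it was never observed."""
    node = root
    while node.factor is not None:
        node = node.yes if context[node.factor] in node.subset else node.no
    return centroid(node.samples)

# Toy data: (left context, right context) -> 1-D feature; ("i", "u") is a 'hole'.
train = [(("a", "a"), [0.0]), (("a", "i"), [1.0]), (("a", "u"), [1.0]),
         (("i", "a"), [10.0]), (("i", "i"), [11.0]),
         (("u", "a"), [10.0]), (("u", "i"), [11.0]), (("u", "u"), [11.0])]
root = grow(train, 2)
hole_estimate = predict(root, ("i", "u"))
```

Because each cluster is a region of the environment space rather than a list of observed contexts, the missing context `("i", "u")` still falls inside a region and receives that region's centroid as its estimate.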
Bibliographic reference. Sagayama, Shigeki / Honma, Shigeru (1990): "Estimation of unknown context using a phoneme environment clustering algorithm", In ICSLP-1990, 361-364.