September 22-25, 1997
Recently, we developed a probabilistic framework for segment-based speech recognition that represents the speech signal as a network of segments and associated feature vectors. Although, in general, a path through the network does not traverse all segments, we argued that each path must account for all feature vectors in the network. We then demonstrated an efficient search algorithm that uses a single additional model to account for segments that are not traversed. In this paper, we present two new extensions to our framework. First, we replace our acoustic segmentation algorithm with "segmentation by recognition," a probabilistic algorithm that can combine multiple contextual constraints to hypothesize only the most likely segments. Second, we generalize our framework to "near-miss modeling" and describe a search algorithm that can efficiently use multiple models to enforce contextual constraints across all segments in a network. We report experiments in phonetic recognition on the TIMIT corpus in which we achieve a diphone context-dependent error rate of 26.6% on the NIST core test set over 39 classes. This is a 12.8% reduction in error rate from our best previously reported result.
Bibliographic reference. Chang, Jane W. / Glass, James R. (1997): "Segmentation and modeling in segment-based recognition", In EUROSPEECH-1997, 1199-1202.