Speech Recognition and Intrinsic Variation (SRIV2006)

Toulouse, France
May 20, 2006

Intra-Speaker Variation and Units in Human Speech Perception and ASR

Richard Wright

Department of Linguistics, University of Washington, Seattle, WA, USA

Recent research on human speech perception and word recognition on the one hand, and on automatic speech recognition (ASR) on the other, has significantly advanced our understanding of variation in the speech signal. One advance is the recognition that speaker-dependent variation in the speech signal is largely systematic and can therefore be treated as information rather than noise. Another is the recognition that inter-speaker and intra-speaker variation diverge significantly in both their underlying causes and their acoustic characteristics, so success in handling one type of variation may not transfer to the other. Inter-talker variation, frequently referred to as indexical information, results from talker-dependent physiological and anatomical influences on production. It also results from a myriad of talker-dependent experiences with language exposure and use, such as demographic factors, regional accents, and sociolinguistic factors. Although inter-talker variability poses a significant challenge to models of speech perception and ASR, it can be addressed with corpora that are representative of the population being modeled (range of talker sizes, genders, and ages) and with appropriate language descriptions (e.g., different phone representations for words whose pronunciation varies across regional accents). The reason is that most inter-speaker variation remains constant across speaking contexts and is shared by significant sectors of the population (e.g., regional accent, speaker size, gender). Intra-talker variation results from a variety of factors, including linguistic-structural (i.e., allophonic) effects such as the influence of phonetic and prosodic contexts, discourse factors, lexical factors, emotion, and the talker's estimation of the listener's need for clarity and intensity in the signal (due to noise, confusion, or recognition errors).
Clearly, using task-appropriate corpora will improve automatic recognition in the face of intra-speaker variation. However, unlike indexical information, speaker-internal variation does not remain static across an utterance. It therefore requires the listener, and the ASR device, to adapt dynamically to the changes or to be able to predict them. Moreover, many of the changes are sub-phone in nature (addition of a feature through partial assimilation to context, partial deletions) and are therefore best modeled in terms of features rather than by proliferating novel phones to accommodate each new sound created by the addition of a single feature. Traditional models of speech perception are based on abstract and invariant categories, such as phonemes or context-sensitive allophones (equivalent to phones in ASR), and are therefore very poor at handling variation. Recent research on speech perception and word recognition suggests that retrieval of information is affected by systematic variation such as rate, speech style, reduction and hyperarticulation in response to changing informational load, or indexical information. Moreover, there is evidence that human listeners adapt dynamically to listening conditions using partial information. That is, listeners can use underspecified information in making lexical decisions, modifying the weighting of extracted features as the listening conditions change. For example, listeners use a sort of coarse coding of features that groups speech sounds into meta-phone groupings based on similarity distances. Feature-based representations are necessary for modeling underspecification in perceptual responses. Moreover, feature-based pronunciation models for ASR are more efficient than phone-based models, are better suited to incorporating new factors such as prosody, are better for modeling sparse data, and are better suited to dynamic adaptation to changes in the speech signal.
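
As an informal illustration of the contrast between phone-based and feature-based units, the following sketch may help. It is not taken from the paper: the five-feature inventory, the eight-phone table, and the helper names (distance, add_feature, candidates) are all simplifying assumptions chosen for the example. It shows a partially assimilated /n/ expressed as a single feature change rather than a new phone symbol, and an underspecified observation retrieving a coarse group of candidate phones.

```python
# Toy binary feature inventory (an illustrative assumption, not the paper's).
FEATURES = ("voiced", "nasal", "labial", "coronal", "continuant")

# Each phone is a bundle of feature values in the order given by FEATURES.
PHONES = {
    "p": (0, 0, 1, 0, 0),
    "b": (1, 0, 1, 0, 0),
    "m": (1, 1, 1, 0, 0),
    "t": (0, 0, 0, 1, 0),
    "d": (1, 0, 0, 1, 0),
    "n": (1, 1, 0, 1, 0),
    "s": (0, 0, 0, 1, 1),
    "z": (1, 0, 0, 1, 1),
}


def distance(a, b):
    """Count the features on which two bundles disagree (Hamming distance)."""
    return sum(x != y for x, y in zip(a, b))


def add_feature(bundle, name, value=1):
    """Return a copy of a bundle with one feature set to the given value."""
    i = FEATURES.index(name)
    return bundle[:i] + (value,) + bundle[i + 1:]


def candidates(partial):
    """Return the phones consistent with an underspecified observation.

    `partial` maps only the features that were actually extracted to their
    values; features missing from it are treated as unknown.
    """
    return sorted(
        phone
        for phone, bundle in PHONES.items()
        if all(bundle[FEATURES.index(f)] == v for f, v in partial.items())
    )


# Partial assimilation: /n/ before a labial picks up a labial feature while
# keeping its coronal one. The result is not in the phone inventory, so a
# phone-based model would need a new symbol; here it is one feature change.
assimilated_n = add_feature(PHONES["n"], "labial")
print(assimilated_n in PHONES.values())            # False
print(distance(assimilated_n, PHONES["n"]))        # 1
print(distance(assimilated_n, PHONES["m"]))        # 1

# Underspecified input: only some features are known, so the result is a
# coarse group of phones rather than a single phone.
print(candidates({"nasal": 1}))                    # ['m', 'n']
print(candidates({"voiced": 0, "continuant": 0}))  # ['p', 't']
```

In this toy setting the assimilated sound sits one feature away from both /n/ and /m/, which is the kind of intermediate case that a fixed phone inventory cannot represent without adding a symbol.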

Full Paper
Presentation (.pdf)
Sound files can be accessed from within the presentation.

Bibliographic reference.  Wright, Richard (2006): "Intra-speaker variation and units in human speech perception and ASR", In SRIV-2006, 39-42.