Robust Speech Recognition for Unknown Communication Channels, Pont-à-Mousson, France
Current-generation speech recognition systems seek
to identify words via analysis of their underlying
phonological constituents. Although this stratagem
works well for carefully enunciated speech emanating
from a pristine acoustic environment, it has fared less
well for recognizing speech spoken under more realistic
conditions, such as:
(1) moderate to high levels of background noise;
(2) moderately reverberant acoustic environments;
(3) spontaneous, informal conversation.
Under such "real-world" conditions, the acoustic
properties of speech make it difficult to partition the
acoustic stream into readily definable phonological units,
thus rendering the process of word recognition highly
vulnerable to departures from "canonical" patterns.
Analysis of informal, spontaneous speech indicates
that the stability of linguistic representation is more
likely to reside on the syllabic and phrasal levels than on
the phonological. In consequence, attempts to represent
words merely as sequences of phones, and to derive
meaning from simple chains of lexical entities, are
unlikely to yield high levels of recognition performance
under such real-world conditions.
A multi-tiered representation of speech is proposed, one in which only partial information from each of many levels of linguistic abstraction is required for sufficient identification of lexical and phrasal elements. Such tiers of linguistic abstraction are unified through a hierarchically organized process of temporal binding and are, in principle, highly tolerant of the sorts of "distortions" imposed on speech in the real world.
Bibliographic reference. Greenberg, Steven (1997): "On the origins of speech intelligibility in the real world", In RSR-1997, 23-32.