INTERSPEECH 2008
9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Structure to Speech Conversion - Speech Generation Based on Infant-Like Vocal Imitation

Daisuke Saito, Satoshi Asakawa, Nobuaki Minematsu, Keikichi Hirose

University of Tokyo, Japan

This paper proposes a new framework for speech generation that imitates infants' vocal imitation. Most speech synthesizers take a phoneme sequence as input and generate speech by converting each phoneme into a sound sequentially. In other words, they simulate the human process of reading text aloud. However, infants usually acquire the ability to generate speech without text or phoneme sequences. Since their phonemic awareness is very immature, they can hardly decompose a word utterance into a sequence of phones. In this situation, as developmental psychology states, infants acquire the holistic sound pattern of a word, called the word Gestalt, from the utterances of their parents, and they reproduce it with their own vocal tubes. This behavior is called vocal imitation. In our previous studies, the word Gestalt was defined physically, and a method of extracting it from an utterance was proposed and applied successfully to automatic speech recognition (ASR) and computer-assisted language learning (CALL). In this paper, a method of converting the word Gestalt back to speech is proposed and evaluated. Unlike a reading machine, our proposal simulates infants' vocal imitation.


Bibliographic reference. Saito, Daisuke / Asakawa, Satoshi / Minematsu, Nobuaki / Hirose, Keikichi (2008): "Structure to speech conversion - speech generation based on infant-like vocal imitation", in INTERSPEECH-2008, 1837-1840.