This paper describes an improved method for the structure to speech conversion framework that we previously proposed. Most speech synthesizers take a phoneme sequence as input and generate speech by converting each phoneme into its corresponding sound. In other words, they simulate the human process of reading text aloud. However, infants usually acquire the ability to communicate by speech without text or phoneme sequences. Since their phonemic awareness is very immature, they can hardly decompose an utterance into a sequence of phones or phonemes. According to developmental psychology, infants acquire the holistic sound patterns of words, called word Gestalt, from the utterances of their parents, and they reproduce these patterns with their own vocal tubes. This behavior is called vocal imitation. In our previous studies, the word Gestalt was defined physically and a method of extracting it from a word utterance was proposed. We have already applied the word Gestalt to ASR, CALL, and speech generation; we call the last of these structure to speech conversion. Unlike reading machines, our framework simulates infants’ vocal imitation. In this paper, a method for improving our speech generation framework based on a structural cost function is proposed and evaluated.
Bibliographic reference. Saito, Daisuke / Qiao, Yu / Minematsu, Nobuaki / Hirose, Keikichi (2009): "Optimal event search using a structural cost function - improvement of structure to speech conversion", In INTERSPEECH-2009, 2047-2050.