Interspeech'2005 - Eurospeech
Speech acoustics varies from speaker to speaker, microphone to microphone, room to room, line to line, and so on. Physically speaking, every speech sample is distorted. Socially speaking, however, speech is the easiest communication medium for humans. To cope with these inevitable distortions, speech engineers have built HMMs from the speech data of hundreds or thousands of speakers, and the resulting models are called speaker-independent models. However, these models often need to be adapted to the input speaker or environment, which indicates that they are not truly speaker-independent. Recently, a novel acoustic representation of speech was proposed in which the above distortions are hardly visible. It discards all the acoustic substance of speech and captures only the interrelations among speech events, representing speech acoustics structurally. The new representation can be interpreted linguistically as a physical implementation of structural phonology, and psychologically as speech Gestalt. In this paper, the first recognition experiment was carried out to investigate the performance of the new representation. The results showed that the new models, trained on a single speaker with no normalization, can outperform conventional models trained on 4,130 speakers with CMN (cepstral mean normalization).
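The abstract does not spell out how the interrelations are computed, but one common way to realize such a distortion-robust structure is a matrix of Bhattacharyya distances between the distributions of speech events: an invertible affine transform of the feature space (a simple model of speaker or channel distortion) maps every distribution the same way and leaves all pairwise distances unchanged. The sketch below illustrates this invariance on synthetic Gaussian "vowel" distributions; it is a minimal illustration of the principle, not the paper's exact formulation, and the distribution parameters are invented for the demo.

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussian distributions."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def structure(means, covs):
    """Distance matrix over all pairs of distributions: the 'structure'."""
    n = len(means)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = bhattacharyya(means[i], covs[i],
                                              means[j], covs[j])
    return D

# Five synthetic 2-D "vowel" distributions (illustrative values only).
rng = np.random.default_rng(0)
means = [rng.normal(size=2) for _ in range(5)]
covs = []
for _ in range(5):
    a = rng.normal(size=(2, 2))
    covs.append(a @ a.T + 2.0 * np.eye(2))  # symmetric positive definite

D = structure(means, covs)

# Apply one invertible affine map to every distribution, modeling a
# speaker/channel distortion acting uniformly on the feature space.
A = np.array([[1.3, 0.4], [-0.2, 0.9]])
b = np.array([2.0, -1.0])
means_d = [A @ m + b for m in means]
covs_d = [A @ c @ A.T for c in covs]
D_distorted = structure(means_d, covs_d)

print(np.allclose(D, D_distorted))  # the structure survives the distortion
```

The acoustic substance (the means and covariances themselves) changes completely under the transform, yet the distance matrix does not, which is the sense in which such a representation "discards substance and keeps interrelations."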
Bibliographic reference. Murakami, Takao / Maruyama, Kazutaka / Minematsu, Nobuaki / Hirose, Keikichi (2005): "Japanese vowel recognition based on structural representation of speech", In INTERSPEECH-2005, 1261-1264.