5th European Conference on Speech Communication and Technology

Rhodes, Greece
September 22-25, 1997

Between Recognition and Synthesis - 300 Bits/Second Speech Coding

Mohamed Ismail, Keith Ponting

DERA Malvern Speech Research Unit, Malvern, Worcs, England, UK

This paper describes a speech coding system designed to operate at 300 bits/sec and below. A continuous speech recogniser transcribes incoming speech as a sequence of sub-word units termed acoustic segments. Prosodic information is combined with segment identity to form a serial data stream suitable for transmission. At the receiver, a rule-based system maps segment identity and prosodic information to parameters suitable for driving a parallel formant speech synthesiser. Acoustic segment Hidden Markov Models (HMMs) are shown to perform as well as conventional phone HMMs during recognition. A segment error rate of 3.8% was achieved in a speaker-dependent, task-dependent configuration, with an average data rate of 262 bits/sec. Speech from the synthesiser was better than that obtainable from a purely textual representation, though not as good as 2400 bits/sec Linear Predictive Coding (LPC) vocoded speech.

Bibliographic reference. Ismail, Mohamed / Ponting, Keith (1997): "Between recognition and synthesis - 300 bits/second speech coding", In EUROSPEECH-1997, 441-444.