INTERSPEECH 2004 - ICSLP
This work deals with a speech coder that is based on a model of the human peripheral auditory system and uses a neural auditory representation as its code. This representation consists of sparse multi-channel pulse trains but is still highly overcomplete and therefore inefficient in terms of data compression. The paper focuses on the question 'How sparse can we make the auditory representation?', i.e., on finding a bound on the number of pulses that can be omitted without degrading the quality of the reconstructed speech signal. For this purpose we incorporate a second auditory model, which allows us to decide whether a pulse is needed based on the excitation pattern caused by the signal that results when a single pulse is resynthesized. This model accounts for both simultaneous and temporal masking. We also propose a method that compensates for the loss of energy due to the elimination of pulses, which makes possible spectral distortion inaudible. Results show that about 74% of the pulses can be omitted while maintaining the original speech quality.
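To make the pruning idea concrete, the following is a minimal toy sketch and not the authors' algorithm. It assumes that a pulse is a hypothetical `(channel, time, amplitude)` tuple with positive amplitude, and it replaces the second auditory model with an invented linear-in-dB masking rule that is symmetric in time; the slopes `chan_slope_db` and `time_slope_db` are arbitrary illustration values. Pulses whose level falls below the threshold cast by a stronger kept pulse are dropped, and the survivors in each channel are rescaled so that the channel energy matches the original.

```python
import math
from collections import defaultdict

# Hypothetical pulse format: (channel index, time index, positive amplitude).

def masked_threshold_db(masker, probe, chan_slope_db=10.0, time_slope_db=1.0):
    """Toy threshold (dB) that `masker` casts at the position of `probe`.

    Assumption: the threshold decays linearly in dB with channel distance
    (stand-in for simultaneous masking) and with time distance (stand-in
    for temporal masking). A real auditory model is far more detailed.
    """
    m_chan, m_time, m_amp = masker
    p_chan, p_time, _ = probe
    level_db = 20.0 * math.log10(m_amp)
    return (level_db
            - chan_slope_db * abs(m_chan - p_chan)
            - time_slope_db * abs(m_time - p_time))

def prune_pulses(pulses):
    """Greedily keep pulses from strongest to weakest, dropping masked ones."""
    kept = []
    for probe in sorted(pulses, key=lambda p: -p[2]):
        probe_db = 20.0 * math.log10(probe[2])
        if any(masked_threshold_db(m, probe) > probe_db for m in kept):
            continue  # judged inaudible under the toy rule
        kept.append(probe)
    return kept

def compensate_energy(original, kept):
    """Rescale kept pulses so each channel keeps its original energy.

    Channels that lose all of their pulses cannot be compensated here.
    """
    e_orig, e_kept = defaultdict(float), defaultdict(float)
    for chan, _, amp in original:
        e_orig[chan] += amp * amp
    for chan, _, amp in kept:
        e_kept[chan] += amp * amp
    return [(chan, t, amp * math.sqrt(e_orig[chan] / e_kept[chan]))
            for chan, t, amp in kept]

# Usage with three made-up pulses.
pulses = [(0, 0, 1.0), (0, 1, 0.5), (5, 0, 0.2)]
kept = prune_pulses(pulses)
compensated = compensate_energy(pulses, kept)
omitted_fraction = 1.0 - len(kept) / len(pulses)
```

In this toy setting, `omitted_fraction` plays the role of the pruning rate reported in the abstract, although the actual figure depends entirely on the auditory model used for the masking decision.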
Bibliographic reference. Feldbauer, Christian / Kubin, Gernot (2004): "How sparse can we make the auditory representation of speech?", In INTERSPEECH-2004, 1997-2000.