This paper proposes a method for modifying the expression or emotion of a speech sample without altering the speaker's identity. The method exploits a statistical speech model that factorises speaker identity from expression using linear transforms. The set of transforms that best fits the speaker and expression of the input sample is learned, then combined with the expression transforms of the desired expression, taken from another speaker. Since the factorised expression transform carries information about expression only, it can be applied to the original speech sample to convert its expression to the desired one without altering the speaker's identity. Notably, the method applies universally to any voice and requires no parallel training corpus.
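The core idea of combining factorised transforms can be illustrated with a minimal numpy sketch. This is an assumption-laden simplification, not the paper's implementation: it treats each factor (speaker, expression) as a single affine transform applied to a model mean vector, in the spirit of CMLLR-style adaptation, so that an expression transform estimated on one speaker can be cascaded with another speaker's transform.

```python
import numpy as np

def apply_affine(A, b, mu):
    """Apply one affine transform mu' = A @ mu + b to a mean vector."""
    return A @ mu + b

rng = np.random.default_rng(0)
dim = 3

# Canonical (speaker- and expression-independent) model mean (toy data).
mu = rng.standard_normal(dim)

# Affine transform estimated for the input speaker (hypothetical values).
A_spk, b_spk = 1.1 * np.eye(dim), 0.1 * rng.standard_normal(dim)

# Expression transform borrowed from a *different* speaker: because the
# factors are separated, it is assumed to carry expression information only.
A_expr, b_expr = 0.9 * np.eye(dim), 0.1 * rng.standard_normal(dim)

# Cascade: keep the original speaker transform, swap in the new expression.
mu_converted = apply_affine(A_expr, b_expr, apply_affine(A_spk, b_spk, mu))
print(mu_converted.shape)  # (3,)
```

In the actual factorised HMM-TTS model the transforms act on many Gaussian components with regression classes rather than a single mean, but the compositional structure sketched above is what lets an expression learned from one voice be reapplied to another.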
Bibliographic reference. Latorre, Javier / Wan, Vincent / Yanagisawa, Kayoko (2014): "Voice expression conversion with factorised HMM-TTS models", in Proceedings of INTERSPEECH 2014, 1514–1518.