In previous work, we proposed a model for speech-to-speech translation that is sensitive to paralinguistic information such as the duration and power of spoken words. This model uses linear regression to map source acoustic features directly to target acoustic features in continuous space. However, while the model is effective, it faces a scalability issue: a separate model must be trained for every word, which makes it difficult to generalize to words for which no parallel speech is available. In this work we first demonstrate that simply training a single linear regression model on all words is not sufficient to express paralinguistic translation. We then describe a neural network model that has sufficient expressive power to perform paralinguistic translation with a single model. We evaluate the proposed method on a digit translation task and show that a single neural network-based model achieves results similar to those obtained by previous work using word-dependent models.
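As a rough illustration of the word-dependent baseline described above, the sketch below fits a linear regression from source-side paralinguistic features to target-side features with ordinary least squares. All feature names, dimensions, and data here are invented for demonstration and are not the authors' actual features or training setup; the paper's contribution is to replace many such per-word regressions with one neural network shared across words.

```python
import numpy as np

# Hypothetical toy data: N tokens of a single word, each described by
# two source-side paralinguistic features, e.g. [duration, power].
# (Feature choice and dimensions are illustrative assumptions.)
rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=(N, 2))                       # source acoustic features
true_W = np.array([[1.5, 0.2],
                   [0.1, 0.8]])                   # unknown "true" mapping
Y = X @ true_W + 0.01 * rng.normal(size=(N, 2))   # target acoustic features

# Fit Y ≈ [X, 1] W by least squares (bias term appended as a constant column).
Xb = np.hstack([X, np.ones((N, 1))])
W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)

# Predict target-side duration/power for a new source token.
x_new = np.array([[0.5, -0.3, 1.0]])              # features plus bias term
y_pred = x_new @ W
print(y_pred.shape)                               # (1, 2)
```

In the word-dependent setting, one such `W` would be estimated per word from parallel speech; the single-model approach in this paper instead learns one nonlinear mapping that covers all words, including those without parallel data.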
Bibliographic reference: Kano, Takatomo / Takamichi, Shinnosuke / Sakti, Sakriani / Neubig, Graham / Toda, Tomoki / Nakamura, Satoshi (2013): "Generalizing continuous-space translation of paralinguistic information", in Proceedings of INTERSPEECH 2013, pp. 2614-2618.