This paper presents a simple count-based approach to learning word vector representations by leveraging statistics of co-occurrences between text and speech. This type of representation requires two discrete sequences of units defined across modalities. Two possible methods for the discretization of an acoustic signal are presented; applied to the fundamental frequency and energy contours of a transcribed corpus of speech, they yield a sequence of discrete acoustic events aligned with a sequence of textual objects (e.g. words, syllables). Constructing a matrix recording the co-occurrence of textual objects with acoustic events, and reducing its dimensionality by matrix decomposition, results in a set of context-independent representations of word types. These are applied to the task of acoustic modelling for speech synthesis; objective and subjective results indicate that these representations are useful for the generation of acoustic parameters in a text-to-speech (TTS) system. In general, we observe that performance improves as more discretization approaches, acoustic signals, and levels of linguistic analysis are incorporated into the TTS system via these count-based representations.
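To make the count-based pipeline concrete, the following is a minimal Python sketch, not the authors' implementation: it counts co-occurrences between word types and discrete acoustic events over a toy aligned corpus, then factorizes the count matrix with a truncated SVD (one common choice of matrix decomposition) to obtain low-dimensional, context-independent word vectors. The corpus, the event labels, and the use of SVD are illustrative assumptions.

```python
# Sketch: word/acoustic-event co-occurrence counts + truncated SVD.
# Toy data only; real event labels would come from discretizing
# F0 and energy contours aligned to the transcription.
from collections import Counter

import numpy as np

# Each item pairs a word token with the discrete acoustic event
# observed over its span (event IDs "e1".."e3" are hypothetical).
aligned_corpus = [("the", "e3"), ("cat", "e1"), ("sat", "e2"),
                  ("the", "e3"), ("mat", "e1")]

words = sorted({w for w, _ in aligned_corpus})
events = sorted({e for _, e in aligned_corpus})
w_index = {w: i for i, w in enumerate(words)}
e_index = {e: j for j, e in enumerate(events)}

# Co-occurrence matrix: rows are word types, columns acoustic events.
counts = np.zeros((len(words), len(events)))
for (w, e), c in Counter(aligned_corpus).items():
    counts[w_index[w], e_index[e]] = c

# Truncated SVD: keep the top-k left singular vectors, scaled by
# their singular values, as word vector representations.
k = 2
U, S, _ = np.linalg.svd(counts, full_matrices=False)
word_vectors = U[:, :k] * S[:k]

for w in words:
    print(w, word_vectors[w_index[w]])
```

Word types that co-occur with similar distributions of acoustic events receive similar vectors, which is the property the TTS acoustic model then exploits as an additional input feature.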
Cite as: Ribeiro, M.S., Watts, O., Yamagishi, J. (2017) Learning Word Vector Representations Based on Acoustic Counts. Proc. Interspeech 2017, 799-803, doi: 10.21437/Interspeech.2017-1340
@inproceedings{ribeiro17_interspeech,
  author={M. Sam Ribeiro and Oliver Watts and Junichi Yamagishi},
  title={{Learning Word Vector Representations Based on Acoustic Counts}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={799--803},
  doi={10.21437/Interspeech.2017-1340}
}