We study the problem of acoustic feature learning in the setting where a second (non-acoustic) modality is available during feature learning but not at test time. We use deep variational canonical correlation analysis (VCCA), a recently proposed deep generative method for multi-view representation learning. We also extend VCCA with improved latent variable priors and with adversarial learning. Compared to other techniques for multi-view feature learning, VCCA’s advantages include an intuitive latent variable interpretation and a variational lower bound objective that can be optimized end-to-end efficiently. We compare VCCA and its extensions with previous feature learning methods on the University of Wisconsin X-ray Microbeam Database, and show that VCCA-based feature learning improves over previous methods for speaker-independent phonetic recognition.
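To make the variational lower bound concrete, the following is a minimal NumPy sketch of a basic VCCA-style objective, not the authors' implementation. It assumes a diagonal Gaussian posterior q(z|x) computed from the acoustic view only, a standard normal prior, and unit-variance Gaussian likelihoods for both views (so each reconstruction term reduces to a negative squared error up to an additive constant). The function names, the toy linear encoder and decoders, and all dimensions are hypothetical; the improved priors and adversarial extensions are not shown.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def vcca_elbo(x, y, enc, dec_x, dec_y, rng, n_samples=1):
    """Per-example lower bound: E_q[log p(x|z) + log p(y|z)] - KL(q(z|x) || p(z)).

    Assumption: unit-variance Gaussian likelihoods, with additive constants dropped.
    """
    mu, logvar = enc(x)                    # the posterior depends on the acoustic view only
    recon = np.zeros(x.shape[0])
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * logvar) * eps            # reparameterized sample from q(z|x)
        recon += -0.5 * np.sum((x - dec_x(z)) ** 2, axis=-1)  # log p(x|z) + const
        recon += -0.5 * np.sum((y - dec_y(z)) ** 2, axis=-1)  # log p(y|z) + const
    return recon / n_samples - gaussian_kl(mu, logvar)

# Toy data and hypothetical linear maps, for illustration only.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))            # stand-in for acoustic features
y = rng.standard_normal((4, 3))            # stand-in for the second view
W_mu = 0.1 * rng.standard_normal((5, 2))
W_x = 0.1 * rng.standard_normal((2, 5))
W_y = 0.1 * rng.standard_normal((2, 3))

def enc(v):
    # Linear mean, fixed unit variance (logvar = 0).
    return v @ W_mu, np.zeros((v.shape[0], 2))

elbo = vcca_elbo(x, y, enc, lambda z: z @ W_x, lambda z: z @ W_y, rng)
features = enc(x)[0]                       # test-time features need only the acoustic view
```

Because the encoder takes only the acoustic input, the second view is needed when maximizing the bound but not when extracting features, which matches the setting described above.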
Cite as: Tang, Q., Wang, W., Livescu, K. (2017) Acoustic Feature Learning via Deep Variational Canonical Correlation Analysis. Proc. Interspeech 2017, 1656-1660, doi: 10.21437/Interspeech.2017-1581
@inproceedings{tang17_interspeech,
  author={Qingming Tang and Weiran Wang and Karen Livescu},
  title={{Acoustic Feature Learning via Deep Variational Canonical Correlation Analysis}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1656--1660},
  doi={10.21437/Interspeech.2017-1581}
}