Triphone State-Tying via Deep Canonical Correlation Analysis

Weiran Wang, Hao Tang, Karen Livescu


Context-dependent phone models are used in modern speech recognition systems to account for co-articulation effects. Because the number of possible context-dependent phones is vast, state-tying is typically used to reduce the number of target classes for acoustic modeling. We propose a novel approach to state-tying that is entirely data-dependent and requires no domain knowledge. Our method first learns low-dimensional embeddings of context-dependent phones using deep canonical correlation analysis. The learned embeddings capture similarity between triphones and are highly predictable from the acoustics. We then cluster the embeddings and use the cluster IDs as tied states. The bottleneck features of a DNN trained to predict the tied states achieve competitive recognition accuracy on TIMIT.
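
The clustering step described in the abstract can be illustrated with a minimal sketch. It assumes the embeddings have already been learned (the deep CCA training itself is not shown), and it uses plain k-means with a farthest-point initialization, which is an assumption: the abstract does not name the clustering algorithm. The triphone labels, the 2-D embedding values, and the function names are hypothetical and chosen only for illustration.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns one cluster ID per point."""
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sqdist(p, c) for c in centers))
        centers.append(list(far))

    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_assign = [
            min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points
        ]
        if new_assign == assign:
            break  # converged: assignments no longer change
        assign = new_assign
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Hypothetical triphone embeddings (made-up values, not from the paper).
embeddings = {
    "k-ae+t": (0.9, 1.1),
    "g-ae+t": (1.0, 1.0),
    "k-ae+d": (1.1, 0.9),
    "s-iy+n": (-1.0, -1.0),
    "z-iy+n": (-0.9, -1.1),
    "s-iy+m": (-1.1, -0.9),
}

labels = list(embeddings)
cluster_ids = kmeans([embeddings[t] for t in labels], k=2)

# Each cluster ID plays the role of a tied state shared by its triphones.
tied_state = dict(zip(labels, cluster_ids))
```

In this toy setting, triphones whose embeddings lie close together share a tied-state ID, and those IDs would then serve as the DNN's target classes.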


DOI: 10.21437/Interspeech.2016-1300

Cite as

Wang, W., Tang, H., Livescu, K. (2016) Triphone State-Tying via Deep Canonical Correlation Analysis. Proc. Interspeech 2016, 3444-3448.

BibTeX
@inproceedings{Wang+2016,
author={Weiran Wang and Hao Tang and Karen Livescu},
title={Triphone State-Tying via Deep Canonical Correlation Analysis},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1300},
url={http://dx.doi.org/10.21437/Interspeech.2016-1300},
pages={3444--3448}
}