ISCA Archive Interspeech 2017

Multiview Representation Learning via Deep CCA for Silent Speech Recognition

Myungjong Kim, Beiming Cao, Ted Mau, Jun Wang

Silent speech recognition (SSR) converts non-audio information such as articulatory (tongue and lip) movements to text. Articulatory movements generally carry less information than acoustic features for speech recognition, which may limit the performance of SSR. Multiview representation learning, which can learn better representations by analyzing multiple information sources simultaneously, has recently been used successfully in speech processing and acoustic speech recognition. However, it has rarely been used in SSR. In this paper, we investigate SSR based on multiview representation learning via canonical correlation analysis (CCA). When both acoustic and articulatory data are available during training, it is possible to effectively learn a representation of articulatory movements from the multiview data with CCA. To further represent the complex structure of the multiview data, we apply deep CCA, where the functional form of the feature mapping is a deep neural network. This approach was evaluated in a speaker-independent SSR task using a data set collected from seven English speakers with an electromagnetic articulograph (EMA). Experimental results showed that multiview representation learning via deep CCA outperformed both the CCA-based multiview approach and the baseline articulatory movement data on Gaussian mixture model-based and deep neural network-based SSR systems.
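
To make the linear CCA step described in the abstract concrete, the following is a minimal sketch of how a projection for the articulatory view might be learned from paired articulatory and acoustic features. It is not the authors' implementation: the function names (`fit_linear_cca`, `project`), the feature dimensions, the regularization constant, and the synthetic data are all assumptions chosen for illustration. Deep CCA would replace the linear maps with neural networks, which this sketch does not cover.

```python
import numpy as np


def fit_linear_cca(X, Y, k, reg=1e-3):
    """Fit linear CCA on two paired views.

    X: (n, dx) array, e.g. articulatory features (hypothetical layout).
    Y: (n, dy) array, e.g. acoustic features (hypothetical layout).
    Returns the mean of X, a (dx, k) projection for X, and the top-k
    canonical correlations.
    """
    n = X.shape[0]
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my

    # Regularized covariance and cross-covariance estimates.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)

    # Singular values of the whitened cross-covariance are the
    # canonical correlations; singular vectors give the projections.
    U, s, _ = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    A = Sxx_is @ U[:, :k]
    return mx, A, s[:k]


def project(X, mx, A):
    """Map one view into the learned k-dimensional space."""
    return (X - mx) @ A


# Synthetic stand-in for paired data: both views are noisy linear
# functions of a shared 2-dimensional latent signal (an assumption,
# not EMA or acoustic data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(2000, 6))
Y = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(2000, 10))

mx, A, corrs = fit_linear_cca(X, Y, k=2)

# Once fitted, only the first view is needed to compute the representation.
Z = project(X, mx, A)
```

In this sketch the second view is used only while fitting. The learned projection is then applied to the first view alone, which matches the setting in the abstract where acoustic data are available during training.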

doi: 10.21437/Interspeech.2017-952

Cite as: Kim, M., Cao, B., Mau, T., Wang, J. (2017) Multiview Representation Learning via Deep CCA for Silent Speech Recognition. Proc. Interspeech 2017, 2769-2773, doi: 10.21437/Interspeech.2017-952

  author={Myungjong Kim and Beiming Cao and Ted Mau and Jun Wang},
  title={{Multiview Representation Learning via Deep CCA for Silent Speech Recognition}},
  booktitle={Proc. Interspeech 2017},