ISCA Archive SSW 2021

Analysing Temporal Sensitivity of VQ-VAE Sub-Phone Codebooks

Jason Fong, Jennifer Williams, Simon King

We present an analysis of the temporal sensitivity of VQ-VAE sub-phone token sequences. Previous work has demonstrated that VQ-VAE systems learn a type of sub-phone representation, but a detailed examination of the representations themselves is still lacking. We address this gap by exploring linguistic unit reorganisation. Our experiments show that sub-phone codebook sequences are temporally correlated enough to identify the VQ codes that correspond to distinct linguistic units. We find that these VQ codes can be extracted and the linguistic units re-arranged in a meaningful way (i.e. changing the word order of a sentence). This work brings us one step closer to understanding how to modify pronunciations at a fine granularity, such as below the phone level.
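
The re-arrangement of linguistic units described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `reorder_segments`, the toy code indices, and the word boundaries are all hypothetical. The sketch assumes that each word has already been aligned to a contiguous span of frame-level VQ code indices.

```python
from typing import List, Sequence, Tuple


def reorder_segments(
    codes: Sequence[int],
    segments: Sequence[Tuple[int, int]],
    new_order: Sequence[int],
) -> List[int]:
    """Concatenate spans of a VQ code sequence in a new order.

    codes:     frame-level codebook indices for one utterance (toy data here).
    segments:  (start, end) frame spans, end exclusive, one per linguistic unit.
               These boundaries are assumed to come from an external alignment.
    new_order: a permutation of range(len(segments)) giving the target order.
    """
    if sorted(new_order) != list(range(len(segments))):
        raise ValueError("new_order must be a permutation of the segment indices")

    reordered: List[int] = []
    for idx in new_order:
        start, end = segments[idx]
        # Copy the whole span so the code order inside each unit is preserved.
        reordered.extend(codes[start:end])
    return reordered


# Toy utterance of four "words", each mapped to a run of code indices.
codes = [3, 3, 7, 7, 7, 1, 1, 4, 4, 4]
segments = [(0, 2), (2, 5), (5, 7), (7, 10)]

# Move the last two words to the front.
reordered = reorder_segments(codes, segments, [2, 3, 0, 1])
print(reordered)
```

In this sketch the re-ordered sequence would then be passed to a decoder to synthesise speech; that step depends on the model and is omitted here.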


doi: 10.21437/SSW.2021-40

Cite as: Fong, J., Williams, J., King, S. (2021) Analysing Temporal Sensitivity of VQ-VAE Sub-Phone Codebooks. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 227-231, doi: 10.21437/SSW.2021-40

@inproceedings{fong21b_ssw,
  author={Jason Fong and Jennifer Williams and Simon King},
  title={{Analysing Temporal Sensitivity of VQ-VAE Sub-Phone Codebooks}},
  year=2021,
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},
  pages={227--231},
  doi={10.21437/SSW.2021-40}
}