Generative Acoustic-Phonemic-Speaker Model Based on Three-Way Restricted Boltzmann Machine

Toru Nakashika, Yasuhiro Minami


In this paper, we discuss a way of modeling speech signals based on a three-way restricted Boltzmann machine (3WRBM) that automatically separates phonetic-related information and speaker-related information in an observed signal. The proposed model is an energy-based probabilistic model that includes three-way potentials over three variables: acoustic features, latent phonetic features, and speaker-identity features. We train the model so that it automatically captures the undirected relationships among the three variables. Once trained, the model can be applied to many tasks in speech signal processing. For example, given a speech signal, estimating the speaker-identity features is equivalent to speaker recognition, while the estimated latent phonetic features may be helpful for speech recognition because they contain more phonetic-related information than the acoustic features. Since the model is generative, we can also apply it to voice conversion: we first estimate the phonetic features from the source speaker's acoustic features, and then estimate acoustic features from those phonetic features together with the desired speaker-identity features. In our experiments, we evaluate the effectiveness of this speech modeling on three tasks: speaker recognition, speech (continuous phone) recognition, and voice conversion.
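
To make the three uses of such a model concrete, the following is a minimal NumPy sketch of a generic three-way RBM. It is not the paper's parameterization, which the abstract does not give. It assumes real-valued acoustic features with unit variance, binary latent phonetic units, a one-hot speaker vector, a single weight tensor `W`, and a uniform speaker prior. All names, sizes, and the untrained random parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, n_spk = 4, 3, 2  # toy sizes, chosen arbitrarily

# Hypothetical, untrained parameters (assumptions, not from the paper).
# Assumed energy: E(v,h,s) = 0.5*||v-b||^2 - c.h - sum_ijk W_ijk v_i h_j s_k
W = 0.1 * rng.standard_normal((n_vis, n_hid, n_spk))
b = np.zeros(n_vis)  # bias of the acoustic (visible) units
c = np.zeros(n_hid)  # bias of the latent phonetic (hidden) units

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(v, s):
    """p(h_j = 1 | v, s): latent phonetic features given acoustics and speaker."""
    return sigmoid(c + np.einsum("i,ijk,k->j", v, W, s))

def visible_mean(h, s):
    """E[v | h, s]: expected acoustic features given phonetics and speaker."""
    return b + np.einsum("ijk,j,k->i", W, h, s)

def speaker_posterior(v):
    """p(s = k | v) with h summed out, assuming a uniform speaker prior."""
    act = c[:, None] + np.einsum("i,ijk->jk", v, W)  # shape (n_hid, n_spk)
    # Sum of softplus terms; the 0.5*||v-b||^2 term is shared by all speakers.
    log_p = np.logaddexp(0.0, act).sum(axis=0)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

def convert(v_src, s_src, s_tgt):
    """Voice conversion: infer phonetics with the source speaker, decode with the target."""
    h = hidden_probs(v_src, s_src)
    return visible_mean(h, s_tgt)
```

Under these assumptions, `speaker_posterior` corresponds to speaker recognition, `hidden_probs` yields the features that could feed a phone recognizer, and `convert` follows the two-step voice conversion procedure described above.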


DOI: 10.21437/Interspeech.2016-1105

Cite as

Nakashika, T., Minami, Y. (2016) Generative Acoustic-Phonemic-Speaker Model Based on Three-Way Restricted Boltzmann Machine. Proc. Interspeech 2016, 1487-1491.

Bibtex
@inproceedings{Nakashika+2016,
author={Toru Nakashika and Yasuhiro Minami},
title={Generative Acoustic-Phonemic-Speaker Model Based on Three-Way Restricted Boltzmann Machine},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1105},
url={http://dx.doi.org/10.21437/Interspeech.2016-1105},
pages={1487--1491}
}