Subword tokenization based on DNN-based acoustic model for end-to-end prosody generation

Masashi Aso, Shinnosuke Takamichi, Norihiro Takamune, Hiroshi Saruwatari


This paper presents a method for determining subword units for end-to-end prosody generation. End-to-end prosody generation using deep neural networks (DNNs) is expected to generate a prosody sequence directly from text, without requiring expert knowledge of the target language. In natural language processing, a language-independent subword tokenization method based on language models was previously proposed for determining subwords suitable for end-to-end language processing. However, the subwords determined by language models are not appropriate for end-to-end speech processing. In this paper, we propose a language-independent algorithm for determining subwords that maximize acoustic model likelihoods. The proposed algorithm iterates between expectation-maximization (EM)-based training of DNN acoustic models and likelihood-based construction of the subword vocabulary. In the experimental evaluation, we discuss the stability of the EM-based training and analyze the subword vocabularies determined by the conventional language model-based method and the proposed acoustic model-based method.
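To make the idea of likelihood-based subword selection concrete, the following is a minimal sketch of choosing the segmentation that maximizes a sum of per-subword log-scores by dynamic programming. It is not the authors' implementation: the function name `best_segmentation`, the `max_len` limit, and the `toy_scores` table with its made-up numbers are all hypothetical. In the proposed method the scores would come from a trained DNN acoustic model rather than from a fixed lookup table.

```python
import math


def best_segmentation(text, scores, max_len=4):
    """Return (pieces, total_score) for the highest-scoring segmentation.

    `scores` maps each candidate subword to a log-likelihood-like score.
    Here it is a fixed table (an assumption made for illustration only).
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last subword
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in scores and best[start] + scores[piece] > best[end]:
                best[end] = best[start] + scores[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the string to recover pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]


# Hypothetical scores, chosen only to show the mechanics.
toy_scores = {"a": -2.0, "b": -2.0, "c": -2.0, "ab": -2.5, "bc": -3.5, "abc": -5.0}
print(best_segmentation("abc", toy_scores))  # (['ab', 'c'], -4.5)
```

In an iterative scheme of the kind the abstract describes, a step like this would be repeated after each round of model training, with the scores re-estimated each time.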


DOI: 10.21437/SSW.2019-42

Cite as: Aso, M., Takamichi, S., Takamune, N., Saruwatari, H. (2019) Subword tokenization based on DNN-based acoustic model for end-to-end prosody generation. Proc. 10th ISCA Speech Synthesis Workshop, 234-238, DOI: 10.21437/SSW.2019-42.


@inproceedings{Aso2019,
  author={Masashi Aso and Shinnosuke Takamichi and Norihiro Takamune and Hiroshi Saruwatari},
  title={{Subword tokenization based on DNN-based acoustic model for end-to-end prosody generation}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={234--238},
  doi={10.21437/SSW.2019-42},
  url={http://dx.doi.org/10.21437/SSW.2019-42}
}