Sequence-to-sequence Neural Network Model with 2D Attention for Learning Japanese Pitch Accents

Antoine Bruguier, Heiga Zen, Arkady Arkhangorodsky


Many Japanese text-to-speech (TTS) systems use word-level pitch accents as one of their prosodic features. Pitch accents are often predicted from text by combining a pronunciation dictionary that includes lexical pitch accents with a statistical model of word accent sandhi. However, using human transcribers to build the dictionary and the training data for the model is tedious and expensive. This paper proposes a neural pitch accent recognition model. The model combines information from audio and its transcription (a word sequence in hiragana characters) via two-dimensional attention and outputs word-level pitch accents. Experimental results show a reduction in the word pitch accent prediction error rate compared with using text alone. The model thus reduces the workload of human annotators when building a pronunciation dictionary. Because the approach is general, it can also be applied to pronunciation learning in other languages.


DOI: 10.21437/Interspeech.2018-1381

Cite as: Bruguier, A., Zen, H., Arkhangorodsky, A. (2018) Sequence-to-sequence Neural Network Model with 2D Attention for Learning Japanese Pitch Accents. Proc. Interspeech 2018, 1284-1287, DOI: 10.21437/Interspeech.2018-1381.


@inproceedings{Bruguier2018,
  author={Antoine Bruguier and Heiga Zen and Arkady Arkhangorodsky},
  title={Sequence-to-sequence Neural Network Model with 2D Attention for Learning Japanese Pitch Accents},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1284--1287},
  doi={10.21437/Interspeech.2018-1381},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1381}
}