ToneNet: A CNN Model of Tone Classification of Mandarin Chinese

Qiang Gao, Shutao Sun, Yaping Yang


In Mandarin Chinese, correct pronunciation is key to conveying word meaning, and correct pronunciation depends closely on tone. Tone classification is therefore a critical part of a speech evaluation system. Traditional tone classification is based on F0 and energy or on MFCCs, but the extraction of these features is often affected by noise and other uncontrollable environmental factors. To reduce the influence of the environment, we designed a CNN named ToneNet, which takes the mel-spectrogram as its input feature and uses a custom convolutional neural network and a multi-layer perceptron to classify Chinese syllables into one of the four tones. We trained and tested ToneNet on the Syllable Corpus of Standard Chinese Dataset (SCSC). The results show that the best accuracy and F1-score of our method reach 99.16% and 99.11%, respectively. In addition, ToneNet achieves 97.07% accuracy and a 96.83% F1-score in the presence of Gaussian noise.
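As a reading aid for the two reported metrics, the following minimal sketch shows how accuracy and an F1-score could be computed for four-class tone labels. It is not the authors' evaluation code. The abstract does not state which F1 averaging was used, so macro averaging is an assumption here, and the function names and label lists are hypothetical, made up for illustration only.

```python
from collections import Counter

TONES = (1, 2, 3, 4)  # the four lexical tones of Mandarin


def accuracy(y_true, y_pred):
    """Fraction of syllables whose predicted tone matches the reference."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=TONES):
    """Unweighted mean of per-tone F1 (macro averaging is an assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for tone t
        else:
            fp[p] += 1      # tone p was predicted but wrong
            fn[t] += 1      # tone t was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the tone never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only; not data from the paper.
reference = [1, 2, 3, 4, 1, 2, 3, 4]
predicted = [1, 2, 3, 4, 1, 3, 3, 4]
acc = accuracy(reference, predicted)
f1 = macro_f1(reference, predicted)
```

With macro averaging each tone contributes equally, so a tone that is classified poorly lowers the score even when it is rare in the test set. This is one reason an F1-score can sit slightly below accuracy, as it does in the figures reported above.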


DOI: 10.21437/Interspeech.2019-1483

Cite as: Gao, Q., Sun, S., Yang, Y. (2019) ToneNet: A CNN Model of Tone Classification of Mandarin Chinese. Proc. Interspeech 2019, 3367-3371, DOI: 10.21437/Interspeech.2019-1483.


@inproceedings{Gao2019,
  author={Qiang Gao and Shutao Sun and Yaping Yang},
  title={{ToneNet: A CNN Model of Tone Classification of Mandarin Chinese}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3367--3371},
  doi={10.21437/Interspeech.2019-1483},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1483}
}