Highly Accurate Mandarin Tone Classification In The Absence of Pitch Information

Neville Ryant, Malcolm Slaney, Mark Liberman, Elizabeth Shriberg, Jiahong Yuan


A deep neural network (DNN) classifier based only on 40 mel-frequency cepstral coefficients (MFCCs) achieved 29.99% frame error rate (FER) and 16.86% segment error rate (SER) in recognizing five tonal categories in Mandarin Chinese broadcast news. With the addition of sub-band autocorrelation change detection (SACD) pitch-class features, the classifier scored 27.58% FER and 15.56% SER. These results are substantially better than the best previously reported results on broadcast-news tone classification, and are also better than a human listener achieved in categorizing test stimuli created by amplitude- and frequency-modulating complex tones to match the extracted F0 and amplitude parameters. The same DNN architecture scored substantially worse when trained and tested with SACD pitch-class parameters alone: 39.22% FER and 24.89% SER. RAPT F0 estimates are worse yet: 44.37% FER and 27.28% SER. The 40 MFCC parameters do not encode F0 in any obvious way and attempts to predict SACD or other pitch features from them work badly. These surprising results raise difficult questions for theories of Chinese tone.
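
The abstract reports two metrics, FER and SER, that are scored at different granularities. The following is a minimal sketch of how the two could be computed from per-frame predictions. It is not the authors' scoring code: the function names, the integer tone codes, the `(start, end, tone)` segment layout, and in particular the majority vote used to derive a segment label from frame labels are all assumptions made for illustration.

```python
from collections import Counter


def frame_error_rate(ref_frames, hyp_frames):
    """Fraction of frames whose predicted tone label differs from the reference."""
    if len(ref_frames) != len(hyp_frames):
        raise ValueError("reference and hypothesis must have the same length")
    if not ref_frames:
        raise ValueError("no frames to score")
    errors = sum(r != h for r, h in zip(ref_frames, hyp_frames))
    return errors / len(ref_frames)


def segment_error_rate(segments, hyp_frames):
    """Fraction of segments whose predicted tone differs from the reference.

    segments: list of (start, end, ref_tone) with `end` exclusive.
    Assumption: a segment's predicted tone is the majority vote over its
    frame-level predictions; the paper may combine frame outputs differently.
    """
    if not segments:
        raise ValueError("no segments to score")
    errors = 0
    for start, end, ref_tone in segments:
        votes = Counter(hyp_frames[start:end])
        hyp_tone = votes.most_common(1)[0][0]
        errors += hyp_tone != ref_tone
    return errors / len(segments)


# Hypothetical toy data: three segments over ten frames.
# The integer codes are arbitrary illustrative tone labels.
segments = [(0, 4, 1), (4, 8, 4), (8, 10, 0)]
ref_frames = [1, 1, 1, 1, 4, 4, 4, 4, 0, 0]
hyp_frames = [1, 1, 2, 1, 4, 4, 4, 3, 2, 2]

fer = frame_error_rate(ref_frames, hyp_frames)
ser = segment_error_rate(segments, hyp_frames)
print(f"FER = {fer:.2%}, SER = {ser:.2%}")
```

In this toy case isolated frame errors inside the first two segments are outvoted by their neighbors, so the segment-level rate comes out lower than the frame-level rate, the same direction as the gap between the FER and SER figures reported above.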


DOI: 10.21437/SpeechProsody.2014-122

Cite as: Ryant, N., Slaney, M., Liberman, M., Shriberg, E., Yuan, J. (2014) Highly Accurate Mandarin Tone Classification In The Absence of Pitch Information. Proc. 7th International Conference on Speech Prosody 2014, 673-677, DOI: 10.21437/SpeechProsody.2014-122.


@inproceedings{Ryant2014,
  author={Neville Ryant and Malcolm Slaney and Mark Liberman and Elizabeth Shriberg and Jiahong Yuan},
  title={{Highly Accurate Mandarin Tone Classification In The Absence of Pitch Information}},
  year=2014,
  booktitle={Proc. 7th International Conference on Speech Prosody 2014},
  pages={673--677},
  doi={10.21437/SpeechProsody.2014-122},
  url={http://dx.doi.org/10.21437/SpeechProsody.2014-122}
}