Direct F0 Estimation with Neural-Network-Based Regression

Shuzhuang Xu, Hiroshi Shimodaira


Pitch tracking, the continuous extraction of fundamental frequency (F0) from speech waveforms, is vital to many applications in speech analysis and synthesis. Speech pitch tracking has been investigated extensively, both with conventional trackers such as Praat, RAPT and YIN and with more recently proposed neural-network-based ones such as DNN-CLS, CREPE and RNN-REG. This work develops a different end-to-end regression model based on neural networks, in which a voice detector and a newly proposed value estimator work jointly to track the trajectory of fundamental frequency. Experiments on the PTDB-TUG corpus showed that the system surpasses canonical neural networks in terms of gross error rate. It further outperformed conventional trackers under clean conditions and neural-network classifiers under noisy conditions, using noise from the NOISEX-92 corpus.
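
For readers unfamiliar with the gross error rate named in the abstract, the following is a minimal sketch of how such a metric is commonly computed. It is not taken from the paper: the function name, the convention that unvoiced frames are marked with an F0 of 0, and the 20% tolerance are all assumptions made for illustration, and the paper's exact definition may differ.

```python
def gross_error_rate(ref_f0, est_f0, tolerance=0.2):
    """Fraction of jointly voiced frames whose estimated F0 is a gross error.

    Assumptions (not from the paper): both inputs are per-frame F0 values
    in Hz, an unvoiced frame is encoded as 0, and an estimate counts as a
    gross error when it deviates from the reference by more than
    `tolerance` (20% by default) of the reference value.
    """
    if len(ref_f0) != len(est_f0):
        raise ValueError("reference and estimate must have the same number of frames")

    # Keep only frames that both the reference and the estimate mark as voiced.
    voiced = [(r, e) for r, e in zip(ref_f0, est_f0) if r > 0 and e > 0]
    if not voiced:
        return 0.0

    # Count frames whose relative deviation exceeds the tolerance.
    errors = sum(1 for r, e in voiced if abs(e - r) > tolerance * r)
    return errors / len(voiced)


if __name__ == "__main__":
    reference = [0.0, 100.0, 200.0, 150.0, 0.0]
    estimate = [0.0, 105.0, 260.0, 0.0, 120.0]
    print(gross_error_rate(reference, estimate))
```

Frames on which the reference and the estimate disagree about voicing are excluded here; in evaluations of this kind they are usually scored separately as voicing decision errors.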


DOI: 10.21437/Interspeech.2019-3267

Cite as: Xu, S., Shimodaira, H. (2019) Direct F0 Estimation with Neural-Network-Based Regression. Proc. Interspeech 2019, 1995-1999, DOI: 10.21437/Interspeech.2019-3267.


@inproceedings{Xu2019,
  author={Shuzhuang Xu and Hiroshi Shimodaira},
  title={{Direct F0 Estimation with Neural-Network-Based Regression}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1995--1999},
  doi={10.21437/Interspeech.2019-3267},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3267}
}