Vocal Pitch Extraction in Polyphonic Music Using Convolutional Residual Network

Mingye Dong, Jie Wu, Jian Luan


Pitch extraction, also known as fundamental frequency estimation, is a long-standing task in audio signal processing. Vocal pitch extraction in polyphonic music is especially challenging due to the presence of accompaniment. So far, most deep learning approaches use the log mel spectrogram as input, which neglects phase information. Shallow networks have also been applied directly to the waveform, but they may not handle accompaniment-contaminated vocal data well. In this paper, a deep convolutional residual network is proposed that automatically analyzes and extracts effective features from the waveform. Residual learning reduces model degradation through its skip connections and residual mapping. Compared with previously reported results, the proposed approach improves overall accuracy (OA) by 5% and raw pitch accuracy (RPA) by 4%.
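The residual learning mentioned in the abstract computes y = x + F(x): a skip connection carries the input around a learned residual mapping F. Below is a minimal numpy sketch of one such block operating on a 1-D waveform; the kernel sizes, two-convolution structure, and function names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1-D convolution (cross-correlation) of signal x with kernel w."""
    pad = len(w) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(w)], w) for i in range(len(x))])

def residual_block(x, w1, w2):
    """y = x + F(x): the identity skip connection adds the input back to the
    residual mapping F (here two convolutions with a ReLU in between)."""
    h = np.maximum(conv1d(x, w1), 0.0)  # first conv + ReLU
    return x + conv1d(h, w2)            # residual mapping plus identity skip

# When the residual mapping F is zero (e.g. zero-initialized weights),
# the block reduces to the identity, which is why stacking many such
# blocks degrades less easily than a plain deep network.
x = np.random.randn(16)
y = residual_block(x, np.zeros(3), np.zeros(3))
```

Because the skip path is an identity, gradients also flow through it unattenuated, which is the usual argument for why residual networks of this kind train reliably at depth.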


DOI: 10.21437/Interspeech.2019-2286

Cite as: Dong, M., Wu, J., Luan, J. (2019) Vocal Pitch Extraction in Polyphonic Music Using Convolutional Residual Network. Proc. Interspeech 2019, 2010-2014, DOI: 10.21437/Interspeech.2019-2286.


@inproceedings{Dong2019,
  author={Mingye Dong and Jie Wu and Jian Luan},
  title={{Vocal Pitch Extraction in Polyphonic Music Using Convolutional Residual Network}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2010--2014},
  doi={10.21437/Interspeech.2019-2286},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2286}
}