Formant Estimation and Tracking Using Deep Learning

Yehoshua Dissen, Joseph Keshet


Formant frequency estimation and tracking are among the most fundamental problems in speech processing. In the former task the input is a stationary speech segment, such as the middle part of a vowel, and the goal is to estimate the formant frequencies; in the latter task the input is a series of speech frames and the goal is to track the trajectory of the formant frequencies throughout the signal. Traditionally, formant estimation and tracking are done using ad hoc signal processing methods. In this paper we propose using machine learning techniques trained on an annotated corpus of read speech for these tasks. Our feature set is composed of LPC-based cepstral coefficients with a range of model orders and pitch-synchronous cepstral coefficients. Two deep network architectures are used as learning algorithms: a deep feed-forward network for the estimation task and a recurrent neural network for the tracking task. The performance of our methods compares favorably with mainstream LPC-based implementations and state-of-the-art tracking algorithms.
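The LPC-based cepstral features mentioned above can be illustrated with a minimal sketch: estimate LPC coefficients for a frame via the autocorrelation method (Levinson-Durbin recursion), then convert them to cepstral coefficients with the standard LPC-to-cepstrum recursion. The functions and the synthetic formant-like signal below are illustrative assumptions, not the paper's actual implementation (which uses a range of model orders and pitch-synchronous analysis).

```python
import numpy as np

def lpc_autocorr(x, order):
    """Estimate LPC coefficients a[1..p] of the model
    x[n] ~ sum_k a[k] * x[n-k] via autocorrelation + Levinson-Durbin."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Reflection coefficient for stage i+1.
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        new_a = a.copy()
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)
    return a

def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients a[1..p] to cepstral coefficients c[1..n]
    using the recursion c_n = a_n + sum_{k<n} (k/n) * c_k * a_{n-k}."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        val = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                val += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = val
    return c

# Demo: a second-order resonance (a crude single "formant") driven by noise.
rng = np.random.default_rng(0)
x = np.zeros(8000)
e = rng.standard_normal(8000)
for n in range(2, len(x)):
    x[n] = 1.6 * x[n - 1] - 0.9 * x[n - 2] + e[n]

ceps = lpc_to_cepstrum(lpc_autocorr(x, 12), 12)
print(ceps)
```

In the paper's setting, such coefficients would be computed per frame (and for several LPC model orders) to form the input features for the deep networks.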


DOI: 10.21437/Interspeech.2016-490

Cite as

Dissen, Y., Keshet, J. (2016) Formant Estimation and Tracking Using Deep Learning. Proc. Interspeech 2016, 958-962.

BibTeX
@inproceedings{Dissen+2016,
  author={Yehoshua Dissen and Joseph Keshet},
  title={Formant Estimation and Tracking Using Deep Learning},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-490},
  url={http://dx.doi.org/10.21437/Interspeech.2016-490},
  pages={958--962}
}