A Comparison of Input Types to a Deep Neural Network-based Forced Aligner

Matthew C. Kelley, Benjamin V. Tucker


The present paper investigates the effect of different inputs on the accuracy of a forced alignment tool built using deep neural networks. Raw audio samples and Mel-frequency cepstral coefficients were compared as network inputs. A set of experiments was performed using the TIMIT speech corpus as training data and its accompanying test data set. The networks consisted of a series of convolutional layers followed by a series of bidirectional long short-term memory (LSTM) layers. The convolutional layers were trained first to act as feature detectors, after which their weights were frozen. Then, the LSTM layers were trained to learn the temporal relations in the data. The current results indicate that networks using raw audio perform better than both the networks using Mel-frequency cepstral coefficients and an off-the-shelf forced aligner. Possible explanations for why the raw audio networks perform better are discussed. We then lay out potential ways to improve the results of the networks and conclude with a comparison of human cognition to network architecture.
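
The architecture and two-stage training described in the abstract can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the class name `ConvBiLSTMAligner`, the helper `freeze_conv`, the layer counts, kernel sizes, channel widths, hidden size, and the 61-class output are all assumptions chosen for illustration. The sketch assumes the convolutional layers have already been trained in a first stage and shows only the freezing step that precedes LSTM training.

```python
import torch
from torch import nn


class ConvBiLSTMAligner(nn.Module):
    """Hypothetical conv front end followed by bidirectional LSTM layers."""

    def __init__(self, in_channels=1, conv_channels=32, hidden=64, n_classes=61):
        super().__init__()
        # Convolutional layers act as feature detectors along the time axis.
        # in_channels=1 corresponds to raw audio; an MFCC input would instead
        # use one channel per coefficient (an assumption of this sketch).
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Bidirectional LSTM layers model temporal relations in the features.
        self.lstm = nn.LSTM(conv_channels, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # Per-time-step label scores (n_classes is an assumed label count).
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, in_channels, time)
        feats = self.conv(x)            # (batch, conv_channels, time)
        feats = feats.transpose(1, 2)   # (batch, time, conv_channels)
        seq, _ = self.lstm(feats)       # (batch, time, 2 * hidden)
        return self.out(seq)            # (batch, time, n_classes)


def freeze_conv(model):
    """Freeze the convolutional weights so later training leaves them fixed."""
    for param in model.conv.parameters():
        param.requires_grad = False


model = ConvBiLSTMAligner()
freeze_conv(model)  # stage two: only the LSTM and output layers remain trainable
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable)

# Forward pass on random stand-in input: 2 clips, 1 channel, 400 samples.
logits = model(torch.randn(2, 1, 400))
```

Passing only the parameters that still require gradients to the optimizer keeps the frozen convolutional weights out of the update step entirely, which is one common way to realize the "train, then freeze" procedure.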


DOI: 10.21437/Interspeech.2018-1115

Cite as: Kelley, M.C., Tucker, B.V. (2018) A Comparison of Input Types to a Deep Neural Network-based Forced Aligner. Proc. Interspeech 2018, 1205-1209, DOI: 10.21437/Interspeech.2018-1115.


@inproceedings{Kelley2018,
  author={Matthew C. Kelley and Benjamin V. Tucker},
  title={A Comparison of Input Types to a Deep Neural Network-based Forced Aligner},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1205--1209},
  doi={10.21437/Interspeech.2018-1115},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1115}
}