FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing
The application of neural network and convolutional neural network (CNN) architectures is explored for the tasks of voicing classification (classifying frames as non-speech, unvoiced, or voiced) and voice activity detection (VAD) of visual speech. Experiments are conducted for both speaker-dependent and speaker-independent scenarios.
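The two tasks are closely related: VAD can be viewed as a binary simplification of the three-class voicing problem, in which the unvoiced and voiced classes are collapsed into a single speech class. A minimal sketch of this label mapping (an illustration, not code from the paper):

```python
# Hypothetical illustration of the two labelling schemes:
# voicing classification uses three classes, VAD collapses
# the two speech classes (unvoiced, voiced) into one.
NON_SPEECH, UNVOICED, VOICED = 0, 1, 2

def vad_label(voicing_label: int) -> int:
    """Map a 3-class voicing label to a binary VAD label (1 = speech)."""
    return 0 if voicing_label == NON_SPEECH else 1

frames = [NON_SPEECH, UNVOICED, VOICED, VOICED, NON_SPEECH]
print([vad_label(f) for f in frames])  # -> [0, 1, 1, 1, 0]
```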
A Gaussian mixture model (GMM) baseline system is developed using standard image-based two-dimensional discrete cosine transform (2D-DCT) visual speech features, achieving speaker-dependent accuracies of 79% and 94% for voicing classification and VAD, respectively. Additionally, a single-layer neural network system trained using the same visual features achieves accuracies of 86% and 97%. A novel technique using convolutional neural networks for visual speech feature extraction and classification is presented. The voicing classification and VAD results using this system are further improved to 88% and 98%, respectively.
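A minimal sketch of image-based 2D-DCT feature extraction of the kind used by the baseline system, assuming a mouth-region image and keeping a small block of low-frequency coefficients (the block size and ROI size here are assumptions, not the paper's configuration):

```python
import numpy as np
from scipy.fftpack import dct

def dct2(image: np.ndarray) -> np.ndarray:
    """Orthonormal 2D-DCT: apply the 1D DCT along rows, then columns."""
    return dct(dct(image, axis=0, norm='ortho'), axis=1, norm='ortho')

def dct_features(mouth_roi: np.ndarray, n: int = 6) -> np.ndarray:
    """Keep the top-left n x n block of low-frequency coefficients
    as a feature vector (n = 6 is an assumed value for illustration)."""
    coeffs = dct2(mouth_roi.astype(float))
    return coeffs[:n, :n].flatten()

roi = np.random.rand(32, 32)   # stand-in for a grayscale mouth-region image
feat = dct_features(roi)
print(feat.shape)              # -> (36,)
```

The low-frequency coefficients capture the coarse shape and appearance of the mouth region, which is why a small block suffices as a compact visual speech feature.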
The speaker-independent results show the neural network system to outperform both the GMM and CNN systems, achieving accuracies of 63% for voicing classification and 79% for voice activity detection.
Index Terms: convolutional neural networks, voicing classification, visual speech
Bibliographic reference. Le Cornu, Thomas / Milner, Ben (2015): "Voicing classification of visual speech using convolutional neural networks", In FAAVSP-2015, 103-108.