FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and
Auditory-Visual Speech Processing

Vienna, Austria
September 11-13, 2015

Voicing Classification of Visual Speech Using Convolutional Neural Networks

Thomas Le Cornu, Ben Milner

University of East Anglia, Norwich, UK

The application of neural network and convolutional neural network (CNN) architectures is explored for the tasks of voicing classification (classifying frames as non-speech, unvoiced, or voiced) and voice activity detection (VAD) of visual speech. Experiments are conducted for both speaker-dependent and speaker-independent scenarios.
   A Gaussian mixture model (GMM) baseline system is developed using standard image-based two-dimensional discrete cosine transform (2D-DCT) visual speech features, achieving speaker-dependent accuracies of 79% and 94% for voicing classification and VAD respectively. Additionally, a single-layer neural network system trained on the same visual features achieves accuracies of 86% and 97%. A novel technique using convolutional neural networks for visual speech feature extraction and classification is presented, which further improves the voicing classification and VAD results to 88% and 98% respectively.
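   To make the baseline front-end concrete, below is a minimal sketch of 2D-DCT feature extraction in Python (SciPy), assuming a 32x32 grayscale mouth-region crop and zig-zag selection of 44 low-frequency coefficients; the crop size and coefficient count are illustrative assumptions, as the abstract does not specify them.

    import numpy as np
    from scipy.fftpack import dct

    def dct2(image):
        # Orthonormal 2-D DCT-II: apply the 1-D DCT along rows, then columns.
        return dct(dct(image, axis=0, norm='ortho'), axis=1, norm='ortho')

    def zigzag_select(coeffs, n):
        # Keep the n lowest-frequency coefficients in zig-zag order,
        # the usual way a compact 2D-DCT feature vector is formed.
        h, w = coeffs.shape
        order = sorted(((i, j) for i in range(h) for j in range(w)),
                       key=lambda ij: (ij[0] + ij[1],
                                       ij[0] if (ij[0] + ij[1]) % 2 else ij[1]))
        return np.array([coeffs[i, j] for i, j in order[:n]])

    # Usage: one grayscale mouth-region frame -> a 44-dimensional feature vector.
    mouth = np.random.rand(32, 32)           # stand-in for a cropped lip image
    features = zigzag_select(dct2(mouth), 44)

   A feature vector of this kind would serve as the per-frame input to the GMM and single-layer neural network classifiers described above.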
   The speaker-independent results show the neural network system to outperform both the GMM and CNN systems, achieving accuracies of 63% for voicing classification and 79% for voice activity detection.

Index Terms: convolutional neural networks, voicing classification, visual speech
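For the CNN system, the following is a minimal PyTorch sketch of a frame-level voicing classifier of the general kind the abstract describes; the framework, layer sizes, and 32x32 input resolution are assumptions for illustration, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class VoicingCNN(nn.Module):
        # Maps a grayscale mouth-region frame to logits over the three
        # voicing classes: non-speech, unvoiced, voiced.
        def __init__(self, num_classes=3):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(8, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            )
            # 32x32 input -> 28x28 -> 14x14 -> 10x10 -> 5x5 feature maps
            self.classifier = nn.Linear(16 * 5 * 5, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    # Usage: a batch of four 32x32 frames -> per-frame class logits.
    logits = VoicingCNN()(torch.randn(4, 1, 32, 32))
    assert logits.shape == (4, 3)

Collapsing the unvoiced and voiced outputs into a single speech class yields the two-class speech/non-speech decision used for VAD.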


Bibliographic reference. Le Cornu, Thomas / Milner, Ben (2015): "Voicing classification of visual speech using convolutional neural networks", In FAAVSP-2015, 103-108.