Although there have been some promising results in computer lipreading,
there has been a paucity of data on which to train automatic systems.
However, the recent emergence of the TCD-TIMIT corpus, with around 6000
words, 59 speakers and seven hours of recorded audio-visual speech,
allows the deployment of more recent techniques from audio speech recognition,
such as Deep Neural Networks (DNNs) and sequence discriminative training.
In this paper we combine a DNN with a Hidden Markov Model (HMM)
in the so-called hybrid DNN-HMM configuration, which we train using
a variety of sequence discriminative training methods. Decoding is
then performed with a weighted finite-state transducer. The conclusion is
that the DNN offers very substantial improvement over a conventional
classifier which uses a Gaussian Mixture Model (GMM) to model the densities
even when optimised with Speaker Adaptive Training. Sequence discriminative
training offers further improvements, depending on the precise variant
employed, of the order of 10% in word accuracy. Taken together, these
results imply that lipreading is moving from a topic of rather esoteric
interest to a practical reality in the foreseeable future.
Cite as: Thangthai, K., Harvey, R. (2017) Improving Computer Lipreading via DNN Sequence Discriminative Training Techniques. Proc. Interspeech 2017, 3657-3661, doi: 10.21437/Interspeech.2017-106
@inproceedings{thangthai17_interspeech,
  author={Kwanchiva Thangthai and Richard Harvey},
  title={{Improving Computer Lipreading via DNN Sequence Discriminative Training Techniques}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3657--3661},
  doi={10.21437/Interspeech.2017-106}
}