Robustness of speech recognition can be significantly improved by multi-stream and especially audiovisual speech recognition, which is of interest e.g. for human-machine interaction in noisy and reverberant environments. The most robust implementations of audiovisual speech recognition often utilize Coupled Hidden Markov Models (CHMMs), which allow the two modalities to be asynchronous to a certain degree. Compared to conventional speech recognition, this enlarges the search space significantly, so current implementations of CHMM systems are often not real-time capable. Thus, in order to obtain responsive multi-modal interfaces, it is vital to fully exploit the processing capabilities of current hardware. This paper describes how general-purpose graphics processors can be used to obtain a real-time implementation of audiovisual and multi-stream speech recognition. The design has been integrated both with a WFST decoder and a token-passing system, yielding maximum speedup factors of 32 and 25, respectively.
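The enlarged search space mentioned above arises because a CHMM decodes over composite states, i.e. pairs of an audio state and a video state, with asynchrony between the streams bounded by a window. The following sketch is purely illustrative and not the paper's implementation: it shows a naive forward pass over such a product state space, with all function and parameter names (e.g. `chmm_forward`, `max_async`) being assumptions for this example.

```python
import numpy as np

def chmm_forward(obs_a, obs_v, trans_a, trans_v, like_a, like_v, max_async=1):
    """Naive forward pass over the product state space of a coupled HMM.

    A composite state is a pair (i, j) of audio and video states; the
    streams may drift apart by at most |i - j| <= max_async states.
    Decoding starts in composite state (0, 0). Observations are discrete
    symbols scored by per-stream likelihood tables. This is an
    illustrative sketch, not the paper's GPU implementation.
    """
    na, nv = trans_a.shape[0], trans_v.shape[0]
    T = len(obs_a)
    # alpha[t, i, j]: forward probability of composite state (i, j) at time t
    alpha = np.zeros((T, na, nv))
    alpha[0, 0, 0] = like_a[0, obs_a[0]] * like_v[0, obs_v[0]]
    for t in range(1, T):
        for i in range(na):
            for j in range(nv):
                if abs(i - j) > max_async:
                    continue  # prune composite states outside the asynchrony window
                # Sum over all predecessor composite states (pi, pj)
                p = 0.0
                for pi in range(na):
                    for pj in range(nv):
                        p += alpha[t - 1, pi, pj] * trans_a[pi, i] * trans_v[pj, j]
                alpha[t, i, j] = p * like_a[i, obs_a[t]] * like_v[j, obs_v[t]]
    return alpha
```

Even this small example makes the cost visible: with per-stream state counts n_a and n_v, the composite space has up to n_a * n_v states per frame, which motivates the manycore parallelization described in the paper.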
Bibliographic reference. Kolossa, Dorothea / Chong, Jike / Zeiler, Steffen / Keutzer, Kurt (2010): "Efficient manycore CHMM speech recognition for audiovisual and multistream data", In INTERSPEECH-2010, 2698-2701.