Tandem systems based on multi-layer perceptrons (MLPs) have improved the performance
of automatic speech recognition systems on both large vocabulary and noisy tasks.
One potential problem of the standard Tandem approach, however, is that the
MLPs generally used do not model temporal dynamics inherent in speech. In this
work, we propose a hybrid MLP/Structured-SVM model, in which the parameters
between the hidden layer and output layer and temporal transitions between output
layers are modeled by a Structured-SVM. A Structured-SVM can be thought of as
an extension to the classical binary support vector machine which can naturally
classify structures such as sequences. Using this approach, we can
identify sequences of phones in an utterance.
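To illustrate what classifying a whole sequence involves, the sketch below runs Viterbi decoding over per-frame label scores combined with label-to-label transition scores. This is the kind of joint inference a sequence model such as a Structured-SVM relies on. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name, the score tables, and the toy values are all hypothetical, and the per-frame scores merely stand in for a learned function of MLP hidden-layer activations.

```python
def viterbi_decode(emission, transition):
    """Return (best_path, best_score) for a chain-structured sequence model.

    emission[t][k]   -- hypothetical score of label k at frame t
    transition[j][k] -- hypothetical score of moving from label j to label k
    The path score is the sum of all emission and transition scores along it.
    """
    n_labels = len(emission[0])
    score = list(emission[0])   # best score of any path ending in each label
    backptrs = []               # backptrs[t-1][k] = best predecessor of k at frame t

    for frame in emission[1:]:
        new_score, ptr = [], []
        for k in range(n_labels):
            # Pick the predecessor label that maximizes the running score.
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transition[j][k])
            new_score.append(score[best_j] + transition[best_j][k] + frame[k])
            ptr.append(best_j)
        score = new_score
        backptrs.append(ptr)

    # Trace back from the best final label.
    last = max(range(n_labels), key=lambda k: score[k])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Toy example (made-up numbers): two labels, three frames.
# Frame 2 slightly prefers label 1 on its own, but staying in label 0
# scores higher once transitions are taken into account.
emission = [[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]]
transition = [[0.5, -0.5], [-0.5, 0.5]]
path, total = viterbi_decode(emission, transition)
print(path)
```

In the toy example, a frame-by-frame argmax would label the middle frame differently from its neighbours, whereas the joint decode keeps a single label throughout. Modeling temporal transitions is what allows this kind of correction.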
We evaluate this model on two corpora, Aurora2 and the large-vocabulary section of the ICSI meeting corpus, to investigate its performance in noisy conditions and on a large-vocabulary task. Compared to a strong Tandem baseline in which the MLP is trained using 2nd-order optimization methods, the MLP/Structured-SVM system decreases word error rate (WER) in noisy conditions by 7.9% relative. On the large-vocabulary corpus, the proposed system decreases WER by 1.1% absolute compared to the 2nd-order Tandem system.
Bibliographic reference. Ravuri, Suman V. (2014): "Hybrid MLP/structured-SVM tandem systems for large vocabulary and robust ASR", In INTERSPEECH-2014, 2729-2733.