The combination of acoustic models or features is a standard approach to exploiting various knowledge sources. This paper investigates the concatenation of different bottleneck (BN) neural network (NN) outputs for tandem acoustic modeling. Thus, the combination of NN features is performed via Gaussian mixture models (GMMs). Complementarity between the NN feature representations is attained by using various network topologies: LSTM recurrent, feed-forward, and hierarchical, as well as different non-linearities: hyperbolic tangent, sigmoid, and rectified linear units. Speech recognition experiments are carried out on various tasks: telephone conversations, Skype calls, as well as broadcast news and conversations. Results indicate that the LSTM-based tandem approach is still competitive, and that such a tandem model can challenge comparable hybrid systems. The traditional steps of tandem modeling, speaker-adaptive and sequence-discriminative GMM training, improve the tandem results further. Furthermore, these “old-fashioned” steps remain applicable after the concatenation of multiple neural network feature streams. By exploiting the parallel processing of input feature streams, it is shown that a 2–5% relative improvement can be achieved over the single best BN feature set. Finally, we also report results after neural-network-based language model rescoring and examine system combination possibilities using such complex tandem models.
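A minimal sketch of the feature-combination idea described above: bottleneck outputs from several networks are concatenated frame by frame, and the resulting vectors would then serve as GMM input features. All names, dimensions, and data here are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical bottleneck (BN) outputs from two different networks
# (e.g., an LSTM and a feed-forward net), one row per speech frame.
# Dimensions are assumed purely for illustration.
n_frames = 200
bn_lstm = np.random.default_rng(0).normal(size=(n_frames, 40))
bn_ffnn = np.random.default_rng(1).normal(size=(n_frames, 60))

# Parallel BN streams are combined by simple frame-wise concatenation;
# in a tandem system, these vectors would be modeled by per-state GMMs.
tandem_features = np.concatenate([bn_lstm, bn_ffnn], axis=1)
print(tandem_features.shape)  # (200, 100)
```

Because concatenation acts per frame, the streams only need to share the same frame rate; each network can otherwise use its own topology and non-linearity.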
Cite as: Tüske, Z., Michel, W., Schlüter, R., Ney, H. (2017) Parallel Neural Network Features for Improved Tandem Acoustic Modeling. Proc. Interspeech 2017, 1651-1655, doi: 10.21437/Interspeech.2017-1747
@inproceedings{tuske17_interspeech,
  author={Zoltán Tüske and Wilfried Michel and Ralf Schlüter and Hermann Ney},
  title={{Parallel Neural Network Features for Improved Tandem Acoustic Modeling}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1651--1655},
  doi={10.21437/Interspeech.2017-1747}
}