15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Improving Language-Universal Feature Extraction with Deep Maxout and Convolutional Neural Networks

Yajie Miao, Florian Metze

Carnegie Mellon University, USA

When deployed in automatic speech recognition (ASR), a deep neural network (DNN) can be treated as a complex feature extractor plus a simple linear classifier. Previous work has investigated the utility of multilingual DNNs acting as language-universal feature extractors (LUFEs). In this paper, we explore two strategies to further improve LUFEs. First, we replace the standard sigmoid nonlinearity with the recently proposed maxout units. The resulting maxout LUFEs have the useful property of generating sparse feature representations. Second, we apply the convolutional neural network (CNN) architecture to obtain a more invariant feature space. We evaluate the performance of LUFEs on a cross-language ASR task. Each of the proposed techniques reduces word error rate compared with the existing DNN-based LUFEs. Combining the two methods brings additional improvement on the target language.
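For readers unfamiliar with maxout units, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes the usual formulation in which a layer computes several linear pre-activations, splits them into fixed-size groups, and outputs the maximum of each group in place of a sigmoid. The function name `maxout_layer`, its parameters, and the toy weights are hypothetical and chosen only for illustration.

```python
def maxout_layer(x, weights, biases, group_size):
    """Apply one maxout layer to a single input vector (illustrative sketch).

    x          -- input vector, a list of floats
    weights    -- one weight vector per linear piece (hypothetical layout)
    biases     -- one bias per linear piece
    group_size -- number of linear pieces pooled into each maxout unit
    """
    if len(weights) != len(biases):
        raise ValueError("weights and biases must have the same length")
    if len(weights) % group_size != 0:
        raise ValueError("number of linear pieces must be divisible by group_size")

    # Linear pre-activations: one dot product plus bias per linear piece.
    z = [sum(xi * wi for xi, wi in zip(x, w)) + b
         for w, b in zip(weights, biases)]

    # Each maxout unit outputs the maximum over its group of pieces.
    return [max(z[i:i + group_size]) for i in range(0, len(z), group_size)]


# Toy example: 2 inputs, 4 linear pieces, pooled in groups of 2 -> 2 maxout units.
x = [1.0, 2.0]
weights = [[1.0, 0.0], [-1.0, 1.0], [0.0, 1.0], [2.0, -1.0]]
biases = [0.0, 0.0, 0.0, 0.0]
h = maxout_layer(x, weights, biases, group_size=2)
print(h)
```

In this toy case the four pre-activations are pooled pairwise, so the layer has two outputs. A real acoustic model would use learned weights and far larger layers.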

Bibliographic reference. Miao, Yajie / Metze, Florian (2014): "Improving language-universal feature extraction with deep maxout and convolutional neural networks", in INTERSPEECH-2014, 800-804.