The Toshiba entry to the CHiME 2018 Challenge

Rama Doddipatla, Takehiko Kagoshima, Cong-Thanh Do, Petko Petkov, Catalin-Tudor Zorila, Euihyun Kim, Daichi Hayakawa, Hiroshi Fujimura, Yannis Stylianou

This paper summarises the Toshiba entry to the single-array track of the CHiME 2018 speech recognition challenge. The system is based on conventional acoustic modelling (AM), where phonetic targets are tied to features at the frame level, and uses the provided tri-gram language model. It is ranked in category A, which focuses on acoustic robustness. Array signals are first enhanced using speaker-dependent generalised eigenvalue (GEV) based beamforming. Two different acoustic representations are then extracted from the enhanced signals: i) log Mel filter-bank and ii) subband temporal envelope (STE) features. A separate acoustic model is trained on each feature set, and the resulting systems are merged through lattice combination. The AM combines convolutional and recurrent architectures in a single CNN-BLSTM model. Speaker adaptation (limited to vocal tract length normalisation, VTLN), de-reverberation and speaker suppression are also considered. Following system combination, the Toshiba entry achieves 60.8% word error rate (WER) on the development (dev) set and 56.5% WER on the evaluation (eval) set. The system is ranked 4th in category A.
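
For readers unfamiliar with the WER metric quoted above, the following is a minimal illustrative sketch of how a word error rate can be computed as a word-level edit distance divided by the reference length. It is not the challenge's official scoring tool; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for this example only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenisation is plain whitespace splitting, and no
    text normalisation is applied (both are assumptions of this sketch).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    # Total substitutions + deletions + insertions, per reference word.
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 (100%) when the hypothesis is much longer than the reference. A figure such as 60.8% corresponds to roughly six word errors for every ten reference words.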

