This paper summarises the Toshiba entry to the single-array track of the CHiME 2018 speech recognition challenge. The system is based on conventional acoustic modelling (AM), where phonetic targets are tied to features at the frame level, and uses the provided tri-gram language model. It is ranked in category A, which focuses on acoustic robustness. Array signals are first enhanced using speaker-dependent generalised eigenvalue (GEV) based beamforming. Two different acoustic representations are then extracted from the enhanced signals: i) log Mel filter-bank and ii) subband temporal envelope (STE) features. Separate acoustic models, one trained on each feature set, are used for lattice combination. The AM combines convolutional and recurrent architectures in a single CNN-BLSTM model. Speaker adaptation (limited to vocal tract length normalisation, VTLN), de-reverberation and speaker suppression are also considered. Following system combination, the Toshiba entry achieves 60.8% word error rate (WER) on the development (dev) set and 56.5% WER on the evaluation (eval) set. The system is ranked 4th in category A.
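For readers unfamiliar with GEV beamforming, the following is a minimal sketch of the generic technique for a single frequency bin, not the authors' implementation. It assumes that target and noise spatial covariance matrices have already been estimated (the abstract does not say how this is done, nor how the speaker dependence is realised), and the function names `gev_beamformer` and `apply_beamformer`, the `diag_load` parameter and the toy data are all hypothetical.

```python
import numpy as np


def gev_beamformer(phi_target, phi_noise, diag_load=1e-6):
    """Return unit-norm GEV weights for one frequency bin.

    phi_target, phi_noise: (n_channels, n_channels) Hermitian spatial
    covariance matrices. How they are estimated is outside this sketch.
    """
    n_ch = phi_noise.shape[0]
    # Small diagonal loading (an assumption) keeps the noise covariance invertible.
    load = diag_load * np.trace(phi_noise).real / n_ch
    phi_noise = phi_noise + load * np.eye(n_ch)
    # The generalised problem  phi_target w = lambda phi_noise w  is solved
    # as the ordinary eigenproblem of  phi_noise^{-1} phi_target.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(phi_noise, phi_target))
    # The principal eigenvector maximises the output SNR.
    w = eigvecs[:, np.argmax(eigvals.real)]
    return w / np.linalg.norm(w)


def apply_beamformer(w, frames):
    """Filter STFT frames of shape (n_channels, n_frames) for one bin."""
    return w.conj() @ frames


def output_snr(w, phi_target, phi_noise):
    """Rayleigh quotient (w^H phi_target w) / (w^H phi_noise w)."""
    num = (w.conj() @ phi_target @ w).real
    den = (w.conj() @ phi_noise @ w).real
    return num / den


# Toy data: 4 microphones, a rank-1 target and spatially correlated noise.
rng = np.random.default_rng(0)
d = rng.standard_normal(4) + 1j * rng.standard_normal(4)   # hypothetical steering vector
noise = rng.standard_normal((4, 200)) + 1j * rng.standard_normal((4, 200))
phi_target = np.outer(d, d.conj())
phi_noise = noise @ noise.conj().T / noise.shape[1]

w = gev_beamformer(phi_target, phi_noise)
enhanced = apply_beamformer(w, noise)                       # shape (200,)
snr_gev = output_snr(w, phi_target, phi_noise)
snr_ref_channel = phi_target[0, 0].real / phi_noise[0, 0].real
```

In this toy setting the GEV weights give an output SNR at least as high as that of any single microphone, because the principal generalised eigenvector maximises the ratio of target to noise power at the beamformer output. A full system would repeat this per frequency bin and typically add a post-filter to limit the speech distortion that GEV weights can introduce.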
Cite as: Doddipatla, R., Kagoshima, T., Do, C.-T., Petkov, P., Zorila, C.-T., Kim, E., Hayakawa, D., Fujimura, H., Stylianou, Y. (2018) The Toshiba entry to the CHiME 2018 Challenge. Proc. 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018), 41-45, doi: 10.21437/CHiME.2018-9
@inproceedings{doddipatla18_chime,
  author={Rama Doddipatla and Takehiko Kagoshima and Cong-Thanh Do and Petko Petkov and Catalin-Tudor Zorila and Euihyun Kim and Daichi Hayakawa and Hiroshi Fujimura and Yannis Stylianou},
  title={{The Toshiba entry to the CHiME 2018 Challenge}},
  year=2018,
  booktitle={Proc. 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018)},
  pages={41--45},
  doi={10.21437/CHiME.2018-9}
}