Label Driven Time-Frequency Masking for Robust Continuous Speech Recognition

Meet Soni, Ashish Panda


Time-Frequency (T-F) masking based approaches to Automatic Speech Recognition have been shown to provide significant gains in system performance in the presence of additive noise. Such approaches improve performance when the T-F masking front-end is trained jointly with the acoustic model. However, such systems still rely on a pre-trained T-F masking enhancement block, trained using pairs of clean and noisy speech signals. Pre-training is necessary because of the large number of parameters associated with the enhancement network. In this paper, we propose flat-start joint training of a network that has both a T-F masking based enhancement block and a phoneme classification block. In particular, we use a fully convolutional network as the enhancement front-end to reduce the number of parameters. We train the network by jointly updating the parameters of both blocks using tied Context-Dependent phoneme states as targets. We observe that pre-training of the proposed enhancement block is not necessary for convergence. In fact, the proposed flat-start joint training converges faster than the baseline multi-condition trained model. Experiments performed on the Aurora-4 database show a 7.06% relative improvement over the multi-condition trained baseline. We obtain similar improvements for unseen test conditions as well.
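To make the idea of label-driven masking concrete, the following is a minimal sketch in plain Python, not the authors' network. It assumes a single 3x3 convolution with a sigmoid as the mask estimator, a single linear layer with softmax as the classifier, and toy feature sizes; all function names and parameters (`conv_mask`, `apply_mask`, `joint_loss`, `bias_gradient`) are hypothetical and chosen only for illustration. The point it tries to show is that the only training signal is the cross-entropy against state labels, and that this loss still depends on the parameters of the masking block, so no clean/noisy pairs are needed to update it.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv_mask(noisy, kernel, bias):
    """Estimate a T-F mask in (0, 1): one zero-padded 3x3 convolution + sigmoid.

    noisy is a list of frames; each frame is a list of per-bin magnitudes.
    (Assumed stand-in for the fully convolutional enhancement block.)
    """
    n_frames, n_bins = len(noisy), len(noisy[0])
    mask = [[0.0] * n_bins for _ in range(n_frames)]
    for t in range(n_frames):
        for f in range(n_bins):
            acc = bias
            for dt in (-1, 0, 1):
                for df in (-1, 0, 1):
                    tt, ff = t + dt, f + df
                    if 0 <= tt < n_frames and 0 <= ff < n_bins:
                        acc += kernel[dt + 1][df + 1] * noisy[tt][ff]
            mask[t][f] = sigmoid(acc)
    return mask


def apply_mask(noisy, mask):
    """Element-wise product of the mask and the noisy features."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, noisy)]


def classify(frame, weights):
    """Linear layer + softmax over classes for one frame (assumed classifier)."""
    logits = [sum(w * x for w, x in zip(row, frame)) for row in weights]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(noisy, labels, kernel, bias, weights):
    """Mean cross-entropy of the state labels, computed through the mask."""
    enhanced = apply_mask(noisy, conv_mask(noisy, kernel, bias))
    loss = 0.0
    for frame, label in zip(enhanced, labels):
        loss -= math.log(classify(frame, weights)[label])
    return loss / len(labels)


def bias_gradient(noisy, labels, kernel, bias, weights, eps=1e-5):
    """Central-difference gradient of the label loss w.r.t. the mask bias."""
    up = joint_loss(noisy, labels, kernel, bias + eps, weights)
    down = joint_loss(noisy, labels, kernel, bias - eps, weights)
    return (up - down) / (2.0 * eps)


# Toy data (made up for illustration): 2 frames, 3 frequency bins, 2 classes.
noisy = [[1.0, 0.5, 0.2], [0.3, 0.8, 0.1]]
labels = [0, 1]
kernel = [[0.1] * 3 for _ in range(3)]
weights = [[1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]]

print("loss:", joint_loss(noisy, labels, kernel, 0.0, weights))
print("d loss / d mask bias:", bias_gradient(noisy, labels, kernel, 0.0, weights))
```

A nonzero gradient with respect to the mask bias indicates that the label loss alone can drive updates of the enhancement block. In a realistic setting the finite-difference step would be replaced by backpropagation in a deep learning framework, and both blocks would be updated in the same optimizer step.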


DOI: 10.21437/Interspeech.2019-2172

Cite as: Soni, M., Panda, A. (2019) Label Driven Time-Frequency Masking for Robust Continuous Speech Recognition. Proc. Interspeech 2019, 426-430, DOI: 10.21437/Interspeech.2019-2172.


@inproceedings{Soni2019,
  author={Meet Soni and Ashish Panda},
  title={{Label Driven Time-Frequency Masking for Robust Continuous Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={426--430},
  doi={10.21437/Interspeech.2019-2172},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2172}
}