We propose two approaches for improving the objective function used in frame-level training of deep neural networks (DNNs) for large vocabulary continuous speech recognition (LVCSR). DNNs used in LVCSR are typically constructed with a softmax output layer, and the cross-entropy objective function is commonly employed in frame-level training. This pairing of softmax activation and cross-entropy objective contributes substantially to the success of DNNs. The first approach developed in this paper improves the cross-entropy objective by boosting the importance of frames for which the DNN has low target predictions (low target posterior probabilities); the second jointly minimizes the cross-entropy and maximizes the log posterior ratio between the target senone (tied-triphone state) and its most competing one. Experiments on the Switchboard task demonstrate that the two proposed methods provide 3.1% and 1.5% relative word error rate (WER) reductions, respectively, over an already very strong conventional cross-entropy trained DNN system.
Bibliographic reference. Huang, Zhen / Li, Jinyu / Weng, Chao / Lee, Chin-Hui (2014): "Beyond cross-entropy: towards better frame-level objective functions for deep neural network training in automatic speech recognition", In INTERSPEECH-2014, 1214-1218.
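The two objectives described in the abstract can be sketched as frame-level losses over softmax posteriors. The specific boosting weight `(1 - p_target)**beta` and the weighting factor `lam` below are illustrative assumptions, not the paper's exact formulations.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def boosted_cross_entropy(logits, targets, beta=1.0):
    # Boost the importance of frames with low target posteriors by
    # weighting each frame's CE term with (1 - p_target)**beta.
    # This particular boosting form is an assumption for illustration.
    p = softmax(logits)
    n = logits.shape[0]
    p_t = p[np.arange(n), targets]
    return np.mean((1.0 - p_t) ** beta * -np.log(p_t))

def ce_plus_log_posterior_ratio(logits, targets, lam=0.1):
    # Jointly minimize CE and maximize the log posterior ratio between
    # the target senone and the most competing one; lam (hypothetical)
    # balances the two terms.
    p = softmax(logits)
    n = logits.shape[0]
    p_t = p[np.arange(n), targets]
    competitors = p.copy()
    competitors[np.arange(n), targets] = -np.inf  # exclude the target
    p_c = competitors.max(axis=-1)                # strongest competitor
    return np.mean(-np.log(p_t) - lam * (np.log(p_t) - np.log(p_c)))
```

With `beta=0` the boosted loss reduces to plain cross-entropy, and with `lam=0` the joint objective does as well, so both sketches degrade gracefully to the conventional baseline.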