We present a method for modeling non-lexical vocabulary items such as numbers, times, dates, monetary amounts, and address components that avoids the data-sparsity and out-of-vocabulary problems of written-domain language models. Like previous approaches, we use a class-based language model and efficient finite-state class grammars during run-time decoding. We mitigate the problem of context-independent replacement of class items by employing a contextual sequence labeling model to identify which class instances should be replaced, leaving the others in their original written form. Applied to general voice-search audio transcription, our method achieves a 10% relative reduction on the numeric error rate metric compared to the previous system, which was based on a verbalizer transducer. On a numeric entity recognition task, it achieves a 23% relative reduction on the same metric. In both cases, word error rate remains the same or is reduced.
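To make the selective-replacement idea concrete, below is a minimal Python sketch, not the paper's implementation: a trivial cue-word heuristic stands in for the trained contextual sequence labeler, and the $NUMBER class-token name, cue-word list, and example sentence are illustrative assumptions. The sketch collapses each tagged span into a single class token for class-based LM training while leaving untagged numeric tokens in their original written form.

    # Illustrative sketch only: a cue-word heuristic stands in for the
    # paper's trained contextual sequence labeler, and $NUMBER is an
    # assumed class-token name. Tagged spans are collapsed into a single
    # class token; untagged instances stay in their written form.
    from typing import List, Tuple

    def toy_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
        """Stand-in for a contextual sequence labeler: tag a digit token
        as NUMBER only when its left context suggests a numeric entity."""
        cues = {"call", "dial", "room", "extension", "at"}
        tagged: List[Tuple[str, str]] = []
        for i, tok in enumerate(tokens):
            prev_tag = tagged[-1][1] if tagged else "O"
            cued = (i > 0 and tokens[i - 1] in cues) or prev_tag == "NUMBER"
            tagged.append((tok, "NUMBER" if tok.isdigit() and cued else "O"))
        return tagged

    def replace_tagged_spans(tagged: List[Tuple[str, str]]) -> List[str]:
        """Collapse each maximal run of identically tagged tokens into one
        class token (e.g. $NUMBER); 'O'-tagged tokens pass through verbatim."""
        out: List[str] = []
        prev_tag = "O"
        for token, tag in tagged:
            if tag == "O":
                out.append(token)
            elif tag != prev_tag:
                out.append(f"${tag}")  # open a new class span
            # tokens continuing the current span are absorbed into its token
            prev_tag = tag
        return out

    if __name__ == "__main__":
        sent = "call 6 5 0 5 5 5 1 2 1 2 for pizza 2 go".split()
        print(" ".join(replace_tagged_spans(toy_tagger(sent))))
        # -> call $NUMBER for pizza 2 go  (the '2' in the name stays verbatim)

At decode time the class token would be expanded by a finite-state class grammar, as the abstract describes; the only point illustrated here is that replacement is conditioned on context rather than applied to every numeric instance.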
Cite as: Vasserman, L., Schogol, V., Hall, K. (2015) Sequence-based class tagging for robust transcription in ASR. Proc. Interspeech 2015, 473-477, doi: 10.21437/Interspeech.2015-178
@inproceedings{vasserman15_interspeech,
  author={Lucy Vasserman and Vlad Schogol and Keith Hall},
  title={{Sequence-based class tagging for robust transcription in ASR}},
  year={2015},
  booktitle={Proc. Interspeech 2015},
  pages={473--477},
  doi={10.21437/Interspeech.2015-178}
}