We present a method of modeling non-lexical vocabulary items such as numbers, times, dates, monetary amounts, and address components that avoids the data sparsity and out-of-vocabulary problems of written-domain language models. Like previous approaches, we use a class-based language model and efficient finite-state class grammars during run-time decoding. We mitigate the problem of context-independent replacement of class items by employing a contextual sequence labeling model to identify which class instances should be replaced, leaving the others in their original form. Applied to the task of general voice-search audio transcription, our method achieves a 10% relative error reduction (on the numeric error rate metric) compared to the previous system (based on a verbalizer transducer). On a numeric entity recognition task, our method achieves a 23% relative error reduction on the same metric. In both cases, word error rate remains the same or is reduced.
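The core idea of context-dependent class replacement can be sketched as follows. This is a minimal toy illustration, not the paper's system: the function names, the `$NUM` class symbol, and the regex rule standing in for the paper's contextual sequence labeling model are all assumptions made for the example.

```python
import re

CLASS_SYMBOL = "$NUM"  # hypothetical class symbol, not from the paper


def tag_tokens(tokens):
    """Assign each token a NUM or O label.

    A real system would use a trained contextual sequence labeler;
    this regex rule is only a stand-in for illustration.
    """
    return ["NUM" if re.fullmatch(r"\d[\d,.:]*", tok) else "O"
            for tok in tokens]


def apply_class_replacement(tokens, tags):
    """Collapse each maximal run of NUM-tagged tokens into one class
    symbol, leaving untagged tokens in their original written form."""
    out = []
    prev = "O"
    for tok, tag in zip(tokens, tags):
        if tag == "NUM":
            if prev != "NUM":
                out.append(CLASS_SYMBOL)  # open a new class span
        else:
            out.append(tok)
        prev = tag
    return out


sentence = "call me at 555 1234 tomorrow".split()
print(apply_class_replacement(sentence, tag_tokens(sentence)))
# → ['call', 'me', 'at', '$NUM', 'tomorrow']
```

At decoding time, each class symbol would then be expanded by an efficient finite-state class grammar; tokens the labeler leaves untagged never enter a class grammar, which is how contextual tagging avoids the over-replacement of a context-independent scheme.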
Bibliographic reference: Vasserman, Lucy / Schogol, Vlad / Hall, Keith (2015): "Sequence-based class tagging for robust transcription in ASR", in Proc. INTERSPEECH 2015, pp. 473-477.