Sixth International Conference on Spoken Language Processing (ICSLP 2000)
October 16-20, 2000
Semantic Tokenization of Verbalized Numbers in Language Modeling
Xiaoqiang Luo, Martin Franz
IBM T.J. Watson Research Center,
Yorktown Heights, NY, USA
In spoken dialog systems, number strings frequently carry
crucial information such as DATE, TIME, and PRICE. Yet
numbers are inherently difficult to recognize, partly because
reliable statistics for training a language model are hard to
obtain. In this paper, we take advantage of the fact that
dialog systems perform some form of semantic parsing. We
use this parsing information to distinguish between occurrences
of number expressions in different semantic roles,
for example the word "one" in "one o’clock",
"sunday june one", and "another one", in order to improve the performance
of the language model and thus reduce the error rate.
We process number expressions with the same spelling, but
different semantics, as separate language model tokens. We
have tested this approach in a speech recognition system
used as a part of a dialog system for the Air Travel domain.
In a controlled experiment, the proposed technique yields a
healthy 9.75% relative reduction in overall word error rate on a
test set of 689 sentences collected using a live telephony
Air Travel system.
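
The idea of treating identically spelled number words as separate language model tokens can be illustrated with a minimal sketch. It is not the authors' implementation: the per-word label format, the "@" separator, the NUMBER_WORDS list, and the function name semantic_tokenize are all assumptions made for illustration. The labels TIME and DATE are taken from the abstract; None stands for a word to which the parser assigned no semantic role.

```python
from collections import Counter

# Hypothetical list of verbalized number words; a real system would need a fuller set.
NUMBER_WORDS = {
    "one", "two", "three", "four", "five", "six",
    "seven", "eight", "nine", "ten", "eleven", "twelve",
}


def semantic_tokenize(words, labels):
    """Append the semantic label to number words; leave all other words unchanged.

    `labels` is assumed to hold one parser label (or None) per word.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    return [
        f"{word}@{label}" if word in NUMBER_WORDS and label else word
        for word, label in zip(words, labels)
    ]


# Toy corpus built from the three phrases quoted in the abstract.
corpus = [
    (["one", "o'clock"], ["TIME", "TIME"]),
    (["sunday", "june", "one"], ["DATE", "DATE", "DATE"]),
    (["another", "one"], [None, None]),
]

tokenized = [semantic_tokenize(words, labels) for words, labels in corpus]

# Unigram counts now keep the three uses of "one" apart.
counts = Counter(token for sentence in tokenized for token in sentence)

for sentence in tokenized:
    print(" ".join(sentence))
```

With this tokenization, n-gram statistics for "one@TIME", "one@DATE", and untagged "one" are collected separately, so each token is modeled in the contexts where it actually occurs. The tags would have to be stripped again before scoring recognition output against reference transcripts.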
Luo, Xiaoqiang / Franz, Martin (2000):
"Semantic tokenization of verbalized numbers in language modeling",
In ICSLP-2000, vol.1, 158-161.