15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Cluster Based Chinese Abbreviation Modeling

Yangyang Shi, Yi-Cheng Pan, Mei-Yuh Hwang

Microsoft, China

Abbreviations are widely used in spoken Chinese. Automatic generation of Chinese abbreviations helps to improve Chinese natural language understanding systems and Chinese search engines. We treat abbreviation generation as a character-based tagging problem. Because training data is limited, Chinese abbreviation generation suffers from data sparseness. We propose two types of strategies to reduce its impact. First, in addition to a traditional sequence labelling method, Conditional Random Fields (CRF), we propose to apply a Recurrent Neural Network with Maximum Entropy Extension (RNNME), which shows performance similar to that of the CRF in our experiments. Second, we propose to use training data clustering and latent topic modeling in abbreviation generation. Using training data clustering or topic modeling not only addresses data sparseness, but also takes advantage of the fact that full names from the same cluster or the same latent topic have similar abbreviation patterns. Our experimental results show that with manual clustering, the accuracy of abbreviation generation improves by a relative 8%. With latent topics obtained from Latent Dirichlet Allocation (LDA), the accuracy improves by a relative 10%.
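To make the character-based tagging formulation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a two-label scheme ("K" for a character kept in the abbreviation, "S" for a skipped one) and a greedy left-to-right alignment; the label names, the function names, and the alignment rule are all illustrative assumptions, and the abstract does not specify the actual label set.

```python
def labels_from_pair(full_name, abbreviation):
    """Derive per-character labels from a (full name, abbreviation) pair.

    Assumption: the abbreviation is a subsequence of the full name, and
    each character is labelled "K" (kept) or "S" (skipped). These label
    names are hypothetical, chosen for illustration only.
    """
    labels = []
    j = 0  # position of the next abbreviation character still to match
    for ch in full_name:
        if j < len(abbreviation) and ch == abbreviation[j]:
            labels.append("K")
            j += 1
        else:
            labels.append("S")
    if j != len(abbreviation):
        raise ValueError("abbreviation is not a subsequence of the full name")
    return labels


def apply_labels(full_name, labels):
    """Rebuild an abbreviation by keeping the characters labelled "K"."""
    return "".join(ch for ch, lab in zip(full_name, labels) if lab == "K")


# Example pair: 北京大学 (Peking University) is commonly abbreviated as 北大.
example_labels = labels_from_pair("北京大学", "北大")
example_abbreviation = apply_labels("北京大学", example_labels)
```

Under this framing, a sequence labeller such as a CRF would be trained to predict the label sequence from the characters of the full name, and `apply_labels` would turn a predicted sequence back into an abbreviation.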


Bibliographic reference. Shi, Yangyang / Pan, Yi-Cheng / Hwang, Mei-Yuh (2014): "Cluster based Chinese abbreviation modeling", in INTERSPEECH-2014, 273-277.