Sixth International Conference on Spoken Language Processing
This paper introduces a method that efficiently reduces acoustic model size and computation for LVCSR based on the continuous-density hidden Markov model (CDHMM). The method uses the Bhattacharyya distance as the criterion for quantizing the mean and variance vectors of the Gaussian mixtures. To minimize the quantization error, the feature vector is separated into multiple streams (such as MFCCs, delta-MFCCs and delta-delta MFCCs), and a modified K-means clustering algorithm is applied to each stream. The key idea of the modified K-means algorithm is to dynamically split and merge clusters during each iteration according to each cluster's size and average distortion. The proposed approach cuts the acoustic model size by 87%, from 21.42MB to 2.75MB, relative to a CDHMM baseline system (with 12 mixtures, 6k states) by using 256 and 8192 codewords for each stream of mean and variance vectors of the Gaussian mixtures. A recognition experiment on a Chinese LVCSR dictation system (51K words) shows that with the 87% smaller compact model, the WER increases by about 5% relative, from 9.8% for the CDHMM baseline system to 10.3%. After quantization, the Gaussian likelihoods need to be computed only once at the beginning of every frame and can be stored in a lookup table, so the computation during decoding is greatly reduced as well.
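To illustrate the distance criterion named in the abstract, the following is a minimal sketch of the standard Bhattacharyya distance between two diagonal-covariance Gaussians, together with a nearest-codeword lookup. The abstract does not state the covariance structure or how the distance is applied to the mean and variance codebooks, so the diagonal-covariance assumption, the function names `bhattacharyya_diag` and `nearest_codeword`, and the codebook layout are all hypothetical and are not taken from the paper.

```python
import math

def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two Gaussians with diagonal covariances.

    Assumption: diagonal covariances, so the distance is a sum over dimensions.
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v_avg = 0.5 * (v1 + v2)                             # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / v_avg              # mean-separation term
        dist += 0.5 * math.log(v_avg / math.sqrt(v1 * v2))  # variance-mismatch term
    return dist

def nearest_codeword(mean, var, codebook):
    """Index of the codeword closest to (mean, var).

    `codebook` is a hypothetical list of (mean_vector, variance_vector) pairs.
    """
    return min(
        range(len(codebook)),
        key=lambda k: bhattacharyya_diag(mean, var, codebook[k][0], codebook[k][1]),
    )

# Toy one-dimensional codebook with two entries.
codebook = [([0.0], [1.0]), ([5.0], [1.0])]
index = nearest_codeword([4.0], [1.0], codebook)
```

In a clustering loop of the kind the abstract describes, a function like this would play the role of the distortion measure when assigning each Gaussian in a stream to its closest codeword.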
Bibliographic reference. Pan, Jielin / Yuan, Baosheng / Yan, Yonghong (2000): "Effective vector quantization for a highly compact acoustic model for LVCSR", In ICSLP-2000, vol.4, 318-321.