10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Using VTLN Matrices for Rapid and Computationally-Efficient Speaker Adaptation with Robustness to First-Pass Transcription Errors

S. P. Rath, S. Umesh, A. K. Sarkar

IIT Kanpur, India

In this paper, we propose to combine the rapid adaptation capability of conventional Vocal Tract Length Normalization (VTLN) with the computational efficiency of transform-based adaptation such as MLLR or CMLLR. VTLN requires the estimation of only one parameter and is therefore best suited to cases where there is little adaptation data (i.e. rapid adaptation). In contrast, transform-based adaptation methods require the estimation of matrices. However, the drawback of conventional VTLN is that it is computationally expensive, since it requires multiple spectral warpings to generate VTLN-warped features. We have recently shown that VTLN warping can be implemented by a linear transformation (LT) of the conventional MFCC features. These LTs are analytically pre-computed and stored. In this framework of LT VTLN, the computational complexity of VTLN is similar to that of transform-based adaptation, since warp-factor estimation can be done using the same sufficient statistics as those used in CMLLR. We show that VTLN provides a significant improvement in performance over transform-based adaptation methods when little adaptation data is available. We also show that the use of an additional decorrelating transform, MLLT, along with the VTLN matrices, gives performance that is better than MLLR and comparable to SAT with MLLT, even for large amounts of adaptation data. Further, we show that in the mismatched train and test case (i.e. poor first-pass transcription), VTLN provides a significant improvement over the transform-based adaptation methods. We compare the performance of the different methods on the WSJ, RM and TIDIGITS databases.
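
The idea of choosing a warp factor from a stored set of linear transforms can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `select_warp`, `log_abs_det` and `diag_gauss_loglik` are hypothetical, the 2x2 scaling matrices are toy stand-ins for real VTLN matrices, and a single diagonal Gaussian replaces the aligned HMM states. It also rescores every frame for each candidate, whereas the abstract describes working from CMLLR-style sufficient statistics, which avoids that repeated pass.

```python
import math


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def log_abs_det(A):
    """log|det A| via Gaussian elimination with partial pivoting."""
    M = [row[:] for row in A]
    n = len(M)
    total = 0.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if M[p][c] == 0.0:
            return float("-inf")  # singular matrix
        M[c], M[p] = M[p], M[c]
        total += math.log(abs(M[c][c]))
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n):
                M[r][k] -= f * M[c][k]
    return total


def diag_gauss_loglik(x, mean, var):
    """Log-density of x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def select_warp(frames, matrices, mean, var):
    """Pick the candidate whose transform maximises the likelihood.

    `matrices` maps a warp-factor label to a pre-computed square matrix.
    The log|det A| term is the Jacobian of the feature transform, added
    once per frame so that scores are comparable across candidates.
    """
    scores = {}
    for alpha, A in matrices.items():
        jac = log_abs_det(A)
        scores[alpha] = sum(
            diag_gauss_loglik(matvec(A, x), mean, var) + jac for x in frames
        )
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (assumed): frames that line up with the model mean after 0.8 scaling.
frames = [[1.25, -1.25], [1.30, -1.20], [1.20, -1.30]]
matrices = {
    0.8: [[0.8, 0.0], [0.0, 0.8]],
    1.0: [[1.0, 0.0], [0.0, 1.0]],
    1.2: [[1.2, 0.0], [0.0, 1.2]],
}
best_alpha, scores = select_warp(frames, matrices, mean=[1.0, -1.0], var=[0.01, 0.01])
print(best_alpha)
```

Because only one discrete parameter is searched, the selection stays well-defined even with a handful of frames, which is the property the abstract relies on for rapid adaptation.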


Bibliographic reference. Rath, S. P. / Umesh, S. / Sarkar, A. K. (2009): "Using VTLN matrices for rapid and computationally-efficient speaker adaptation with robustness to first-pass transcription errors", in INTERSPEECH-2009, 572-575.