Second International Conference on Spoken Language Processing (ICSLP'92)
Banff, Alberta, Canada
Adapting a speech recognition system to a new speaker may require a transformation of the parameter space. This transformation maps the space of the new speaker into a reference space, thus improving the quality of the similarity measure.
The transformation is determined from vector pairs obtained by aligning utterances of the reference speaker against those of the new speaker, using Dynamic Time Warping (DTW) for template-based systems or, equivalently, Viterbi alignment for hidden Markov model based systems.
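As an illustration of how DTW yields vector pairs, the following is a minimal sketch and not the authors' implementation. It assumes a Euclidean frame distance and the simplest symmetric step pattern (diagonal, vertical, horizontal); the function name `dtw_pairs` and the toy feature values are hypothetical.

```python
import math

def dtw_pairs(ref, new):
    """Align two sequences of feature vectors with DTW.

    Returns (accumulated_cost, pairs), where pairs is a list of
    (ref_index, new_index) tuples along the optimal warping path.
    Assumption: Euclidean frame distance, unconstrained step pattern.
    """
    n, m = len(ref), len(new)
    inf = float("inf")
    # acc[i][j]: minimal accumulated distance aligning ref[:i+1] with new[:j+1]
    acc = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(ref[i], new[j])
            if i == 0 and j == 0:
                acc[i][j] = d
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((acc[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i][j - 1], (i, j - 1)))
            best, prev = min(candidates)
            acc[i][j] = best + d
            back[i][j] = prev
    # Backtrack from the final cell to recover the aligned index pairs.
    pairs = []
    cell = (n - 1, m - 1)
    while cell is not None:
        pairs.append(cell)
        cell = back[cell[0]][cell[1]]
    pairs.reverse()
    return acc[n - 1][m - 1], pairs

# Toy one-dimensional "utterances": the new speaker holds the first frame longer.
reference = [(0.0,), (1.0,), (2.0,)]
new_speaker = [(0.0,), (0.0,), (1.0,), (2.0,)]
cost, pairs = dtw_pairs(reference, new_speaker)
```

Each returned pair links one reference frame to one new-speaker frame; these pairs are the training data from which a transformation would be estimated.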
Traditionally, the vector pairs are given by a single DTW, on the assumption that the vector-wise acoustic similarity measure in DTW actually measures phonetic similarity. However, since utterances by different speakers occupy spectral spaces that can be totally different, the acoustic similarity measure used by DTW may not correspond to phonetic similarity. From a phonetic point of view, the acoustic similarity measure may have no meaning.
We propose to optimize the alignment by iterative DTWs in order to reduce incorrect vector mappings. The principle consists in progressively moving, by applying a transformation, the parameter space occupied by the new speaker's utterance towards that of the reference speaker. At each move, a new alignment is determined, which is presumed to be phonetically better than the previous one. Using this alignment, a new transformation is estimated and applied. The procedure stops when no alignment improvement can be observed. Experiments show a significant decrease in alignment error.
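The iterative procedure can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the form of the transformation, so a simple per-dimension translation (the mean difference over aligned pairs) is swapped in here, and "no alignment improvement" is approximated by the alignment path no longer changing. All names (`dtw_path`, `estimate_shift`, `iterative_alignment`, `max_iter`) are hypothetical.

```python
import math

def dtw_path(ref, new):
    """Return the DTW path as (ref_index, new_index) pairs (Euclidean distance)."""
    n, m = len(ref), len(new)
    acc = [[float("inf")] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(ref[i], new[j])
            steps = [(acc[a][b], (a, b))
                     for a, b in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                     if a >= 0 and b >= 0]
            if steps:
                best, back[i][j] = min(steps)
                acc[i][j] = best + d
            else:
                acc[i][j] = d  # start cell (0, 0)
    path, cell = [], (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    return path[::-1]

def estimate_shift(ref, new, pairs):
    """Assumed transformation: mean per-dimension offset over aligned pairs."""
    dim = len(ref[0])
    return [sum(ref[i][k] - new[j][k] for i, j in pairs) / len(pairs)
            for k in range(dim)]

def iterative_alignment(ref, new, max_iter=20):
    """Alternate DTW alignment and transformation until the path stops changing."""
    moved = [list(v) for v in new]
    previous = None
    for iteration in range(1, max_iter + 1):
        pairs = dtw_path(ref, moved)
        if pairs == previous:
            break  # alignment unchanged: no further improvement observed
        shift = estimate_shift(ref, moved, pairs)
        # Move the new speaker's vectors towards the reference space.
        moved = [[x + s for x, s in zip(v, shift)] for v in moved]
        previous = pairs
    return pairs, moved, iteration

# Toy data: the new speaker's space is the reference space offset by 5.
reference = [[0.0], [1.0], [2.0], [3.0]]
new_speaker = [[5.0], [6.0], [7.0], [8.0]]
pairs, moved, iterations = iterative_alignment(reference, new_speaker)
```

In this toy case the first alignment is already correct, so the loop stops at the second pass once the path is confirmed stable. A richer transformation (for example an affine map) would slot into `estimate_shift` without changing the loop structure.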
Bibliographic reference. Gong, Yifan / Siohan, Olivier / Haton, Jean-Paul (1992): "Minimization of speech alignment error by iterative transformation for speaker adaptation", In ICSLP-1992, 377-380.