Due to physiology and linguistic difference between speakers, the spectrum pattern for the same phoneme of two speakers can be quite dissimilar. Without appropriate alignment on the frequency axis, the inter-speaker variation will reduce the modeling efficiency and result in performance degradation. In this paper, a novel data-driven framework is proposed to build the alignment of the frequency axes of two speakers. This alignment between two frequency axes is essentially a frequency domain correspondence of these two speakers. To establish the frequency domain correspondence, we formulate the task as an optimal matching problem. The local matching is achieved by comparing the local features of the spectrogram along the frequency bins. This local matching is actually capturing the similarity of the local patterns along different frequency bins in the spectrogram. After the local matching, a dynamic programming is then applied to find the global optimal alignment between two frequency axes. Experiments on TIDIGITS and TIMIT clearly show the effectiveness of this method.
Bibliographic reference. Liu, Ming / Zhou, Xi / Hasegawa-Johnson, Mark / Huang, Thomas S. / Zhang, Zhengyou (2007): "Frequency domain correspondence for speaker normalization", In INTERSPEECH-2007, 274-277.