A Combination of Model-Based and Feature-Based Strategy for Speech-to-Singing Alignment

Bidisha Sharma, Haizhou Li


Speech and singing differ in many ways. In this work, we propose a novel method to align phonetically identical spoken lyrics with singing vocals in a speech-singing parallel corpus, which is needed for speech-to-singing conversion. We align speech to the singing vocal using a combination of model-based forced alignment and feature-based dynamic time warping (DTW). We first obtain the word boundaries of the speech and singing vocals by forced alignment, using speech-adapted and singing-adapted acoustic models, respectively. Since speech acoustic models are more accurate than singing acoustic models, the boundaries of spoken words are more reliable than those of sung words. By searching in the neighborhood of the sung word boundaries in the singing vocal, we aim to improve the alignment between spoken and sung words. Taking the word boundaries as landmarks, we then perform speech-to-singing alignment at the frame level using DTW. The proposed method achieves a 47.5% reduction in word boundary error over the baseline, and a subsequent improvement in singing quality in a speech-to-singing conversion system.
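The frame-level DTW step described above can be sketched as follows. This is a minimal illustrative implementation of standard DTW between two feature sequences; the feature representation, distance measure, and step pattern are assumptions for illustration, not the paper's exact configuration (in the proposed method, DTW would be applied within each landmark-delimited word segment).

```python
import numpy as np

def dtw_align(X, Y):
    """Frame-level alignment of two feature sequences via dynamic time warping.

    X: (n, d) array of speech frames; Y: (m, d) array of singing frames.
    Returns the optimal warping path as a list of (i, j) frame-index pairs.
    """
    n, m = len(X), len(Y)
    # Pairwise Euclidean distances between frames (illustrative local cost).
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

    # Accumulated-cost matrix with the standard step pattern
    # (diagonal match, insertion, deletion).
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j - 1],
                                               D[i - 1, j],
                                               D[i, j - 1])

    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Restricting DTW to the region between adjacent word boundaries, as the landmark-based approach does, keeps the warping path from drifting across word boundaries and reduces the search space relative to aligning the full utterance at once.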


DOI: 10.21437/Interspeech.2019-1942

Cite as: Sharma, B., Li, H. (2019) A Combination of Model-Based and Feature-Based Strategy for Speech-to-Singing Alignment. Proc. Interspeech 2019, 624-628, DOI: 10.21437/Interspeech.2019-1942.


@inproceedings{Sharma2019,
  author={Bidisha Sharma and Haizhou Li},
  title={{A Combination of Model-Based and Feature-Based Strategy for Speech-to-Singing Alignment}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={624--628},
  doi={10.21437/Interspeech.2019-1942},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1942}
}