A Strategy for Improved Phone-Level Lyrics-to-Audio Alignment for Speech-to-Singing Synthesis

David Ayllón, Fernando Villavicencio, Pierre Lanchantin


Speech-to-Singing refers to techniques that transform speech into a singing voice. The performance of this process depends heavily on how precisely the phonetic sequence of the input speech is aligned to the timing of the target singing. Unfortunately, the precision of existing techniques for phone-level lyrics-to-audio alignment has been found insufficient for this task. We propose a complete pipeline for automatic phone-level lyrics-to-audio alignment based on an HMM-based forced aligner and singing acoustics normalization. As reported in the objective evaluation, the system achieves phone-level precision in the range of a few tens of milliseconds. The subjective evaluation reveals that the smoothness of the singing voice generated with the proposed methodology is close to that obtained using manual alignments.
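To make the notion of phone-level precision "in the range of a few tens of milliseconds" concrete, the following is a minimal sketch of how an automatic alignment might be compared against a manual one. It is an illustration only: the abstract does not specify the metric used in the objective evaluation, so the onset-deviation measure, the `Phone` record, and the function names below are all assumptions introduced here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Phone:
    """One aligned phone segment (hypothetical record, times in seconds)."""
    label: str
    start: float
    end: float


def boundary_errors_ms(reference, predicted):
    """Absolute onset deviation per phone, in milliseconds.

    Assumes both alignments contain the same phone sequence, as is the
    case when a forced aligner is given the known lyrics.
    """
    if not reference:
        raise ValueError("alignments must not be empty")
    if [p.label for p in reference] != [p.label for p in predicted]:
        raise ValueError("alignments must share the same phone sequence")
    return [abs(hyp.start - ref.start) * 1000.0
            for ref, hyp in zip(reference, predicted)]


def mean_boundary_error_ms(reference, predicted):
    """Mean absolute onset deviation in milliseconds."""
    return mean(boundary_errors_ms(reference, predicted))


def fraction_within(reference, predicted, tolerance_ms):
    """Share of phone onsets that fall within a tolerance of the reference."""
    errors = boundary_errors_ms(reference, predicted)
    return sum(e <= tolerance_ms for e in errors) / len(errors)


# Toy data, invented for illustration; not taken from the paper.
manual = [Phone("l", 0.00, 0.10), Phone("a", 0.10, 0.60)]
automatic = [Phone("l", 0.02, 0.13), Phone("a", 0.13, 0.60)]

print(mean_boundary_error_ms(manual, automatic))
print(fraction_within(manual, automatic, tolerance_ms=25.0))
```

Reporting both a mean deviation and the share of boundaries within a tolerance is one common way to summarize alignment quality; whether the paper uses these exact measures is not stated in the abstract.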


DOI: 10.21437/Interspeech.2019-3049

Cite as: Ayllón, D., Villavicencio, F., Lanchantin, P. (2019) A Strategy for Improved Phone-Level Lyrics-to-Audio Alignment for Speech-to-Singing Synthesis. Proc. Interspeech 2019, 2603-2607, DOI: 10.21437/Interspeech.2019-3049.


@inproceedings{Ayllón2019,
  author={David Ayllón and Fernando Villavicencio and Pierre Lanchantin},
  title={{A Strategy for Improved Phone-Level Lyrics-to-Audio Alignment for Speech-to-Singing Synthesis}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2603--2607},
  doi={10.21437/Interspeech.2019-3049},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3049}
}