ISCA Archive Interspeech 2023

Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme

Hieu-Thi Luong, Junichi Yamagishi

As prompt-based generative models have received much attention, many studies have proposed similar models for sound generation. While prompt-based generative models offer an intuitive interface for non-professional users to experiment with, they lack the ability to control the generated sounds by more direct means. In this work, we investigated the use of a simple segment-based labeling scheme for human vocalization generation, a specific subset of sound generation. By conditioning the generative model on a label sequence that marks the vocalization class of each segment, the generated sound can be controlled in a more detailed manner while maintaining a simple and intuitive input interface. Our experiments showed that simply switching the labeling scheme from global to segment-based does not degrade the quality of the generated samples and provides a new way of controlling the generation process.
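The abstract does not specify how the segment labels are encoded for conditioning. As a rough illustrative sketch only, not the authors' implementation, a segment-based scheme could be realized by expanding per-segment class annotations into a frame-level one-hot conditioning sequence; the class names, frame rate, and helper function below are all hypothetical.

```python
import numpy as np

# Hypothetical vocalization classes; the paper's actual label set
# is not given in this abstract.
CLASSES = ["silence", "laugh", "cry", "scream", "moan"]

def segments_to_frame_labels(segments, total_frames, frame_rate=100):
    """Expand (start_sec, end_sec, class_name) segment annotations into
    a frame-level one-hot matrix that a generative model could take as
    a conditioning input. frame_rate is an assumed value.
    """
    labels = np.zeros((total_frames, len(CLASSES)), dtype=np.float32)
    labels[:, CLASSES.index("silence")] = 1.0  # default all frames to silence
    for start, end, name in segments:
        i = int(start * frame_rate)
        j = min(int(end * frame_rate), total_frames)
        labels[i:j] = 0.0
        labels[i:j, CLASSES.index(name)] = 1.0  # mark the segment's class
    return labels

# Example: a 3-second clip (300 frames at 100 fps) with a laugh
# annotated from 0.5 s to 1.2 s.
cond = segments_to_frame_labels([(0.5, 1.2, "laugh")], total_frames=300)
print(cond.shape)  # (300, 5)
```

Compared with a single global label for the whole clip, such a frame-level sequence lets the user place different vocalization classes at chosen times within one generated sample, which matches the finer-grained control the abstract describes.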


doi: 10.21437/Interspeech.2023-1175

Cite as: Luong, H.-T., Yamagishi, J. (2023) Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme. Proc. INTERSPEECH 2023, 4379-4383, doi: 10.21437/Interspeech.2023-1175

@inproceedings{luong23_interspeech,
  author={Hieu-Thi Luong and Junichi Yamagishi},
  title={{Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
  pages={4379--4383},
  doi={10.21437/Interspeech.2023-1175}
}