As prompt-based generative models have attracted considerable attention, many studies have proposed similar models for sound generation. While prompt-based generative models offer an intuitive interface that lets non-professional users experiment, they lack a more direct means of controlling the generated sounds. In this work, we investigated a simple segment-based labeling scheme for human vocalization generation, a specific subset of sound generation. By conditioning the generative models on a label sequence that marks the vocalization class of each segment, the generated sound can be controlled in a more detailed manner while maintaining a simple and intuitive input interface. Our experiments showed that switching the labeling scheme from global to segment-based does not degrade the quality of the generated samples and provides a new way to control the generation process.
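To make the conditioning idea concrete, the following Python sketch illustrates how a segment-based label sequence could replace a single global label as a model input. This is not the authors' implementation: the class names, frame rate, one-hot encoding, and the segments_to_label_sequence helper are all assumptions introduced here for illustration.

# A minimal sketch (not the paper's code) of segment-based conditioning:
# expand (start, end, class) segments into a frame-level one-hot matrix.
import numpy as np

# Hypothetical vocalization classes; the actual label set is defined by the paper's data.
CLASSES = ["silence", "speech", "laugh", "cry", "scream"]
CLASS_TO_ID = {name: i for i, name in enumerate(CLASSES)}

FRAMES_PER_SECOND = 100  # assumed frame rate of the generative model


def segments_to_label_sequence(segments, duration_sec):
    """Expand (start_sec, end_sec, class_name) segments into a
    frame-level one-hot matrix of shape (num_frames, num_classes)."""
    num_frames = int(duration_sec * FRAMES_PER_SECOND)
    labels = np.zeros((num_frames, len(CLASSES)), dtype=np.float32)
    labels[:, CLASS_TO_ID["silence"]] = 1.0  # default every frame to silence
    for start, end, name in segments:
        lo = int(start * FRAMES_PER_SECOND)
        hi = min(int(end * FRAMES_PER_SECOND), num_frames)
        labels[lo:hi] = 0.0
        labels[lo:hi, CLASS_TO_ID[name]] = 1.0
    return labels


# Global labeling: a single class covers the whole utterance.
global_condition = segments_to_label_sequence([(0.0, 3.0, "laugh")], duration_sec=3.0)

# Segment-based labeling: same interface, but the class can change mid-utterance,
# giving control over *when* each vocalization occurs.
segment_condition = segments_to_label_sequence(
    [(0.0, 1.0, "speech"), (1.2, 2.5, "laugh")], duration_sec=3.0
)
print(global_condition.shape, segment_condition.shape)  # (300, 5) (300, 5)

Both conditions have the same shape, which reflects the abstract's point that moving from a global label to a segment-based one changes only the labeling scheme, not the input interface.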
Cite as: Luong, H.-T., Yamagishi, J. (2023) Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme. Proc. INTERSPEECH 2023, 4379-4383, doi: 10.21437/Interspeech.2023-1175
@inproceedings{luong23_interspeech,
  author={Hieu-Thi Luong and Junichi Yamagishi},
  title={{Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4379--4383},
  doi={10.21437/Interspeech.2023-1175}
}