ISCA Best Paper Awards - 2024

We would like to highlight the award-winning papers!

Each year ISCA presents three Best Student Paper awards at Interspeech, based on anonymous reviewing and the presentation at the conference. The Interspeech Area Chairs nominate candidate papers, which are assessed by a jury with representatives from the ISCA Board, the Area Chairs, and the Interspeech Technical Program Chairs. The jury is impartial: members cannot take part in the voting on any candidate paper with which they are in any way involved. Each winning paper receives 500 euros, to be split between the student authors. The Best Papers of the journals Speech Communication and Computer Speech and Language are also announced by ISCA during Interspeech.

Please see best paper awards going back to 2000 here.

ISCA Award for Best Student Paper (students in bold)





Analysis of articulatory setting for L1 and L2 English speakers using MRI data

Kevin Huang, Jack Goldberg, Louis Goldstein and Shrikanth Narayanan [pdf]


Abstract: This paper investigates the extent to which the geographical region (country) where a speaker acquired their English language affects the articulatory setting in their speech. To obtain accurate measurements for evaluating articulatory setting, we utilized a large real-time MRI corpus of vocal tract articulation. The corpus was obtained from speakers from a variety of linguistic backgrounds producing continuous English speech. We use an automated pipeline to process and extract articulatory positional information from the MRI video data. This data is used to draw comparisons between English language speakers from the United States and speakers who acquired their English in India, Korea, and China. Analysis of the speaker groups reveals statistically significant articulatory setting posture differences in multiple places of articulation.
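As a rough illustration of the kind of group comparison the abstract describes (not the authors' actual pipeline), the Python sketch below contrasts a hypothetical per-speaker articulatory measurement between two speaker groups with Welch's t-test; the group names, means, and the measure itself are stand-ins.

```python
# Illustrative sketch: comparing one articulatory posture measure between
# two speaker groups with Welch's t-test. All values are stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in data: one mean articulatory posture value per speaker.
us_speakers = rng.normal(loc=14.0, scale=1.5, size=20)     # e.g., US-English group
india_speakers = rng.normal(loc=12.8, scale=1.4, size=20)  # e.g., Indian-English group

t_stat, p_value = stats.ttest_ind(us_speakers, india_speakers, equal_var=False)
print(f"Welch's t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant difference in this articulatory measure.")
```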




A Contrastive Learning Approach to Mitigate Bias in Speech Models

Alkis Koudounas, Flavio Giobergia, Eliana Pastor and Elena Baralis [pdf]


Abstract: Speech models may be affected by performance imbalance in different population subgroups, raising concerns about fair treatment across these groups. Prior attempts to mitigate unfairness either focus on user-defined subgroups, potentially overlooking other affected subgroups, or do not explicitly improve the internal representation at the subgroup level. This paper proposes the first adoption of contrastive learning to mitigate speech model bias in underperforming subgroups. We employ a three-level learning technique that guides the model in focusing on different scopes for the contrastive loss, i.e., task, subgroup, and the errors within subgroups. The experiments on two spoken language understanding datasets and two languages demonstrate that our approach improves internal subgroup representations, thus reducing model bias and enhancing performance.
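For readers curious how a subgroup-level contrastive objective might look, here is a minimal PyTorch sketch of a supervised contrastive loss; applying it with task labels, subgroup identifiers, or within-subgroup error indicators approximates the three scopes the abstract mentions. The function name, temperature, and the combination at the end are illustrative assumptions, not the authors' code.

```python
# Minimal supervised contrastive loss: anchors are pulled toward
# embeddings that share the same label and pushed from the rest.
import torch
import torch.nn.functional as F

def sup_con_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (N, D); labels: (N,) integer ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                       # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_counts   # mean over positives
    return loss[pos_mask.any(1)].mean()                 # skip anchors w/o positives

# Hypothetical three-level usage, one term per scope:
# total = (sup_con_loss(z, task_labels)
#          + sup_con_loss(z, subgroup_ids)
#          + sup_con_loss(z, error_indicators))
```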




SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu and Helen Meng [pdf]


Abstract: In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) it can be trained on a speech-only dataset, without any alignment information; (2) it directly takes plain text as input and generates speech in an NAR manner; (3) it models speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) based on scalar quantization. SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, named the scalar latent space. Benefiting from SQ-Codec, we apply a novel transformer diffusion model in its scalar latent space. We train SimpleSpeech on 4k hours of speech-only data; the resulting model shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it offers significant improvements in speech quality and generation speed. Demos are released.
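The core mechanism named in the abstract, mapping latents into a finite scalar-quantized space, can be sketched in a few lines. The following is a generic scalar quantizer with a straight-through gradient estimator; the tanh bounding and the choice of 9 levels are assumptions made for illustration, not details taken from SQ-Codec.

```python
# Generic scalar quantization with a straight-through estimator (STE).
import torch

def scalar_quantize(h: torch.Tensor, levels: int = 9) -> torch.Tensor:
    """Map each latent dimension to one of `levels` evenly spaced values in [-1, 1]."""
    h = torch.tanh(h)                        # bound the latent to a finite range
    step = 2.0 / (levels - 1)
    q = torch.round((h + 1.0) / step) * step - 1.0
    # STE: forward pass uses q, backward pass sends gradients straight to h.
    return h + (q - h).detach()

latent = torch.randn(2, 16, requires_grad=True)  # (batch, latent_dim)
quantized = scalar_quantize(latent)
quantized.sum().backward()                       # gradients still reach `latent`
print(quantized.detach().unique().numel(), "distinct values in the quantized latent")
```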

ISCA Award for the Best Research Paper published in Computer Speech and Language (2019-2023)





Identifying Mild Cognitive Impairment and mild Alzheimer’s disease based on spontaneous speech using ASR and linguistic features

Gábor Gosztolya, Veronika Vincze, László Tóth, Magdolna Pákáski, János Kálmán, Ildikó Hoffmann, Computer Speech & Language, Volume 53, Pages 181-197, January 2019 [link]


Abstract: Alzheimer’s disease (AD) is a neurodegenerative disorder that develops for years before clinical manifestation, while mild cognitive impairment (MCI) is clinically considered a prodromal stage of AD. For both disorders, early diagnosis is crucial for timely treatment and for decelerating progression. Unfortunately, current diagnostic solutions are time-consuming. Here, we seek to exploit the observation that these illnesses frequently disturb mental and linguistic functions, which might be detected in the spontaneous speech produced by the patient. First, we present an automatic speech recognition (ASR) based procedure for extracting a special set of acoustic features. Second, we present a linguistic feature set extracted from transcripts of the same speech signals. The usefulness of the two feature sets is evaluated via machine learning experiments, where the goal is not only to differentiate patients from a healthy control group, but also to tell apart Alzheimer’s patients from those with mild cognitive impairment. Our results show that the acoustic features alone separate the various groups with accuracy scores of 74–82%. We attained similar accuracy scores using only the linguistic features. Combining the two feature types raises the accuracy scores to 80–86%, with corresponding F1 values of 78–86%. We hope that, with full automation of the processing chain, our method can serve as the basis of an automatic screening test in the future.
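To make the experimental recipe concrete, here is a hypothetical scikit-learn sketch of the general approach: concatenate acoustic and linguistic feature vectors per speaker and run a cross-validated classifier over the three groups. The feature dimensions and random stand-in data are invented for illustration; the paper's actual features come from ASR output and transcript analysis.

```python
# Sketch of early fusion of two feature sets plus cross-validated SVM
# classification into control / MCI / mild AD. Data are random stand-ins.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
n = 75                                   # speakers (stand-in)
acoustic = rng.normal(size=(n, 12))      # e.g., hesitation ratio, speech tempo, ...
linguistic = rng.normal(size=(n, 8))     # e.g., features from the transcripts
X = np.hstack([acoustic, linguistic])    # early fusion of the two feature sets
y = rng.integers(0, 3, size=n)           # 0 = control, 1 = MCI, 2 = mild AD

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```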


ISCA Award for the Best Paper published in Speech Communication (2019-2023)





CN-Celeb: Multi-genre speaker recognition

Lantian Li, Ruiqi Liu, Jiawen Kang, Yue Fan, Hao Cui, Yunqi Cai, Ravichander Vipperla, Thomas Fang Zheng, Dong Wang, Speech Communication, Volume 137, Pages 77-91, February 2022 [link]


Abstract: Research on speaker recognition is extending to address vulnerability under in-the-wild conditions, among which genre mismatch is perhaps the most challenging: for instance, enrollment with read speech while testing with conversational or singing audio. This mismatch leads to complex and composite inter-session variations, both intrinsic (e.g., speaking style, physiological status) and extrinsic (e.g., recording device, background noise). Unfortunately, the few existing multi-genre corpora are not only limited in size but also recorded under controlled conditions, and so cannot support conclusive research on the multi-genre problem. In this work, we first publish CN-Celeb, a large-scale multi-genre corpus that includes in-the-wild speech utterances from 3,000 speakers in 11 different genres. Second, using this dataset, we conduct a comprehensive study of the multi-genre phenomenon, in particular the impact of the multi-genre challenge on speaker recognition and the performance gain when the new dataset is used for multi-genre training.
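A genre-mismatch trial of the kind this corpus enables can be sketched as follows: score enrollment/test embedding pairs with cosine similarity and report the equal error rate (EER). The random embeddings and noise model below are stand-ins for real genre-mismatched speaker embeddings; only the scoring and the EER computation reflect standard verification practice.

```python
# Sketch of speaker-verification scoring under genre mismatch:
# cosine scores over target/impostor trials, summarised by the EER.
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate: point where false-accept rate == false-reject rate."""
    order = np.argsort(-scores)
    labels = labels[order]
    far = np.cumsum(1 - labels) / max(1, int((1 - labels).sum()))  # impostors accepted
    frr = 1 - np.cumsum(labels) / max(1, int(labels.sum()))        # targets rejected
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

rng = np.random.default_rng(1)
enroll = rng.normal(size=(100, 192))                       # enrollment genre
test = enroll + rng.normal(scale=1.2, size=enroll.shape)   # same speakers, other genre
scores, labels = [], []
for i in range(100):
    scores.append(cosine_score(enroll[i], test[i]))                  # target trial
    labels.append(1)
    scores.append(cosine_score(enroll[i], test[(i + 1) % 100]))      # impostor trial
    labels.append(0)
print(f"EER: {eer(np.array(scores), np.array(labels)):.2%}")
```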
