ISCA - 2025 Best Papers

ISCA - International Speech Communication Association

ISCA Best Paper Awards - 2025

We would like to highlight the award-winning papers!

Each year ISCA awards three best student papers at Interspeech, based on anonymous reviewing and presentation at the conference. The Interspeech Area Chairs nominate candidate papers, which are assessed by a jury with representatives from the ISCA Board, the Area Chairs and the Interspeech Technical Program Chairs. The jury for the best student paper award is impartial, i.e. a member cannot participate in the voting if they are in any way involved with any of the award candidates. Each paper is awarded 500 euros, to be split between the student authors. The Best Papers of the journals Speech Communication and Computer Speech and Language are also announced by ISCA during Interspeech.

Please see best paper awards going back to 2000 here.

Please see recent best paper awards: 2024.

ISCA Award for Best Student Paper (students in bold)

On the Relationship between Accent Strength and Articulatory Features

Kevin Huang, Sean Foley, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan [pdf]

Abstract: This paper explores the relationship between accent strength and articulatory features inferred from acoustic speech. To quantify accent strength, we compare phonetic transcriptions with transcriptions based on dictionary-based references, computing phoneme-level difference as a measure of accent strength. The proposed framework leverages recent self-supervised learning articulatory inversion techniques to estimate articulatory features. Analyzing a corpus of read speech from American and British English speakers, this study examines correlations between derived articulatory parameters and accent strength proxies, associating systematic articulatory differences with indexed accent strength. Results indicate that tongue positioning patterns distinguish the two dialects, with notable differences inter-dialects in rhotic and low back vowels. These findings contribute to automated accent analysis and articulatory modeling for speech processing applications.

OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning Models

Yifan Peng, Muhammad Shakeel, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin and Shinji Watanabe [pdf]

Abstract: The Open Whisper-style Speech Models (OWSM) project has developed a series of fully open speech foundation models using academic-scale resources, but their training data remains insufficient. This work enhances OWSM by integrating YODAS, a large-scale web-crawled dataset with a Creative Commons license. However, incorporating YODAS is nontrivial due to its wild nature, which introduces challenges such as incorrect language labels and audio-text misalignments. To address this, we develop a scalable data-cleaning pipeline using public toolkits, yielding a dataset with 166,000 hours of speech across 75 languages. Our new series of OWSM v4 models, trained on this curated dataset alongside existing OWSM data, significantly outperform previous versions on multilingual benchmarks. Our models even match or surpass frontier industrial models like Whisper and MMS in multiple scenarios. We will publicly release the cleaned YODAS data, pre-trained models, and all associated scripts via the ESPnet toolkit.

Attention Models and Auditory Transduction Features for Noise Robustness

Cathal Ó Faoláin and Andrew Hines [pdf]

Abstract: Human abilities surpass current speech processing systems in complex, noisy environments. While popular inputs for Automatic Speech Recognition (ASR) systems, such as raw acoustic signals and Mel spectrograms, perform well in quiet conditions, their effectiveness declines in noise. A recently developed generative WaveNet-based model emulates human auditory transduction in real time, offering alternative input features through its “IHCogram” outputs. We investigate these IHCograms across various Signal-to-Noise ratios (SNRs) using state-of-the-art feature encoders. Our findings show that IHCograms significantly enhance phoneme recognition in noisy conditions with minimal computational overhead, regardless of the model encoder used. Additionally, we introduce our Attention Feature Encoder (AFE) models, which leverage the channel structure of IHCograms and demonstrate superior size and performance compared to existing feature encoders.

ISCA Award for the Best Research Paper published in Computer Speech and Language (2020-2024)

Transfer learning from adult to children for speech recognition: Evaluation, analysis and recommendations

Prashanth Gurunath Shivakumar and Panayiotis Georgiou, Computer Speech & Language, Volume 63, 2020 [link]

Abstract: Children speech recognition is challenging mainly due to the inherent high variability in children’s physical and articulatory characteristics and expressions. This variability manifests in both acoustic constructs and linguistic usage due to the rapidly changing developmental stage in children’s life. Part of the challenge is due to the lack of large amounts of available children speech data for efficient modeling. This work attempts to address the key challenges using transfer learning from adult’s models to children’s models in a Deep Neural Network (DNN) framework for children’s Automatic Speech Recognition (ASR) task evaluating on multiple children’s speech corpora with a large vocabulary. The paper presents a systematic and an extensive analysis of the proposed transfer learning technique considering the key factors affecting children’s speech recognition from prior literature. Evaluations are presented on (i) comparisons of earlier GMM-HMM and the newer DNN Models, (ii) effectiveness of standard adaptation techniques versus transfer learning, (iii) various adaptation configurations in tackling the variabilities present in children speech, in terms of (a) acoustic spectral variability, and (b) pronunciation variability and linguistic constraints. Our Analysis spans over (i) number of DNN model parameters (for adaptation), (ii) amount of adaptation data, (iii) ages of children, (iv) age dependent-independent adaptation. Finally, we provide Recommendations on (i) the favorable strategies over various aforementioned - analyzed parameters, and (ii) potential future research directions and relevant challenges/problems persisting in DNN based ASR for children’s speech.

ISCA Award for the Best Paper published in Computer Speech and Language (2020-2024)

Turn-taking in Conversational Systems and Human-Robot Interaction: A Review

Gabriel Skantze, Computer Speech & Language, Volume 67, 2021 [link]

Abstract: The taking of turns is a fundamental aspect of dialogue. Since it is difficult to speak and listen at the same time, the participants need to coordinate who is currently speaking and when the next person can start to speak. Humans are very good at this coordination, and typically achieve fluent turn-taking with very small gaps and little overlap. Conversational systems (including voice assistants and social robots), on the other hand, typically have problems with frequent interruptions and long response delays, which has called for a substantial body of research on how to improve turn-taking in conversational systems. In this review article, we provide an overview of this research and give directions for future research. First, we provide a theoretical background of the linguistic research tradition on turn-taking and some of the fundamental concepts in theories of turn-taking. We also provide an extensive review of multi-modal cues (including verbal cues, prosody, breathing, gaze and gestures) that have been found to facilitate the coordination of turn-taking in human-human interaction, and which can be utilised for turn-taking in conversational systems. After this, we review work that has been done on modelling turn-taking, including end-of-turn detection, handling of user interruptions, generation of turn-taking cues, and multi-party human-robot interaction. Finally, we identify key areas where more research is needed to achieve fluent turn-taking in spoken interaction between man and machine.
