Interspeech 2020

25-29 October 2020, Shanghai

General Chair: Helen Meng, General Co-Chairs: Bo Xu and Thomas Zheng

ISSN: 1990-9772
DOI: 10.21437/Interspeech.2020

Keynote 1


The cognitive status of simple and complex models
Janet B. Pierrehumbert


ASR Neural Network Architectures I


On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition
Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu

SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition
Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin

Contextual RNN-T for Open Domain ASR
Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf

ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition
Jing Pan, Joshua Shapiro, Jeremy Wohlwend, Kyu J. Han, Tao Lei, Tao Ma

Compressing LSTM Networks with Hierarchical Coarse-Grain Sparsity
Deepak Kadetotad, Jian Meng, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo

BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example
Timo Lohrenz, Tim Fingscheidt

Relative Positional Encoding for Speech Recognition and Direct Translation
Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stüker, Jan Niehues, Alex Waibel

Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of any Number of Speakers
Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka

Implicit Transfer of Privileged Acoustic Information in a Generalized Knowledge Distillation Framework
Takashi Fukuda, Samuel Thomas

Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition
Jinhwan Park, Wonyong Sung


Multi-Channel Speech Enhancement


Deep Neural Network-Based Generalized Sidelobe Canceller for Robust Multi-Channel Speech Recognition
Guanjun Li, Shan Liang, Shuai Nie, Wenju Liu, Zhanlei Yang, Longshuai Xiao

Neural Spatio-Temporal Beamformer for Target Speech Separation
Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Chao Weng, Jianming Liu, Dong Yu

Online Directional Speech Enhancement Using Geometrically Constrained Independent Vector Analysis
Li Li, Kazuhito Koishida, Shoji Makino

End-to-End Multi-Look Keyword Spotting
Meng Yu, Xuan Ji, Bo Wu, Dan Su, Dong Yu

Differential Beamforming for Uniform Circular Array with Directional Microphones
Weilong Huang, Jinwei Feng

Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement
Jun Qi, Hu Hu, Yannan Wang, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

An End-to-End Architecture of Online Multi-Channel Speech Separation
Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Edward Lin, Yi Luo, Lei Xie

Mentoring-Reverse Mentoring for Unsupervised Multi-Channel Speech Source Separation
Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi

Computationally Efficient and Versatile Framework for Joint Optimization of Blind Speech Separation and Dereverberation
Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Shoko Araki

A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-Channel Speech Recognition in the CHiME-6 Challenge
Yan-Hui Tu, Jun Du, Lei Sun, Feng Ma, Jia Pan, Chin-Hui Lee


Speech Processing in the Brain


Identifying Causal Relationships Between Behavior and Local Brain Activity During Natural Conversation
Youssef Hmamouche, Laurent Prévot, Magalie Ochs, Thierry Chaminade

Neural Entrainment to Natural Speech Envelope Based on Subject Aligned EEG Signals
Di Zhou, Gaoyan Zhang, Jianwu Dang, Shuang Wu, Zhuo Zhang

Does Lexical Retrieval Deteriorate in Patients with Mild Cognitive Impairment? Analysis of Brain Functional Network Will Tell
Chongyuan Lian, Tianqi Wang, Mingxiao Gu, Manwa L. Ng, Feiqi Zhu, Lan Wang, Nan Yan

Congruent Audiovisual Speech Enhances Cortical Envelope Tracking During Auditory Selective Attention
Zhen Fu, Jing Chen

Contribution of RMS-Level-Based Speech Segments to Target Speech Decoding Under Noisy Conditions
Lei Wang, Ed X. Wu, Fei Chen

Cortical Oscillatory Hierarchy for Natural Sentence Processing
Bin Zhao, Jianwu Dang, Gaoyan Zhang, Masashi Unoki

Comparing EEG Analyses with Different Epoch Alignments in an Auditory Lexical Decision Experiment
Louis ten Bosch, Kimberley Mulder, Lou Boves

Detection of Subclinical Mild Traumatic Brain Injury (mTBI) Through Speech and Gait
Tanya Talkar, Sophia Yuditskaya, James R. Williamson, Adam C. Lammert, Hrishikesh Rao, Daniel Hannon, Anne O’Brien, Gloria Vergara-Diaz, Richard DeLaura, Douglas Sturim, Gregory Ciccarelli, Ross Zafonte, Jeffrey Palmer, Paolo Bonato, Thomas F. Quatieri


Speech Signal Representation


Towards Learning a Universal Non-Semantic Representation of Speech
Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Félix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, Yinnon Haviv

Poetic Meter Classification Using i-Vector-MTF Fusion
Rajeev Rajan, Aiswarya Vinod Kumar, Ben P. Babu

Formant Tracking Using Dilated Convolutional Networks Through Dense Connection with Gating Mechanism
Wang Dai, Jinsong Zhang, Yingming Gao, Wei Wei, Dengfeng Ke, Binghuai Lin, Yanlu Xie

Automatic Analysis of Speech Prosody in Dutch
Na Hu, Berit Janssen, Judith Hanssen, Carlos Gussenhoven, Aoju Chen

Learning Voice Representation Using Knowledge Distillation for Automatic Voice Casting
Adrien Gresse, Mathias Quillot, Richard Dufour, Jean-François Bonastre

Enhancing Formant Information in Spectrographic Display of Speech
B. Yegnanarayana, Anand Joseph, Vishala Pannala

Unsupervised Methods for Evaluating Speech Representations
Michael Gump, Wei-Ning Hsu, James Glass

Robust Pitch Regression with Voiced/Unvoiced Classification in Nonstationary Noise Environments
Dung N. Tran, Uros Batricevic, Kazuhito Koishida

Nonlinear ISA with Auxiliary Variables for Learning Speech Representations
Amrith Setlur, Barnabás Póczos, Alan W. Black

Harmonic Lowering for Accelerating Harmonic Convolution for Audio Signals
Hirotoshi Takeuchi, Kunio Kashino, Yasunori Ohishi, Hiroshi Saruwatari


Speech Synthesis: Neural Waveform Generation I


Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders
Yang Ai, Zhen-Hua Ling

FeatherWave: An Efficient High-Fidelity Neural Vocoder with Multi-Band Linear Prediction
Qiao Tian, Zewang Zhang, Heng Lu, Ling-Hui Chen, Shan Liu

VocGAN: A High-Fidelity Real-Time Vocoder with a Hierarchically-Nested Adversarial Network
Jinhyeok Yang, Junmo Lee, Youngik Kim, Hoon-Young Cho, Injung Kim

Lightweight LPCNet-Based Neural Vocoder with Tensor Decomposition
Hiroki Kanagawa, Yusuke Ijima

WG-WaveNet: Real-Time High-Fidelity Speech Synthesis Without GPU
Po-chun Hsu, Hung-yi Lee

What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS
Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber

Fast and Lightweight On-Device TTS with Tacotron2 and LPCNet
Vadim Popov, Stanislav Kamenev, Mikhail Kudinov, Sergey Repyevsky, Tasnima Sadekova, Vitalii Bushaev, Vladimir Kryzhanovskiy, Denis Parkhomenko

Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed
Wei Song, Guanghui Xu, Zhengchen Zhang, Chao Zhang, Xiaodong He, Bowen Zhou

Can Auditory Nerve Models Tell us What’s Different About WaveNet Vocoded Speech?
Sébastien Le Maguer, Naomi Harte

Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions
Dipjyoti Paul, Yannis Pantazis, Yannis Stylianou

Neural Homomorphic Vocoder
Zhijun Liu, Kuan Chen, Kai Yu


Automatic Speech Recognition for Non-Native Children’s Speech


Overview of the Interspeech TLT2020 Shared Task on ASR for Non-Native Children’s Speech
Roberto Gretter, Marco Matassoni, Daniele Falavigna, Keelan Evanini, Chee Wee Leong

The NTNU System at the Interspeech 2020 Non-Native Children’s Speech ASR Challenge
Tien-Hong Lo, Fu-An Chao, Shi-Yan Weng, Berlin Chen

Non-Native Children’s Automatic Speech Recognition: The INTERSPEECH 2020 Shared Task ALTA Systems
Kate M. Knill, Linlin Wang, Yu Wang, Xixin Wu, Mark J.F. Gales

Data Augmentation Using Prosody and False Starts to Recognize Non-Native Children’s Speech
Hemant Kathania, Mittul Singh, Tamás Grósz, Mikko Kurimo

UNSW System Description for the Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech
Mostafa Shahin, Renée Lu, Julien Epps, Beena Ahmed


Speaker Diarization


End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu

Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny, Aleksandr Laptev, Aleksei Romanenko

New Advances in Speaker Diarization
Hagai Aronowitz, Weizhong Zhu, Masayuki Suzuki, Gakuto Kurata, Ron Hoory

Self-Attentive Similarity Measurement Strategies in Speaker Diarization
Qingjian Lin, Yu Hou, Ming Li

Speaker Attribution with Voice Profiles by Graph-Based Semi-Supervised Learning
Jixuan Wang, Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz, Michael Brudno

Deep Self-Supervised Hierarchical Clustering for Speaker Diarization
Prachi Singh, Sriram Ganapathy

Spot the Conversation: Speaker Diarisation in the Wild
Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman


Noise Robust and Distant Speech Recognition


Learning Contextual Language Embeddings for Monaural Multi-Talker Speech Recognition
Wangyou Zhang, Yanmin Qian

Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition
Zhihao Du, Jiqing Han, Xueliang Zhang

Anti-Aliasing Regularization in Stacking Layers
Antoine Bruguier, Ananya Misra, Arun Narayanan, Rohit Prabhavalkar

Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription
Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov

End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming
Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Shinji Watanabe, Yanmin Qian

Quaternion Neural Networks for Multi-Channel Distant Speech Recognition
Xinchi Qiu, Titouan Parcollet, Mirco Ravanelli, Nicholas D. Lane, Mohamed Morchid

Improved Guided Source Separation Integrated with a Strong Back-End for the CHiME-6 Dinner Party Scenario
Hangting Chen, Pengyuan Zhang, Qian Shi, Zuozhen Liu

Neural Speech Separation Using Spatially Distributed Microphones
Dongmei Wang, Zhuo Chen, Takuya Yoshioka

Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones
Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu

Simulating Realistically-Spatialised Simultaneous Speech Using Video-Driven Speaker Detection and the CHiME-5 Dataset
Jack Deadman, Jon Barker


Speech in Multimodality


Toward Silent Paralinguistics: Speech-to-EMG — Retrieving Articulatory Muscle Activity from Speech
Catarina Botelho, Lorenz Diener, Dennis Küster, Kevin Scheck, Shahin Amiriparian, Björn W. Schuller, Tanja Schultz, Alberto Abad, Isabel Trancoso

Multimodal Deception Detection Using Automatically Extracted Acoustic, Visual, and Lexical Features
Jiaxuan Zhang, Sarah Ita Levitan, Julia Hirschberg

Multi-Modal Attention for Speech Emotion Recognition
Zexu Pan, Zhaojie Luo, Jichen Yang, Haizhou Li

WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition
Guang Shen, Riwei Lai, Rui Chen, Yu Zhang, Kejia Zhang, Qilong Han, Hongtao Song

A Multi-Scale Fusion Framework for Bimodal Speech Emotion Recognition
Ming Chen, Xudong Zhao

Group Gated Fusion on Attention-Based Bidirectional Alignment for Multimodal Emotion Recognition
Pengfei Liu, Kun Li, Helen Meng

Multi-Modal Embeddings Using Multi-Task Learning for Emotion Recognition
Aparna Khare, Srinivas Parthasarathy, Shiva Sundaram

Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network
Jeng-Lin Li, Chi-Chun Lee

Context-Dependent Domain Adversarial Neural Network for Multimodal Emotion Recognition
Zheng Lian, Jianhua Tao, Bin Liu, Jian Huang, Zhanlei Yang, Rongjun Li


Speech, Language, and Multimodal Resources


ATCSpeech: A Multilingual Pilot-Controller Speech Corpus from Real Air Traffic Control Environment
Bo Yang, Xianlong Tan, Zhengmao Chen, Bing Wang, Min Ruan, Dan Li, Zhongping Yang, Xiping Wu, Yi Lin

Developing an Open-Source Corpus of Yoruba Speech
Alexander Gutkin, Işın Demirşahin, Oddur Kjartansson, Clara Rivera, Kọ́lá Túbọ̀sún

ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers
Jung-Woo Ha, Kihyun Nam, Jingu Kang, Sang-Woo Lee, Sohee Yang, Hyunhoon Jung, Hyeji Kim, Eunmi Kim, Soojin Kim, Hyun Ah Kim, Kyoungtae Doh, Chan Kyu Lee, Nako Sung, Sunghun Kim

LAIX Corpus of Chinese Learner English: Towards a Benchmark for L2 English ASR
Yanhong Wang, Huan Luan, Jiahong Yuan, Bin Wang, Hui Lin

Design and Development of a Human-Machine Dialog Corpus for the Automated Assessment of Conversational English Proficiency
Vikram Ramanarayanan

CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment
Si-Ioi Ng, Cymie Wing-Yee Ng, Jiarui Wang, Tan Lee, Kathy Yuet-Sheung Lee, Michael Chi-Fai Tong

FinChat: Corpus and Evaluation Setup for Finnish Chat Conversations on Everyday Topics
Katri Leino, Juho Leinonen, Mittul Singh, Sami Virpioja, Mikko Kurimo

DiPCo — Dinner Party Corpus
Maarten Van Segbroeck, Ahmed Zaid, Ksenia Kutsenko, Cirenia Huerta, Tinh Nguyen, Xuewen Luo, Björn Hoffmeister, Jan Trmal, Maurizio Omologo, Roland Maas

Learning to Detect Bipolar Disorder and Borderline Personality Disorder with Language and Speech in Non-Clinical Interviews
Bo Wang, Yue Wu, Niall Taylor, Terry Lyons, Maria Liakata, Alejo J. Nevado-Holgado, Kate E.A. Saunders

FT Speech: Danish Parliament Speech Corpus
Andreas Kirkedal, Marija Stepanović, Barbara Plank



Speech Processing and Analysis


ICE-Talk: An Interface for a Controllable Expressive Talking Machine
Noé Tits, Kevin El Haddad, Thierry Dutoit

Kaldi-Web: An Installation-Free, On-Device Speech Recognition System
Mathieu Hu, Laurent Pierron, Emmanuel Vincent, Denis Jouvet

SoapBox Labs Verification Platform for Child Speech
Amelia C. Kelly, Eleni Karamichali, Armin Saeb, Karel Veselý, Nicholas Parslow, Agape Deng, Arnaud Letondor, Robert O’Regan, Qiru Zhou

SoapBox Labs Fluency Assessment Platform for Child Speech
Amelia C. Kelly, Eleni Karamichali, Armin Saeb, Karel Veselý, Nicholas Parslow, Gloria Montoya Gomez, Agape Deng, Arnaud Letondor, Niall Mullally, Adrian Hempel, Robert O’Regan, Qiru Zhou

CATOTRON — A Neural Text-to-Speech System in Catalan
Baybars Külebi, Alp Öktem, Alex Peiró-Lilja, Santiago Pascual, Mireia Farrús

Toward Remote Patient Monitoring of Speech, Video, Cognitive and Respiratory Biomarkers Using Multimodal Dialog Technology
Vikram Ramanarayanan, Oliver Roesler, Michael Neumann, David Pautler, Doug Habberstad, Andrew Cornish, Hardik Kothare, Vignesh Murali, Jackson Liscombe, Dirk Schnelle-Walka, Patrick Lange, David Suendermann-Oeft

VoiceID on the Fly: A Speaker Recognition System that Learns from Scratch
Baihan Lin, Xinxin Zhang



ASR Neural Network Architectures and Training I


Fast and Slow Acoustic Model
Kshitiz Kumar, Emilian Stoimenov, Hosam Khalil, Jian Wu

Self-Distillation for Improving CTC-Transformer-Based ASR Systems
Takafumi Moriya, Tsubasa Ochiai, Shigeki Karita, Hiroshi Sato, Tomohiro Tanaka, Takanori Ashihara, Ryo Masumura, Yusuke Shinohara, Marc Delcroix

Single Headed Attention Based Sequence-to-Sequence Model for State-of-the-Art Results on Switchboard
Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury

Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection
Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno

PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR
Yiwen Shao, Yiming Wang, Daniel Povey, Sanjeev Khudanpur

CAT: A CTC-CRF Based ASR Toolkit Bridging the Hybrid and the End-to-End Approaches Towards Data Efficiency and Low Latency
Keyu An, Hongyu Xiang, Zhijian Ou

CTC-Synchronous Training for Monotonic Attention Model
Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

Continual Learning for Multi-Dialect Acoustic Models
Brady Houston, Katrin Kirchhoff

SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition
Xingchen Song, Zhiyong Wu, Yiheng Huang, Dan Su, Helen Meng


Evaluation of Speech Technology Systems and Methods for Resource Construction and Annotation


RECOApy: Data Recording, Pre-Processing and Phonetic Transcription for End-to-End Speech-Based Applications
Adriana Stan

Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer
Yuan Shangguan, Kate Knister, Yanzhang He, Ian McGraw, Françoise Beaufays

Statistical Testing on ASR Performance via Blockwise Bootstrap
Zhe Liu, Fuchun Peng

Sentence Level Estimation of Psycholinguistic Norms Using Joint Multidimensional Annotations
Anil Ramakrishna, Shrikanth Narayanan

Neural Zero-Inflated Quality Estimation Model for Automatic Speech Recognition System
Kai Fan, Bo Li, Jiayi Wang, Shiliang Zhang, Boxing Chen, Niyu Ge, Zhijie Yan

Confidence Measures in Encoder-Decoder Models for Speech Recognition
Alejandro Woodward, Clara Bonnín, Issey Masuda, David Varas, Elisenda Bou-Balust, Juan Carlos Riveiro

Word Error Rate Estimation Without ASR Output: e-WER2
Ahmed Ali, Steve Renals

An Evaluation of Manual and Semi-Automatic Laughter Annotation
Bogdan Ludusan, Petra Wagner

Understanding Racial Disparities in Automatic Speech Recognition: The Case of Habitual “be”
Joshua L. Martin, Kevin Tang



Topics in ASR I


Augmenting Turn-Taking Prediction with Wearable Eye Activity During Conversation
Hang Li, Siyuan Chen, Julien Epps

CAM: Uninteresting Speech Detector
Weiyi Lu, Yi Xu, Peng Yang, Belinda Zeng

Mixed Case Contextual ASR Using Capitalization Masks
Diamantino Caseiro, Pat Rondon, Quoc-Nam Le The, Petar Aleksic

Speech Recognition and Multi-Speaker Diarization of Long Conversations
Huanru Henry Mao, Shuyang Li, Julian McAuley, Garrison W. Cottrell

Investigation of Data Augmentation Techniques for Disordered Speech Recognition
Mengzhe Geng, Xurong Xie, Shansong Liu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng

A Real-Time Robot-Based Auxiliary System for Risk Evaluation of COVID-19 Infection
Wenqi Wei, Jianzong Wang, Jiteng Ma, Ning Cheng, Jing Xiao

An Utterance Verification System for Word Naming Therapy in Aphasia
David S. Barbera, Mark Huckvale, Victoria Fleming, Emily Upton, Henry Coley-Fisher, Ian Shaw, William Latham, Alexander P. Leff, Jenny Crinion

Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition
Shansong Liu, Xurong Xie, Jianwei Yu, Shoukang Hu, Mengzhe Geng, Rongfeng Su, Shi-Xiong Zhang, Xunying Liu, Helen Meng

Joint Prediction of Punctuation and Disfluency in Speech Transcripts
Binghuai Lin, Liyuan Wang

Focal Loss for Punctuation Prediction
Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Ye Bai, Cunhang Fan


Large-Scale Evaluation of Short-Duration Speaker Verification


Improving X-Vector and PLDA for Text-Dependent Speaker Verification
Zhuxin Chen, Yue Lin

SdSV Challenge 2020: Large-Scale Evaluation of Short-Duration Speaker Verification
Hossein Zeinali, Kong Aik Lee, Jahangir Alam, Lukáš Burget

The XMUSPEECH System for Short-Duration Speaker Verification Challenge 2020
Tao Jiang, Miao Zhao, Lin Li, Qingyang Hong

Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020
Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim

The TalTech Systems for the Short-Duration Speaker Verification Challenge 2020
Tanel Alumäe, Jörgen Valk

Investigation of NICT Submission for Short-Duration Speaker Verification Challenge 2020
Peng Shen, Xugang Lu, Hisashi Kawai

Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization
Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck

BUT Text-Dependent Speaker Verification System for SdSV Challenge 2020
Alicia Lozano-Diez, Anna Silnova, Bhargav Pulugundla, Johan Rohdin, Karel Veselý, Lukáš Burget, Oldřich Plchot, Ondřej Glembek, Ondřej Novotný, Pavel Matějka

Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification
Vijay Ravi, Ruchao Fan, Amber Afshan, Huanhua Lu, Abeer Alwan


Voice Conversion and Adaptation I


Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning
Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai

Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition
Shaojin Ding, Guanlong Zhao, Ricardo Gutierrez-Osuna

Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN
Yanping Li, Dongxiang Xu, Yan Zhang, Yang Wang, Binbin Chen

TTS Skins: Speaker Conversion via ASR
Adam Polyak, Lior Wolf, Yaniv Taigman

GAZEV: GAN-Based Zero-Shot Voice Conversion Over Non-Parallel Speech Corpus
Zining Zhang, Bingsheng He, Zhenjie Zhang

Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation
Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Rongxiu Zhong

Unsupervised Cross-Domain Singing Voice Conversion
Adam Polyak, Lior Wolf, Yossi Adi, Yaniv Taigman

Attention-Based Speaker Embeddings for One-Shot Voice Conversion
Tatsuma Ishihara, Daisuke Saito

Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training
Jian Cong, Shan Yang, Lei Xie, Guoqiao Yu, Guanglu Wan


Acoustic Event Detection


Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging
Sixin Hong, Yuexian Zou, Wenwu Wang

Environmental Sound Classification with Parallel Temporal-Spectral Attention
Helin Wang, Yuexian Zou, Dading Chong, Wenwu Wang

Contrastive Predictive Coding of Audio with an Adversary
Luyu Wang, Kazuya Kawakami, Aaron van den Oord

Memory Controlled Sequential Self Attention for Sound Recognition
Arjun Pankajakshan, Helen L. Bear, Vinod Subramanian, Emmanouil Benetos

Dual Stage Learning Based Dynamic Time-Frequency Mask Generation for Audio Event Classification
Donghyeon Kim, Jaihyun Park, David K. Han, Hanseok Ko

An Effective Perturbation Based Semi-Supervised Learning Method for Sound Event Detection
Xu Zheng, Yan Song, Jie Yan, Li-Rong Dai, Ian McLoughlin, Lin Liu

A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling
Chieh-Chi Kao, Bowen Shi, Ming Sun, Chao Wang

Intra-Utterance Similarity Preserving Knowledge Distillation for Audio Tagging
Chun-Chieh Chang, Chieh-Chi Kao, Ming Sun, Chao Wang

Two-Stage Polyphonic Sound Event Detection Based on Faster R-CNN-LSTM with Multi-Token Connectionist Temporal Classification
Inyoung Park, Hong Kook Kim

SpeechMix — Augmenting Deep Sound Recognition Using Hidden Space Interpolations
Amit Jindal, Narayanan Elavathur Ranganatha, Aniket Didolkar, Arijit Ghosh Chowdhury, Di Jin, Ramit Sawhney, Rajiv Ratn Shah


Spoken Language Understanding I


End-to-End Neural Transformer Based Spoken Language Understanding
Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann

Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding
Chen Liu, Su Zhu, Zijian Zhao, Ruisheng Cao, Lu Chen, Kai Yu

Speech to Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces
Milind Rao, Anirudh Raju, Pranav Dheram, Bach Bui, Ariya Rastrow

Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning
Pavel Denisov, Ngoc Thang Vu

Context Dependent RNNLM for Automatic Transcription of Conversations
Srikanth Raj Chetupalli, Sriram Ganapathy

Improving End-to-End Speech-to-Intent Classification with Reptile
Yusheng Tian, Philip John Gorinski

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation
Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim

Towards an ASR Error Robust Spoken Language Understanding System
Weitong Ruan, Yaroslav Nechaev, Luoxin Chen, Chengwei Su, Imre Kiss

End-to-End Spoken Language Understanding Without Full Transcripts
Hong-Kwang J. Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, Luis Lastras

Are Neural Open-Domain Dialog Systems Robust to Speech Recognition Errors in the Dialog History? An Empirical Study
Karthik Gopalakrishnan, Behnam Hedayatnia, Longshaokan Wang, Yang Liu, Dilek Hakkani-Tür


DNN Architectures for Speaker Recognition


AutoSpeech: Neural Architecture Search for Speaker Recognition
Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, Zhangyang Wang

Densely Connected Time Delay Neural Network for Speaker Verification
Ya-Qi Yu, Wu-Jun Li

Phonetically-Aware Coupled Network for Short Duration Text-Independent Speaker Verification
Siqi Zheng, Yun Lei, Hongbin Suo

Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification Using CTC-Based Soft VAD and Global Query Attention
Myunghun Jung, Youngmoon Jung, Jahyun Goo, Hoirin Kim

Vector-Based Attentive Pooling for Text-Independent Speaker Verification
Yanfeng Wu, Chenkai Guo, Hongcan Gao, Xiaolei Hou, Jing Xu

Self-Attention Encoding and Pooling for Speaker Recognition
Pooyan Safari, Miquel India, Javier Hernando

ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification
Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Longbiao Wang, Meng Liu, Lin Zhang, Jiayu Jin, Junhai Xu

Adversarial Separation Network for Speaker Recognition
Hanyi Zhang, Longbiao Wang, Yunchun Zhang, Meng Liu, Kong Aik Lee, Jianguo Wei

Text-Independent Speaker Verification with Dual Attention Network
Jingyu Li, Tan Lee

Evolutionary Algorithm Enhanced Neural Architecture Search for Text-Independent Speaker Verification
Xiaoyang Qu, Jianzong Wang, Jing Xiao


ASR Model Training and Strategies


Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition
Chao Weng, Chengzhu Yu, Jia Cui, Chunlei Zhang, Dong Yu

Semantic Mask for Transformer Based End-to-End Speech Recognition
Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, Ming Zhou

Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces
Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, Geoffrey Zweig

A Federated Approach in Training Acoustic Models
Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez

On Semi-Supervised LF-MMI Training of Acoustic Models with Limited Data
Imran Sheikh, Emmanuel Vincent, Irina Illina

On Front-End Gain Invariant Modeling for Wake Word Spotting
Yixin Gao, Noah D. Stein, Chieh-Chi Kao, Yunliang Cai, Ming Sun, Tao Zhang, Shiv Naga Prasad Vitaladevuni

Unsupervised Regularization-Based Adaptive Training for Speech Recognition
Fenglin Ding, Wu Guo, Bin Gu, Zhen-Hua Ling, Jun Du

On the Robustness and Training Dynamics of Raw Waveform Models
Erfan Loweimi, Peter Bell, Steve Renals

Iterative Pseudo-Labeling for Speech Recognition
Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, Ronan Collobert


Speech Annotation and Speech Assessment


Smart Tube: A Biofeedback System for Vocal Training and Therapy Through Tube Phonation
Naoko Kawamura, Tatsuya Kitamura, Kenta Hamada

VCTUBE: A Library for Automatic Speech Data Annotation
Seong Choi, Seunghoon Jeong, Jeewoo Yoon, Migyeong Yang, Minsam Ko, Eunil Park, Jinyoung Han, Munyoung Lee, Seonghee Lee

A Mandarin L2 Learning APP with Mispronunciation Detection and Feedback
Yanlu Xie, Xiaoli Feng, Boxue Li, Jinsong Zhang, Yujia Jin

Rapid Enhancement of NLP Systems by Acquisition of Data in Correlated Domains
Tejas Udayakumar, Kinnera Saranu, Mayuresh Sanjay Oak, Ajit Ashok Saunshikar, Sandip Shriram Bapat

Computer-Assisted Language Learning System: Automatic Speech Evaluation for Children Learning Malay and Tamil
Ke Shi, Kye Min Tan, Richeng Duan, Siti Umairah Md. Salleh, Nur Farah Ain Suhaimi, Rajan Vellu, Ngoc Thuy Huong Helen Thai, Nancy F. Chen

Real-Time, Full-Band, Online DNN-Based Voice Conversion System Using a Single CPU
Takaaki Saeki, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

A Dynamic 3D Pronunciation Teaching Model Based on Pronunciation Attributes and Anatomy
Xiaoli Feng, Yanlu Xie, Yayue Deng, Boxue Li

End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge
Naoki Kimura, Zixiong Su, Takaaki Saeki



Anti-Spoofing and Liveness Detection


Multi-Task Siamese Neural Network for Improving Replay Attack Detection
Patrick von Platen, Fei Tao, Gokhan Tur

POCO: A Voice Spoofing and Liveness Detection Corpus Based on Pop Noise
Kosuke Akimoto, Seng Pei Liew, Sakiko Mishima, Ryo Mizushima, Kong Aik Lee

Dual-Adversarial Domain Adaptation for Generalized Replay Attack Detection
Hongji Wang, Heinrich Dinkel, Shuai Wang, Yanmin Qian, Kai Yu

Self-Supervised Pre-Training with Acoustic Configurations for Replay Spoofing Detection
Hye-jin Shim, Hee-Soo Heo, Jee-weon Jung, Ha-Jin Yu

Competency Evaluation in Voice Mimicking Using Acoustic Cues
Abhijith G., Adharsh S., Akshay P. L., Rajeev Rajan

Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks
Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li

Spoofing Attack Detection Using the Non-Linear Fusion of Sub-Band Classifiers
Hemlata Tak, Jose Patino, Andreas Nautsch, Nicholas Evans, Massimiliano Todisco

Investigating Light-ResNet Architecture for Spoofing Detection Under Mismatched Conditions
Prasanth Parasu, Julien Epps, Kaavya Sriskandaraja, Gajan Suthokumar

Siamese Convolutional Neural Network Using Gaussian Probability Feature for Spoofing Speech Detection
Zhenchun Lei, Yingen Yang, Changhong Liu, Jihua Ye


Noise Reduction and Intelligibility


Lightweight Online Noise Reduction on Embedded Devices Using Hierarchical Recurrent Neural Networks
H. Schröter, T. Rosenkranz, A.N. Escalante-B., P. Zobel, Andreas Maier

SEANet: A Multi-Modal Speech Enhancement Network
Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, Dominik Roblek

Lite Audio-Visual Speech Enhancement
Shang-Yi Chuang, Yu Tsao, Chen-Chou Lo, Hsin-Min Wang

ORCA-CLEAN: A Deep Denoising Toolkit for Killer Whale Communication
Christian Bergler, Manuel Schmitt, Andreas Maier, Simeon Smeele, Volker Barth, Elmar Nöth

A Deep Learning Approach to Active Noise Control
Hao Zhang, DeLiang Wang

Improving Speech Intelligibility Through Speaker Dependent and Independent Spectral Style Conversion
Tuan Dinh, Alexander Kain, Kris Tjaden

End-to-End Speech Intelligibility Prediction Using Time-Domain Fully Convolutional Neural Networks
Mathias B. Pedersen, Morten Kolbæk, Asger H. Andersen, Søren H. Jensen, Jesper Jensen

Predicting Intelligibility of Enhanced Speech Using Posteriors Derived from DNN-Based ASR System
Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, Toshio Irino

Automatic Estimation of Intelligibility Measure for Consonants in Speech
Ali Abavisani, Mark Hasegawa-Johnson

Large Scale Evaluation of Importance Maps in Automatic Speech Recognition
Viet Anh Trinh, Michael I. Mandel


Acoustic Scene Classification


Neural Architecture Search on Acoustic Scene Classification
Jixiang Li, Chuming Liang, Bo Zhang, Zhao Wang, Fei Xiang, Xiangxiang Chu

Acoustic Scene Classification Using Audio Tagging
Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Seung-bin Kim, Ha-Jin Yu

ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification
Liwen Zhang, Jiqing Han, Ziqiang Shi

Environment Sound Classification Using Multiple Feature Channels and Attention Based Deep Convolutional Neural Network
Jivitesh Sharma, Ole-Christoffer Granmo, Morten Goodwin

Acoustic Scene Analysis with Multi-Head Attention Networks
Weimin Wang, Weiran Wang, Ming Sun, Chao Wang

Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification
Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Chin-Hui Lee

An Acoustic Segment Model Based Segment Unit Selection Approach to Acoustic Scene Classification with Partial Utterances
Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Xue Bai, Jun Du, Chin-Hui Lee

Attention-Driven Projections for Soundscape Classification
Dhanunjaya Varma Devalraju, Muralikrishna H., Padmanabhan Rajan, Dileep Aroor Dinesh

Computer Audition for Continuous Rainforest Occupancy Monitoring: The Case of Bornean Gibbons’ Call Detection
Panagiotis Tzirakis, Alexander Shiarella, Robert Ewers, Björn W. Schuller

Deep Learning Based Open Set Acoustic Scene Classification
Zuzanna Kwiatkowska, Beniamin Kalinowski, Michał Kośmider, Krzysztof Rykaczewski


Singing Voice Computing and Processing in Music


Singing Synthesis: With a Little Help from my Attention
Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman

Peking Opera Synthesis via Duration Informed Attention Network
Yusong Wu, Shengchen Li, Chengzhu Yu, Heng Lu, Chao Weng, Liqiang Zhang, Dong Yu

DurIAN-SC: Duration Informed Attention Network Based Singing Voice Conversion System
Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Chunlei Zhang, Yusong Wu, Xiang Xie, Zijin Li, Dong Yu

Transfer Learning for Improving Singing-Voice Detection in Polyphonic Instrumental Music
Yuanbo Hou, Frank K. Soong, Jian Luan, Shengchen Li

Channel-Wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music
Haohe Liu, Lei Xie, Jian Wu, Geng Yang


Acoustic Model Adaptation for ASR


Continual Learning in Automatic Speech Recognition
Samik Sadhu, Hynek Hermansky

Speaker Adaptive Training for Speech Recognition Based on Attention-Over-Attention Mechanism
Genshun Wan, Jia Pan, Qingran Wang, Jianqing Gao, Zhongfu Ye

Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator
Yan Huang, Jinyu Li, Lei He, Wenning Wei, William Gale, Yifan Gong

Speech Transformer with Speaker Aware Persistent Memory
Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

Adaptive Speaker Normalization for CTC-Based Speech Recognition
Fenglin Ding, Wu Guo, Bin Gu, Zhen-Hua Ling, Jun Du

Unsupervised Domain Adaptation Under Label Space Mismatch for Speech Classification
Akhil Mathur, Nadia Berthouze, Nicholas D. Lane

Learning Fast Adaptation on Cross-Accented Speech Recognition
Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Peng Xu, Pascale Fung

Black-Box Adaptation of ASR for Accented Speech
Kartik Khandelwal, Preethi Jyothi, Abhijeet Awasthi, Sunita Sarawagi

Achieving Multi-Accent ASR via Unsupervised Acoustic Model Adaptation
M.A. Tuğtekin Turan, Emmanuel Vincent, Denis Jouvet

Frame-Wise Online Unsupervised Adaptation of DNN-HMM Acoustic Model from Perspective of Robust Adaptive Filtering
Ryu Takeda, Kazunori Komatani



Intelligibility-Enhancing Speech Modification


Optimization and Evaluation of an Intelligibility-Improving Signal Processing Approach (IISPA) for the Hurricane Challenge 2.0 with FADE
Marc René Schädler

iMetricGAN: Intelligibility Enhancement for Speech-in-Noise Using Generative Adversarial Network-Based Metric Learning
Haoyu Li, Szu-Wei Fu, Yu Tsao, Junichi Yamagishi

Intelligibility-Enhancing Speech Modifications — The Hurricane Challenge 2.0
Jan Rennies, Henning Schepker, Cassia Valentini-Botinhao, Martin Cooke

Exploring Listeners’ Speech Rate Preferences
Olympia Simantiraki, Martin Cooke

Adaptive Compressive Onset-Enhancement for Improved Speech Intelligibility in Noise and Reverberation
Felicitas Bederna, Henning Schepker, Christian Rollwage, Simon Doclo, Arne Pusch, Jörg Bitzer, Jan Rennies

A Sound Engineering Approach to Near End Listening Enhancement
Carol Chermaz, Simon King

Enhancing Speech Intelligibility in Text-To-Speech Synthesis Using Speaking Style Conversion
Dipjyoti Paul, Muhammed P.V. Shifas, Yannis Pantazis, Yannis Stylianou



Targeted Source Separation


SpEx+: A Complete Time Domain Speaker Extraction Network
Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

Atss-Net: Target Speaker Separation via Attention-Based Neural Network
Tingle Li, Qingjian Lin, Yuanyuan Bao, Ming Li

Multimodal Target Speech Separation with Voice and Face References
Leyuan Qu, Cornelius Weber, Stefan Wermter

X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network
Zining Zhang, Bingsheng He, Zhenjie Zhang

Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation
Chenda Li, Yanmin Qian

A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments
Yunzhe Hao, Jiaming Xu, Jing Shi, Peng Zhang, Lei Qin, Bo Xu

Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding
Jianshu Zhao, Shengzhou Gao, Takahiro Shinozaki

Listen to What You Want: Neural Network-Based Universal Sound Selector
Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki

Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels
Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi, Noboru Harada

Speaker-Aware Monaural Speech Separation
Jiahao Xu, Kun Hu, Chang Xu, Duc Chung Tran, Zhiyong Wang



Speech Translation and Multilingual/Multimodal Learning


A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions
Liming Wang, Mark Hasegawa-Johnson

Efficient Wait-k Models for Simultaneous Machine Translation
Maha Elbayad, Laurent Besacier, Jakob Verbeek

Investigating Self-Supervised Pre-Training for End-to-End Speech Translation
Ha Nguyen, Fethi Bougares, N. Tomashenko, Yannick Estève, Laurent Besacier

Contextualized Translation of Automatically Segmented Speech
Marco Gaido, Mattia A. Di Gangi, Matteo Negri, Mauro Cettolo, Marco Turchi

Self-Training for End-to-End Speech Translation
Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, Yun Tang

Evaluating and Optimizing Prosodic Alignment for Automatic Dubbing
Marcello Federico, Yogesh Virkar, Robert Enyedi, Roberto Barra-Chicote

Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, James Glass

Self-Supervised Representations Improve End-to-End Speech Translation
Anne Wu, Changhan Wang, Juan Pino, Jiatao Gu


Speaker Recognition I


Improved RawNet with Feature Map Scaling for Text-Independent Speaker Verification Using Raw Waveforms
Jee-weon Jung, Seung-bin Kim, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu

Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Youngmoon Jung, Seong Min Kye, Yeunju Choi, Myunghun Jung, Hoirin Kim

An Adaptive X-Vector Model for Text-Independent Speaker Verification
Bin Gu, Wu Guo, Fenglin Ding, Zhen-Hua Ling, Jun Du

Shouted Speech Compensation for Speaker Verification Robust to Vocal Effort Conditions
Santi Prieto, Alfonso Ortega, Iván López-Espejo, Eduardo Lleida

Sum-Product Networks for Robust Automatic Speaker Identification
Aaron Nicolson, Kuldip K. Paliwal

Segment Aggregation for Short Utterances Speaker Verification Using Raw Waveforms
Seung-bin Kim, Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu

Siamese X-Vector Reconstruction for Domain Adapted Speaker Recognition
Shai Rozenberg, Hagai Aronowitz, Ron Hoory

Speaker Re-Identification with Speaker Dependent Speech Enhancement
Yanpei Shi, Qiang Huang, Thomas Hain

Blind Speech Signal Quality Estimation for Speaker Verification Systems
Galina Lavrentyeva, Marina Volkova, Anastasia Avdeeva, Sergey Novoselov, Artem Gorlanov, Tseren Andzhukaev, Artem Ivanov, Alexander Kozlov

Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification
Xu Li, Na Li, Jinghua Zhong, Xixin Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng


Spoken Language Understanding II


Modeling ASR Ambiguity for Neural Dialogue State Tracking
Vaishali Pal, Fabien Guillot, Manish Shrivastava, Jean-Michel Renders, Laurent Besacier

ASR Error Correction with Augmented Transformer for Entity Retrieval
Haoyu Wang, Shuyan Dong, Yue Liu, James Logan, Ashish Kumar Agrawal, Yang Liu

Large-Scale Transfer Learning for Low-Resource Spoken Language Understanding
Xueli Jia, Jianzong Wang, Zhiyong Zhang, Ning Cheng, Jing Xiao

Data Balancing for Boosting Performance of Low-Frequency Classes in Spoken Language Understanding
Judith Gaspers, Quynh Do, Fabian Triefenbach

An Interactive Adversarial Reward Learning-Based Spoken Language Understanding System
Yu Wang, Yilin Shen, Hongxia Jin

Style Attuned Pre-Training and Parameter Efficient Fine-Tuning for Spoken Language Understanding
Jin Cao, Jun Wang, Wael Hamza, Kelly Vanee, Shang-Wen Li

Unsupervised Domain Adaptation for Dialogue Sequence Labeling Based on Hierarchical Adversarial Training
Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Ryo Masumura

Deep F-Measure Maximization for End-to-End Speech Understanding
Leda Sarı, Mark Hasegawa-Johnson

An Effective Domain Adaptive Post-Training Method for BERT in Response Selection
Taesun Whang, Dongyub Lee, Chanhee Lee, Kisu Yang, Dongsuk Oh, Heuiseok Lim

Confidence Measure for Speech-to-Concept End-to-End Spoken Language Understanding
Antoine Caubrière, Yannick Estève, Antoine Laurent, Emmanuel Morin


Human Speech Processing


Attention to Indexical Information Improves Voice Recall
Grant L. McGuire, Molly Babel

Categorization of Whistled Consonants by French Speakers
Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier

Whistled Vowel Identification by French Listeners
Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier

F0 Slope and Mean: Cues to Speech Segmentation in French
Maria del Mar Cordero, Fanny Meunier, Nicolas Grimault, Stéphane Pota, Elsa Spinelli

Does French Listeners’ Ability to Use Accentual Information at the Word Level Depend on the Ear of Presentation?
Amandine Michelas, Sophie Dufour

A Perceptual Study of the Five Level Tones in Hmu (Xinzhai Variety)
Wen Liu

Mandarin and English Adults’ Cue-Weighting of Lexical Stress
Zhen Zeng, Karen Mattock, Liquan Liu, Varghese Peter, Alba Tuninetti, Feng-Ming Tsao

Age-Related Differences of Tone Perception in Mandarin-Speaking Seniors
Yan Feng, Gang Peng, William Shi-Yuan Wang

Social and Functional Pressures in Vocal Alignment: Differences for Human and Voice-AI Interlocutors
Georgia Zellou, Michelle Cohn

Identifying Important Time-Frequency Locations in Continuous Speech Utterances
Hassan Salami Kavaki, Michael I. Mandel


Feature Extraction and Distant ASR


Raw Sign and Magnitude Spectra for Multi-Head Acoustic Modelling
Erfan Loweimi, Peter Bell, Steve Renals

Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations
Purvi Agrawal, Sriram Ganapathy

A Deep 2D Convolutional Network for Waveform-Based Speech Recognition
Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals

Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions
Ludwig Kürzinger, Nicolas Lindae, Palle Klewitz, Gerhard Rigoll

An Alternative to MFCCs for ASR
Pegah Ghahramani, Hossein Hadian, Daniel Povey, Hynek Hermansky, Sanjeev Khudanpur

Phase Based Spectro-Temporal Features for Building a Robust ASR System
Anirban Dutta, G. Ashishkumar, Ch.V. Rama Rao

Deep Scattering Power Spectrum Features for Robust Speech Recognition
Neethu M. Joy, Dino Oglic, Zoran Cvetkovic, Peter Bell, Steve Renals

FusionRNN: Shared Neural Parameters for Multi-Channel Distant Speech Recognition
Titouan Parcollet, Xinchi Qiu, Nicholas D. Lane

Bandpass Noise Generation and Augmentation for Unified ASR
Kshitiz Kumar, Bo Ren, Yifan Gong, Jian Wu

Deep Learning Based Dereverberation of Temporal Envelopes for Robust Speech Recognition
Anurenjan Purushothaman, Anirudh Sreeram, Rohit Kumar, Sriram Ganapathy


Voice Privacy Challenge


Introducing the VoicePrivacy Initiative
N. Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco

The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment
Andreas Nautsch, Jose Patino, N. Tomashenko, Junichi Yamagishi, Paul-Gauthier Noé, Jean-François Bonastre, Massimiliano Todisco, Nicholas Evans

X-Vector Singular Value Modification and Statistical-Based Decomposition with Ensemble Regression Modeling for Speaker Anonymization System
Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, Masashi Unoki

A Comparative Study of Speech Anonymization Metrics
Mohamed Maouche, Brij Mohan Lal Srivastava, Nathalie Vauquier, Aurélien Bellet, Marc Tommasi, Emmanuel Vincent

Design Choices for X-Vector Based Speaker Anonymization
Brij Mohan Lal Srivastava, N. Tomashenko, Xin Wang, Emmanuel Vincent, Junichi Yamagishi, Mohamed Maouche, Aurélien Bellet, Marc Tommasi

Speech Pseudonymisation Assessment Using Voice Similarity Matrices
Paul-Gauthier Noé, Jean-François Bonastre, Driss Matrouf, N. Tomashenko, Andreas Nautsch, Nicholas Evans





Acoustic Phonetics and Prosody


Correlating Cepstra with Formant Frequencies: Implications for Phonetically-Informed Forensic Voice Comparison
Vincent Hughes, Frantz Clermont, Philip Harrison

Prosody and Breathing: A Comparison Between Rhetorical and Information-Seeking Questions in German and Brazilian Portuguese
Jana Neitsch, Plinio A. Barbosa, Oliver Niebuhr

Scaling Processes of Clause Chains in Pitjantjatjara
Rebecca Defina, Catalina Torres, Hywel Stoakes

Neutralization of Voicing Distinction of Stops in Tohoku Dialects of Japanese: Field Work and Acoustic Measurements
Ai Mizoguchi, Ayako Hashimoto, Sanae Matsui, Setsuko Imatomi, Ryunosuke Kobayashi, Mafuyu Kitahara

Correlation Between Prosody and Pragmatics: Case Study of Discourse Markers in French and English
Lou Lee, Denis Jouvet, Katarina Bartkova, Yvon Keromnes, Mathilde Dargnat

An Analysis of Prosodic Prominence Cues to Information Structure in Egyptian Arabic
Dina El Zarka, Anneliese Kelterer, Barbara Schuppler

Lexical Stress in Urdu
Benazir Mumtaz, Tina Bögel, Miriam Butt

Vocal Markers from Sustained Phonation in Huntington’s Disease
Rachid Riad, Hadrien Titeux, Laurie Lemoine, Justine Montillot, Jennifer Hamet Bagnou, Xuan-Nga Cao, Emmanuel Dupoux, Anne-Catherine Bachoud-Lévi

How Rhythm and Timbre Encode Mooré Language in Bendré Drummed Speech
Laure Dentel, Julien Meyer




Speech Classification


Do Face Masks Introduce Bias in Speech Technologies? The Case of Automated Scoring of Speaking Proficiency
Anastassia Loukina, Keelan Evanini, Matthew Mulholland, Ian Blood, Klaus Zechner

A Low Latency ASR-Free End to End Spoken Language Understanding System
Mohamed Mhiri, Samuel Myer, Vikrant Singh Tomar

An Audio-Based Wakeword-Independent Verification System
Joe Wang, Rajath Kumar, Mike Rodehorst, Brian Kulis, Shiv Naga Prasad Vitaladevuni

Learnable Spectro-Temporal Receptive Fields for Robust Voice Type Discrimination
Tyler Vuong, Yangyang Xia, Richard M. Stern

Low Latency Speech Recognition Using End-to-End Prefetching
Shuo-Yiin Chang, Bo Li, David Rybach, Yanzhang He, Wei Li, Tara N. Sainath, Trevor Strohman

AutoSpeech 2020: The Second Automated Machine Learning Challenge for Speech Classification
Jingsong Wang, Tom Ko, Zhen Xu, Xiawei Guo, Souxiang Liu, Wei-Wei Tu, Lei Xie

Building a Robust Word-Level Wakeword Verification Network
Rajath Kumar, Mike Rodehorst, Joe Wang, Jiacheng Gu, Brian Kulis

A Transformer-Based Audio Captioning Model with Keyword Estimation
Yuma Koizumi, Ryo Masumura, Kyosuke Nishida, Masahiro Yasuda, Shoichiro Saito

Neural Architecture Search for Keyword Spotting
Tong Mo, Yakun Yu, Mohammad Salameh, Di Niu, Shangling Jui

Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution
Ximin Li, Xiaodong Wei, Xiaowei Qin


Speech Synthesis Paradigms and Methods I


Using Cyclic Noise as the Source Signal for Neural Source-Filter-Based Speech Waveform Model
Xin Wang, Junichi Yamagishi

Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization
Jen-Yu Liu, Yu-Hua Chen, Yin-Cheng Yeh, Yi-Hsuan Yang

Complex-Valued Variational Autoencoder: A Novel Deep Generative Model for Direct Representation of Complex Spectra
Toru Nakashika

Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding
Seungwoo Choi, Seungju Han, Dongyoung Kim, Sungjoo Ha

Reformer-TTS: Neural Speech Synthesis with Reformer Network
Hyeong Rae Ihm, Joun Yeop Lee, Byoung Jin Choi, Sung Jun Cheon, Nam Soo Kim

CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-Spectrogram Conversion
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency
Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis

DurIAN: Duration Informed Attention Network for Speech Synthesis
Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu

Multi-Speaker Text-to-Speech Synthesis Using Deep Gaussian Processes
Kentaro Mitsui, Tomoki Koriyama, Hiroshi Saruwatari

A Hybrid HMM-Waveglow Based Text-to-Speech Synthesizer Using Histogram Equalization for Low Resource Indian Languages
Mano Ranjith Kumar M., Sudhanshu Srivastava, Anusha Prakash, Hema A. Murthy


The INTERSPEECH 2020 Computational Paralinguistics ChallengE (ComParE)


The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks
Björn W. Schuller, Anton Batliner, Christian Bergler, Eva-Maria Messner, Antonia Hamilton, Shahin Amiriparian, Alice Baird, Georgios Rizos, Maximilian Schmitt, Lukas Stappen, Harald Baumeister, Alexis Deighton MacIntyre, Simone Hantke

Learning Higher Representations from Pre-Trained Deep Models with Data Augmentation for the COMPARE 2020 Challenge Mask Task
Tomoya Koike, Kun Qian, Björn W. Schuller, Yoshiharu Yamamoto

Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms
Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien

Surgical Mask Detection with Deep Recurrent Phonetic Models
Philipp Klumpp, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Paula Andrea Pérez-Toro, Florian Hönig, Elmar Nöth, Juan Rafael Orozco-Arroyave

Phonetic, Frame Clustering and Intelligibility Analyses for the INTERSPEECH 2020 ComParE Challenge
Claude Montacié, Marie-José Caraty

Exploring Text and Audio Embeddings for Multi-Dimension Elderly Emotion Recognition
Mariana Julião, Alberto Abad, Helena Moniz

Ensembling End-to-End Deep Models for Computational Paralinguistics Tasks: ComParE 2020 Mask and Breathing Sub-Challenges
Maxim Markitantov, Denis Dresvyanskiy, Danila Mamontov, Heysem Kaya, Wolfgang Minker, Alexey Karpov

Analyzing Breath Signals for the Interspeech 2020 ComParE Challenge
John Mendonça, Francisco Teixeira, Isabel Trancoso, Alberto Abad

Deep Attentive End-to-End Continuous Breath Sensing from Speech
Alexis Deighton MacIntyre, Georgios Rizos, Anton Batliner, Alice Baird, Shahin Amiriparian, Antonia Hamilton, Björn W. Schuller

Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion
Jeno Szep, Salim Hariri

Exploration of Acoustic and Lexical Cues for the INTERSPEECH 2020 Computational Paralinguistic Challenge
Ziqing Yang, Zifan An, Zehao Fan, Chengye Jing, Houwei Cao

Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition
Gizem Soğancıoğlu, Oxana Verkholyak, Heysem Kaya, Dmitrii Fedotov, Tobias Cadée, Albert Ali Salah, Alexey Karpov

Are you Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs
Nicolae-Cătălin Ristea, Radu Tudor Ionescu


Streaming ASR


1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM
Kshitiz Kumar, Chaojun Liu, Yifan Gong, Jian Wu

Low Latency End-to-End Streaming Speech Recognition with a Scout Network
Chengyi Wang, Yu Wu, Liang Lu, Shujie Liu, Jinyu Li, Guoli Ye, Ming Zhou

Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition
Gakuto Kurata, George Saon

Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition
Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He

Improved Hybrid Streaming ASR with Transformer Language Models
Pau Baquero-Arnal, Javier Jorge, Adrià Giménez, Joan Albert Silvestre-Cerdà, Javier Iranzo-Sánchez, Albert Sanchis, Jorge Civera, Alfons Juan

Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory
Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang

Enhancing Monotonic Multihead Attention for Streaming ASR
Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition
Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie

High Performance Sequence-to-Sequence Model for Streaming Speech Recognition
Thai-Son Nguyen, Ngoc-Quan Pham, Sebastian Stüker, Alex Waibel

Transfer Learning Approaches for Streaming End-to-End Speech Recognition System
Vikas Joshi, Rui Zhao, Rupesh R. Mehta, Kshitiz Kumar, Jinyu Li


Alzheimer’s Dementia Recognition Through Spontaneous Speech


Tackling the ADReSS Challenge: A Multimodal Approach to the Automated Recognition of Alzheimer’s Dementia
Matej Martinc, Senja Pollak

Disfluencies and Fine-Tuning Pre-Trained Language Models for Detection of Alzheimer’s Disease
Jiahong Yuan, Yuchen Bian, Xingyu Cai, Jiaji Huang, Zheng Ye, Kenneth Church

To BERT or not to BERT: Comparing Speech and Language-Based Approaches for Alzheimer’s Disease Detection
Aparna Balagopalan, Benjamin Eyre, Frank Rudzicz, Jekaterina Novikova

Alzheimer’s Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge
Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, Brian MacWhinney

Using State of the Art Speaker Recognition and Natural Language Processing Technologies to Detect Alzheimer’s Disease and Assess its Severity
Raghavendra Pappagari, Jaejin Cho, Laureano Moro-Velázquez, Najim Dehak

A Comparison of Acoustic and Linguistics Methodologies for Alzheimer’s Dementia Recognition
Nicholas Cummins, Yilin Pan, Zhao Ren, Julian Fritsch, Venkata Srikanth Nallanthighal, Heidi Christensen, Daniel Blackburn, Björn W. Schuller, Mathew Magimai-Doss, Helmer Strik, Aki Härmä

Multi-Modal Fusion with Gating Using Audio, Lexical and Disfluency Features for Alzheimer’s Dementia Recognition from Spontaneous Speech
Morteza Rohanian, Julian Hough, Matthew Purver

Comparing Natural Language Processing Techniques for Alzheimer’s Dementia Prediction in Spontaneous Speech
Thomas Searle, Zina Ibrahim, Richard Dobson

Multiscale System for Alzheimer’s Dementia Recognition Through Spontaneous Speech
Erik Edwards, Charles Dognin, Bajibabu Bollepalli, Maneesh Singh

The INESC-ID Multi-Modal System for the ADReSS 2020 Challenge
Anna Pompili, Thomas Rolland, Alberto Abad

Exploring MMSE Score Prediction Using Verbal and Non-Verbal Cues
Shahla Farzana, Natalie Parde

Multimodal Inductive Transfer Learning for Detection of Alzheimer’s Dementia and its Severity
Utkarsh Sarawgi, Wazeer Zulfikar, Nouran Soliman, Pattie Maes

Exploiting Multi-Modal Features from Pre-Trained Networks for Alzheimer’s Dementia Recognition
Junghyun Koo, Jie Hwan Lee, Jaewoo Pyo, Yujin Jo, Kyogu Lee

Automated Screening for Alzheimer’s Dementia Through Spontaneous Speech
Muhammad Shehram Shah Syed, Zafi Sherhan Syed, Margaret Lech, Elena Pirogova


Speaker Recognition Challenges and Applications


NEC-TT Speaker Verification System for SRE’19 CTS Challenge
Kong Aik Lee, Koji Okabe, Hitoshi Yamamoto, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Keisuke Ishikawa, Koichi Shinoda

THUEE System for NIST SRE19 CTS Challenge
Ruyun Li, Tianyu Liang, Dandan Song, Yi Liu, Yangcheng Wu, Can Xu, Peng Ouyang, Xianwei Zhang, Xianhong Chen, Wei-Qiang Zhang, Shouyi Yin, Liang He

Automatic Quality Assessment for Audio-Visual Verification Systems. The LOVe Submission to NIST SRE Challenge 2019
Grigory Antipov, Nicolas Gengembre, Olivier Le Blouch, Gaël Le Lan

Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network
Ruijie Tao, Rohan Kumar Das, Haizhou Li

Multimodal Association for Speaker Verification
Suwon Shon, James Glass

Multi-Modality Matters: A Performance Leap on VoxCeleb
Zhengyang Chen, Shuai Wang, Yanmin Qian

Cross-Domain Adaptation with Discrepancy Minimization for Text-Independent Forensic Speaker Verification
Zhenyu Wang, Wei Xia, John H.L. Hansen

Open-Set Short Utterance Forensic Speaker Verification Using Teacher-Student Network with Explicit Inductive Bias
Mufan Sang, Wei Xia, John H.L. Hansen

JukeBox: A Multilingual Singer Recognition Dataset
Anurag Chowdhury, Austin Cozzo, Arun Ross

Speaker Identification for Household Scenarios with Self-Attention and Adversarial Training
Ruirui Li, Jyun-Yu Jiang, Xian Wu, Chu-Cheng Hsieh, Andreas Stolcke


Applications of ASR


Streaming Keyword Spotting on Mobile Devices
Oleg Rybakov, Natasha Kononenko, Niranjan Subrahmanya, Mirkó Visontai, Stella Laurenzo

Metadata-Aware End-to-End Keyword Spotting
Hongyi Liu, Apurva Abhyankar, Yuriy Mishchenko, Thibaud Sénéchal, Gengshen Fu, Brian Kulis, Noah D. Stein, Anish Shah, Shiv Naga Prasad Vitaladevuni

Adversarial Audio: A New Information Hiding Method
Yehao Kong, Jiliang Zhang

S2IGAN: Speech-to-Image Generation via Adversarial Learning
Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg

Automatic Speech Recognition Benchmark for Air-Traffic Communications
Juan Zuluaga-Gomez, Petr Motlicek, Qingran Zhan, Karel Veselý, Rudolf Braun

Whisper Augmented End-to-End/Hybrid Speech Recognition System — CycleGAN Approach
Prithvi R.R. Gudepu, Gowtham P. Vadisetti, Abhishek Niranjan, Kinnera Saranu, Raghava Sarma, M. Ali Basha Shaik, Periyasamy Paramasivam

Risk Forecasting from Earnings Calls Acoustics and Network Correlations
Ramit Sawhney, Arshiya Aggarwal, Piyush Khanna, Puneet Mathur, Taru Jain, Rajiv Ratn Shah

SpecMark: A Spectral Watermarking Framework for IP Protection of Speech Recognition Systems
Huili Chen, Bita Darvish, Farinaz Koushanfar

Evaluating Automatically Generated Phoneme Captions for Images
Justin van der Hout, Zoltán D’Haese, Mark Hasegawa-Johnson, Odette Scharenborg


Single-Channel Speech Enhancement I


Singing Voice Extraction with Attention-Based Spectrograms Fusion
Hao Shi, Longbiao Wang, Sheng Li, Chenchen Ding, Meng Ge, Nan Li, Jianwu Dang, Hiroshi Seki

Incorporating Broad Phonetic Information for Speech Enhancement
Yen-Ju Lu, Chien-Feng Liao, Xugang Lu, Jeih-weih Hung, Yu Tsao

A Recursive Network with Dynamic Attention for Monaural Speech Enhancement
Andong Li, Chengshi Zheng, Cunhang Fan, Renhua Peng, Xiaodong Li

Constrained Ratio Mask for Speech Enhancement Using DNN
Hongjiang Yu, Wei-Ping Zhu, Yuhong Yang

SERIL: Noise Adaptive Speech Enhancement Using Regularization-Based Incremental Learning
Chi-Chang Lee, Yu-Chen Lin, Hsuan-Tien Lin, Hsin-Min Wang, Yu Tsao

Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder
Yoshiaki Bando, Kouhei Sekiguchi, Kazuyoshi Yoshii

Low-Latency Single Channel Speech Dereverberation Using U-Net Convolutional Neural Networks
Ahmet E. Bulut, Kazuhito Koishida

Single-Channel Speech Enhancement by Subspace Affinity Minimization
Dung N. Tran, Kazuhito Koishida

Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement
Haoyu Li, Junichi Yamagishi

NAAGN: Noise-Aware Attention-Gated Network for Speech Enhancement
Feng Deng, Tao Jiang, Xiao-Rui Wang, Chen Zhang, Yan Li


Deep Noise Suppression Challenge


Online Monaural Speech Enhancement Using Delayed Subband LSTM
Xiaofei Li, Radu Horaud

INTERSPEECH 2020 Deep Noise Suppression Challenge: A Fully Convolutional Recurrent Network (FCRN) for Joint Dereverberation and Denoising
Maximilian Strake, Bruno Defraene, Kristoff Fluyt, Wouter Tirry, Tim Fingscheidt

DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, Lei Xie

Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression
Nils L. Westhausen, Bernd T. Meyer

A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech
Jean-Marc Valin, Umut Isik, Neerad Phansalkar, Ritwik Giri, Karim Helwani, Arvindh Krishnaswamy

PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss
Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, Arvindh Krishnaswamy

The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results
Chandan K.A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, Johannes Gehrke


Voice and Hearing Disorders


The Implication of Sound Level on Spatial Selective Auditory Attention for Cochlear Implant Users: Behavioral and Electrophysiological Measurement
Sara Akbarzadeh, Sungmin Lee, Chin-Tuan Tan

Enhancing the Interaural Time Difference of Bilateral Cochlear Implants with the Temporal Limits Encoder
Yangyang Wan, Huali Zhou, Qinglin Meng, Nengheng Zheng

Speech Clarity Improvement by Vocal Self-Training Using a Hearing Impairment Simulator and its Correlation with an Auditory Modulation Index
Toshio Irino, Soichi Higashiyama, Hanako Yoshigi

Investigation of Phase Distortion on Perceived Speech Quality for Hearing-Impaired Listeners
Zhuohuang Zhang, Donald S. Williamson, Yi Shen

EEG-Based Short-Time Auditory Attention Detection Using Multi-Task Deep Learning
Zhuo Zhang, Gaoyan Zhang, Jianwu Dang, Shuang Wu, Di Zhou, Longbiao Wang

Towards Interpreting Deep Learning Models to Understand Loss of Speech Intelligibility in Speech Disorders — Step 1: CNN Model-Based Phone Classification
Sondes Abderrazek, Corinne Fredouille, Alain Ghio, Muriel Lalain, Christine Meunier, Virginie Woisard

Improving Cognitive Impairment Classification by Generative Neural Network-Based Feature Augmentation
Bahman Mirheidari, Daniel Blackburn, Ronan O’Malley, Annalena Venneri, Traci Walker, Markus Reuber, Heidi Christensen

UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech
Meredith Moore, Piyush Papreja, Michael Saxon, Visar Berisha, Sethuraman Panchanathan

Towards Automatic Assessment of Voice Disorders: A Clinical Approach
Purva Barche, Krishna Gurugubelli, Anil Kumar Vuppala

BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages
Abhishek Shivkumar, Jack Weston, Raphael Lenain, Emil Fristed


Monaural Source Separation


Separating Varying Numbers of Sources with Auxiliary Autoencoding Loss
Yi Luo, Nima Mesgarani

On Synthesis for Supervised Monaural Speech Separation in Time Domain
Jingjing Chen, Qirong Mao, Dong Liu

Learning Better Speech Representations by Worsening Interference
Jun Wang

Asteroid: The PyTorch-Based Audio Source Separation Toolkit for Researchers
Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, David Ditter, Ariel Frank, Antoine Deleforge, Emmanuel Vincent

Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation
Jingjing Chen, Qirong Mao, Dong Liu

Conv-TasSAN: Separative Adversarial Network Based on Conv-TasNet
Chengyun Deng, Yi Zhang, Shiqian Ma, Yongtao Sha, Hui Song, Xiangang Li

Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation
Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

Unsupervised Audio Source Separation Using Generative Priors
Vivek Narayanaswamy, Jayaraman J. Thiagarajan, Rushil Anirudh, Andreas Spanias


Single-Channel Speech Enhancement II


Adversarial Latent Representation Learning for Speech Enhancement
Yuanhang Qiu, Ruili Wang

An NMF-HMM Speech Enhancement Method Based on Kullback-Leibler Divergence
Yang Xiang, Liming Shi, Jesper Lisby Højvang, Morten Højfeldt Rasmussen, Mads Græsbøll Christensen

Multi-Scale TCN: Exploring Better Temporal DNN Model for Causal Speech Enhancement
Lu Zhang, Mingjiang Wang

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Quan Wang, Ignacio Lopez Moreno, Mert Saglam, Kevin Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wei Li, Jason Pelecanos, Marily Nika, Alexander Gruenstein

Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss
Ziqiang Shi, Rujie Liu, Jiqing Han

Sub-Band Knowledge Distillation Framework for Speech Enhancement
Xiang Hao, Shixue Wen, Xiangdong Su, Yun Liu, Guanglai Gao, Xiaofei Li

A Deep Learning-Based Kalman Filter for Speech Enhancement
Sujan Kumar Roy, Aaron Nicolson, Kuldip K. Paliwal

Subband Kalman Filtering with DNN Estimated Parameters for Speech Enhancement
Hongjiang Yu, Wei-Ping Zhu, Benoit Champagne

Bidirectional LSTM Network with Ordered Neurons for Speech Enhancement
Xiaoqi Li, Yaxing Li, Yuanjie Dong, Shan Xu, Zhihui Zhang, Dan Wang, Shengwu Xiong

Speaker-Conditional Chain Model for Speech Separation and Extraction
Jing Shi, Jiaming Xu, Yusuke Fujita, Shinji Watanabe, Bo Xu



Neural Signals for Spoken Communication


Combining Audio and Brain Activity for Predicting Speech Quality
Ivan Halim Parmonangan, Hiroki Tanaka, Sakriani Sakti, Satoshi Nakamura

The “Sound of Silence” in EEG — Cognitive Voice Activity Detection
Rini A. Sharon, Hema A. Murthy

Low Latency Auditory Attention Detection with Common Spatial Pattern Analysis of EEG Signals
Siqi Cai, Enze Su, Yonghao Song, Longhan Xie, Haizhou Li

Speech Spectrogram Estimation from Intracranial Brain Activity Using a Quantization Approach
Miguel Angrick, Christian Herff, Garett Johnson, Jerry Shih, Dean Krusienski, Tanja Schultz

Neural Speech Decoding for Amyotrophic Lateral Sclerosis
Debadatta Dash, Paul Ferrari, Angel Hernandez, Daragh Heitzman, Sara G. Austin, Jun Wang


Training Strategies for ASR


Semi-Supervised ASR by End-to-End Self-Training
Yang Chen, Weiran Wang, Chao Wang

Improved Training Strategies for End-to-End Speech Recognition in Digital Voice Assistants
Hitesh Tulsiani, Ashtosh Sapru, Harish Arsikere, Surabhi Punjabi, Sri Garimella

Serialized Output Training for End-to-End Overlapped Speech Recognition
Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

Semi-Supervised Learning with Data Augmentation for End-to-End ASR
Felix Weninger, Franco Mana, Roberto Gemello, Jesús Andrés-Ferrer, Puming Zhan

Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition
Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, Roland Maas

A New Training Pipeline for an Improved Neural Transducer
Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney

Improved Noisy Student Training for Automatic Speech Recognition
Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, Quoc V. Le

Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition
Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition
Dhananjaya Gowda, Ankur Kumar, Kwangyoun Kim, Hejung Yang, Abhinav Garg, Sachin Singh, Jiyeon Kim, Mehul Kumar, Sichen Jin, Shatrughan Singh, Chanwoo Kim

SCADA: Stochastic, Consistent and Adversarial Data Augmentation to Improve ASR
Gary Wang, Andrew Rosenberg, Zhehuai Chen, Yu Zhang, Bhuvana Ramabhadran, Pedro J. Moreno


Speech Transmission & Coding


Fundamental Frequency Model for Postfiltering at Low Bitrates in a Transform-Domain Speech and Audio Codec
Sneha Das, Tom Bäckström, Guillaume Fuchs

Hearing-Impaired Bio-Inspired Cochlear Models for Real-Time Auditory Applications
Arthur Van Den Broucke, Deepak Baby, Sarah Verhulst

Improving Opus Low Bit Rate Quality with Neural Speech Synthesis
Jan Skoglund, Jean-Marc Valin

A Differentiable Perceptual Audio Metric Learned from Just Noticeable Differences
Pranay Manocha, Adam Finkelstein, Richard Zhang, Nicholas J. Bryan, Gautham J. Mysore, Zeyu Jin

StoRIR: Stochastic Room Impulse Response Generation for Audio Data Augmentation
Piotr Masztalski, Mateusz Matuszewski, Karol Piaskowski, Michal Romaniuk

An Open Source Implementation of ITU-T Recommendation P.808 with Validation
Babak Naderi, Ross Cutler

DNN No-Reference PSTN Speech Quality Prediction
Gabriel Mittag, Ross Cutler, Yasaman Hosseinkashi, Michael Revow, Sriram Srinivasan, Naglakshmi Chande, Robert Aichner

Non-Intrusive Diagnostic Monitoring of Fullband Speech Quality
Sebastian Möller, Tobias Hübschen, Thilo Michael, Gabriel Mittag, Gerhard Schmidt


Bioacoustics and Articulation


Transfer Learning of Articulatory Information Through Phone Information
Abdolreza Sabzi Shahrebabaki, Negar Olfati, Sabato Marco Siniscalchi, Giampiero Salvi, Torbjørn Svendsen

Sequence-to-Sequence Articulatory Inversion Through Time Convolution of Sub-Band Frequency Signals
Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Giampiero Salvi, Torbjørn Svendsen

Discriminative Singular Spectrum Analysis for Bioacoustic Classification
Bernardo B. Gatto, Eulanda M. dos Santos, Juan G. Colonna, Naoya Sogi, Lincon S. Souza, Kazuhiro Fukui

Speech Rate Task-Specific Representation Learning from Acoustic-Articulatory Data
Renuka Mannem, Hima Jyothi R., Aravind Illa, Prasanta Kumar Ghosh

Dysarthria Detection and Severity Assessment Using Rhythm-Based Metrics
Abner Hernandez, Eun Jung Yeo, Sunhee Kim, Minhwa Chung

LungRN+NL: An Improved Adventitious Lung Sound Classification Using Non-Local Block ResNet Neural Network with Mixup Data Augmentation
Yi Ma, Xinzi Xu, Yongfu Li

Attention and Encoder-Decoder Based Models for Transforming Articulatory Movements at Different Speaking Rates
Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

Adventitious Respiratory Classification Using Attentive Residual Neural Networks
Zijiang Yang, Shuo Liu, Meishu Song, Emilia Parada-Cabaleiro, Björn W. Schuller

Surfboard: Audio Feature Extraction for Modern Machine Learning
Raphael Lenain, Jack Weston, Abhishek Shivkumar, Emil Fristed

Whisper Activity Detection Using CNN-LSTM Based Attention Pooling Network Trained for a Speaker Identification Task
Abinay Reddy Naini, Malla Satyapriya, Prasanta Kumar Ghosh


Speech Synthesis: Multilingual and Cross-Lingual Approaches


Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion
Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma

Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment
Zhaoyu Liu, Brian Mak

Dynamic Soft Windowing and Language Dependent Style Token for Code-Switching End-to-End Speech Synthesis
Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Chunyu Qiang, Tao Wang

Phonological Features for 0-Shot Multilingual Speech Synthesis
Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S. Ram Mohan, Lorenzo Foglianti, Raphael Lenain, Jiameng Gao

Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space
Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

Tone Learning in Low-Resource Bilingual TTS
Ruolan Liu, Xue Wen, Chunhui Lu, Xiao Chen

On Improving Code Mixed Speech Synthesis with Mixlingual Grapheme-to-Phoneme Model
Shubham Bansal, Arijit Mukherjee, Sandeepkumar Satpal, Rupeshkumar Mehta

Generic Indic Text-to-Speech Synthesisers with Rapid Adaptation in an End-to-End Framework
Anusha Prakash, Hema A. Murthy

Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling
Marcel de Korte, Jaebok Kim, Esther Klabbers

One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech
Tomáš Nekvinda, Ondřej Dušek


Learning Techniques for Speaker Recognition I


In Defence of Metric Learning for Speaker Recognition
Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee-Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, Icksang Han

Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
Seong Min Kye, Youngmoon Jung, Hae Beom Lee, Sung Ju Hwang, Hoirin Kim

Segment-Level Effects of Gender, Nationality and Emotion Information on Text-Independent Speaker Verification
Kai Li, Masato Akagi, Yibo Wu, Jianwu Dang

Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification
Yanpei Shi, Qiang Huang, Thomas Hain

Multi-Task Learning for Voice Related Recognition Tasks
Ana Montalvo, Jose R. Calvo, Jean-François Bonastre

Unsupervised Training of Siamese Networks for Speaker Verification
Umair Khan, Javier Hernando

An Effective Speaker Recognition Method Based on Joint Identification and Verification Supervisions
Ying Liu, Yan Song, Yiheng Jiang, Ian McLoughlin, Lin Liu, Li-Rong Dai

Speaker-Aware Linear Discriminant Analysis in Speaker Verification
Naijun Zheng, Xixin Wu, Jinghua Zhong, Xunying Liu, Helen Meng

Adversarial Domain Adaptation for Speaker Verification Using Partially Shared Network
Zhengyang Chen, Shuai Wang, Yanmin Qian



Diarization


Partial AUC Optimisation Using Recurrent Neural Networks for Music Detection with Limited Training Data
Pablo Gimeno, Victoria Mingote, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

An Open-Source Voice Type Classifier for Child-Centered Daylong Recordings
Marvin Lavechin, Ruben Bousbib, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristia

Competing Speaker Count Estimation on the Fusion of the Spectral and Spatial Embedding Space
Chao Peng, Xihong Wu, Tianshu Qu

Audio-Visual Multi-Speaker Tracking Based on the GLMB Framework
Shoufeng Lin, Xinyuan Qian

Towards Speech Robustness for Acoustic Scene Classification
Shuo Liu, Andreas Triantafyllopoulos, Zhao Ren, Björn W. Schuller

Identify Speakers in Cocktail Parties with End-to-End Attention
Junzhe Zhu, Mark Hasegawa-Johnson, Leda Sarı

Multi-Talker ASR for an Unknown Number of Sources: Joint Training of Source Counting, Separation and ASR
Thilo von Neumann, Christoph Boeddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

Attentive Convolutional Recurrent Neural Network Using Phoneme-Level Acoustic Representation for Rare Sound Event Detection
Shreya G. Upadhyay, Bo-Hao Su, Chi-Chun Lee

Detecting and Counting Overlapping Speakers in Distant Speech Scenarios
Samuele Cornell, Maurizio Omologo, Stefano Squartini, Emmanuel Vincent

All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection
Niko Moritz, Gordon Wichern, Takaaki Hori, Jonathan Le Roux


Computational Paralinguistics II


Towards Silent Paralinguistics: Deriving Speaking Mode and Speaker ID from Electromyographic Signals
Lorenz Diener, Shahin Amiriparian, Catarina Botelho, Kevin Scheck, Dennis Küster, Isabel Trancoso, Björn W. Schuller, Tanja Schultz

Predicting Collaborative Task Performance Using Graph Interlocutor Acoustic Network in Small Group Interaction
Shun-Chang Zhong, Bo-Hao Su, Wei Huang, Yi-Ching Liu, Chi-Chun Lee

Very Short-Term Conflict Intensity Estimation Using Fisher Vectors
Gábor Gosztolya

Gaming Corpus for Studying Social Screams
Hiroki Mori, Yuki Kikuchi

Speaker Discrimination in Humans and Machines: Effects of Speaking Style Variability
Amber Afshan, Jody Kreiman, Abeer Alwan

Automatic Prediction of Confidence Level from Children’s Oral Reading Recordings
Kamini Sabu, Preeti Rao

Towards a Comprehensive Assessment of Speech Intelligibility for Pathological Speech
W. Xue, V. Mendoza Ramos, W. Harmsen, Catia Cucchiarini, R.W.N.M. van Hout, Helmer Strik

Effects of Communication Channels and Actor’s Gender on Emotion Identification by Native Mandarin Speakers
Yi Lin, Hongwei Ding

Detection of Voicing and Place of Articulation of Fricatives with Deep Learning in a Virtual Speech and Language Therapy Tutor
Ivo Anjos, Maxine Eskenazi, Nuno Marques, Margarida Grilo, Isabel Guimarães, João Magalhães, Sofia Cavaco


Speech Synthesis Paradigms and Methods II


Unsupervised Learning for Sequence-to-Sequence Text-to-Speech for Low-Resource Languages
Haitong Zhang, Yue Lin

Conditional Spoken Digit Generation with StyleGAN
Kasperi Palkama, Lauri Juvela, Alexander Ilin

Towards Universal Text-to-Speech
Jingzhou Yang, Lei He

Speaker-Independent Mel-Cepstrum Estimation from Articulator Movements Using D-Vector Input
Kouichi Katsurada, Korin Richmond

Enhancing Monotonicity for Robust Autoregressive Transformer TTS
Xiangyu Liang, Zhiyong Wu, Runnan Li, Yanqing Liu, Sheng Zhao, Helen Meng

Incremental Text to Speech for Neural Sequence-to-Sequence Models Using Reinforcement Learning
Devang S. Ram Mohan, Raphael Lenain, Lorenzo Foglianti, Tian Huey Teh, Marlene Staib, Alexandra Torresquintero, Jiameng Gao

Semi-Supervised Learning for Multi-Speaker Text-to-Speech Synthesis Using Discrete Speech Representation
Tao Tu, Yuan-Jui Chen, Alexander H. Liu, Hung-yi Lee

Learning Joint Articulatory-Acoustic Representations with Normalizing Flows
Pramit Saha, Sidney Fels

Investigating Effective Additional Contextual Factors in DNN-Based Spontaneous Speech Synthesis
Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari

Hider-Finder-Combiner: An Adversarial Architecture for General Speech Signal Modification
Jacob J. Webber, Olivier Perrotin, Simon King



Single-Channel Speech Enhancement III


Noisy-Reverberant Speech Enhancement Using DenseUNet with Time-Frequency Attention
Yan Zhao, DeLiang Wang

On Loss Functions and Recurrency Training for GAN-Based Speech Enhancement Systems
Zhuohuang Zhang, Chengyun Deng, Yi Shen, Donald S. Williamson, Yongtao Sha, Yi Zhang, Hui Song, Xiangang Li

Self-Supervised Adversarial Multi-Task Learning for Vocoder-Based Monaural Speech Enhancement
Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang

Deep Speech Inpainting of Time-Frequency Masks
Mikolaj Kegler, Pierre Beckmann, Milos Cernak

Real-Time Single-Channel Deep Neural Network-Based Speech Enhancement on Edge Devices
Nikhil Shankar, Gautam Shreedhar Bhat, Issa M.S. Panahi

Improved Speech Enhancement Using a Time-Domain GAN with Mask Learning
Ju Lin, Sufeng Niu, Adriaan J. van Wijngaarden, Jerome L. McClendon, Melissa C. Smith, Kuang-Ching Wang

Real Time Speech Enhancement in the Waveform Domain
Alexandre Défossez, Gabriel Synnaeve, Yossi Adi

Efficient Low-Latency Speech Enhancement with Mobile Audio Streaming Networks
Michal Romaniuk, Piotr Masztalski, Karol Piaskowski, Mateusz Matuszewski



Computational Resource Constrained Speech Recognition


Accurate Detection of Wake Word Start and End Using a CNN
Christin Jose, Yuriy Mishchenko, Thibaud Sénéchal, Anish Shah, Alex Escott, Shiv Naga Prasad Vitaladevuni

Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering
Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir

MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition
Somshubra Majumdar, Boris Ginsburg

Iterative Compression of End-to-End ASR Model Using AutoML
Abhinav Mehrotra, Łukasz Dudziak, Jinsu Yeo, Young-yoon Lee, Ravichander Vipperla, Mohamed S. Abdelfattah, Sourav Bhattacharya, Samin Ishtiaq, Alberto Gil C.P. Ramos, SangJeong Lee, Daehyun Kim, Nicholas D. Lane

Quantization Aware Training with Absolute-Cosine Regularization for Automatic Speech Recognition
Hieu Duy Nguyen, Anastasios Alexandridis, Athanasios Mouchtaris

Streaming On-Device End-to-End ASR System for Privacy-Sensitive Voice-Typing
Abhinav Garg, Gowtham P. Vadisetti, Dhananjaya Gowda, Sichen Jin, Aditya Jayasimha, Youngho Han, Jiyeon Kim, Junmo Park, Kwangyoun Kim, Sooyeon Kim, Young-yoon Lee, Kyungbo Min, Chanwoo Kim

Scaling Up Online Speech Recognition Using ConvNets
Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition
Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang

Rescore in a Flash: Compact, Cache Efficient Hashing Data Structures for n-Gram Language Models
Grant P. Strimel, Ariya Rastrow, Gautam Tiwari, Adrien Piérard, Jon Webb


Speech Synthesis: Prosody and Emotion


Multi-Speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network
Ravi Shankar, Hsi-Wei Hsieh, Nicolas Charon, Archana Venkataraman

Non-Parallel Emotion Conversion Using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator
Ravi Shankar, Jacob Sager, Archana Venkataraman

Laughter Synthesis: Combining Seq2seq Modeling with Transfer Learning
Noé Tits, Kevin El Haddad, Thierry Dutoit

Nonparallel Emotional Speech Conversion Using VAE-GAN
Yuexin Cao, Zhengchen Liu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao

Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS
Alexander Sorin, Slava Shechtman, Ron Hoory

Converting Anyone’s Emotion: Towards Speaker-Independent Emotional Voice Conversion
Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li

Controlling the Strength of Emotions in Speech-Like Emotional Sound Generated by WaveNet
Kento Matsumoto, Sunao Hara, Masanobu Abe

Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation
Guangyan Zhang, Ying Qin, Tan Lee

Simultaneous Conversion of Speaker Identity and Emotion Based on Multiple-Domain Adaptive RBM
Takuya Kishida, Shin Tsukamoto, Toru Nakashika

Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis
Fengyu Yang, Shan Yang, Qinghua Wu, Yujun Wang, Lei Xie

Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis
Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

GAN-Based Data Generation for Speech Emotion Recognition
Sefik Emre Eskimez, Dimitrios Dimitriadis, Robert Gmyr, Kenichi Kumanati

The Phonetic Bases of Vocal Expressed Emotion: Natural versus Acted
Hira Dhamyal, Shahan Ali Memon, Bhiksha Raj, Rita Singh


The Interspeech 2020 Far Field Speaker Verification Challenge


The INTERSPEECH 2020 Far-Field Speaker Verification Challenge
Xiaoyi Qin, Ming Li, Hui Bu, Wei Rao, Rohan Kumar Das, Shrikanth Narayanan, Haizhou Li

Deep Embedding Learning for Text-Dependent Speaker Verification
Peng Zhang, Peng Hu, Xueliang Zhang

STC-Innovation Speaker Recognition Systems for Far-Field Speaker Verification Challenge 2020
Aleksei Gusev, Vladimir Volokhov, Alisa Vinogradova, Tseren Andzhukaev, Andrey Shulipa, Sergey Novoselov, Timur Pekhovsky, Alexander Kozlov

NPU Speaker Verification System for INTERSPEECH 2020 Far-Field Speaker Verification Challenge
Li Zhang, Jian Wu, Lei Xie

The JD AI Speaker Verification System for the FFSVC 2020 Challenge
Ying Tong, Wei Xue, Shanluo Huang, Lu Fan, Chao Zhang, Guohong Ding, Xiaodong He


Multimodal Speech Processing


FaceFilter: Audio-Visual Speech Separation Using Still Images
Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang

Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision
Soo-Whan Chung, Hong-Goo Kang, Joon Son Chung

Fusion Architectures for Word-Based Audiovisual Speech Recognition
Michael Wand, Jürgen Schmidhuber

Audio-Visual Multi-Channel Recognition of Overlapped Speech
Jianwei Yu, Bo Wu, Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Dong Yu, Xunying Liu, Helen Meng

TMT: A Transformer-Based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-Aware Dialog
Wubo Li, Dongwei Jiang, Wei Zou, Xiangang Li

Should we Hard-Code the Recurrence Concept or Learn it Instead? Exploring the Transformer Architecture for Audio-Visual Speech Recognition
George Sterpu, Christian Saam, Naomi Harte

Resource-Adaptive Deep Learning for Visual Speech Recognition
Alexandros Koumparoulis, Gerasimos Potamianos, Samuel Thomas, Edmilson da Silva Morais

Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks
Masood S. Mortazavi

Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion
Hong Liu, Zhan Chen, Bing Yang

Caption Alignment for Low Resource Audio-Visual Data
Vighnesh Reddy Konda, Mayur Warialani, Rakesh Prasanth Achari, Varad Bhatnagar, Jayaprakash Akula, Preethi Jyothi, Ganesh Ramakrishnan, Gholamreza Haffari, Pankaj Singh



Speech Synthesis: Neural Waveform Generation II


Vocoder-Based Speech Synthesis from Silent Videos
Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen

Quasi-Periodic Parallel WaveGAN Vocoder: A Non-Autoregressive Pitch-Dependent Dilated Convolution Model for Parametric Speech Generation
Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

A Cyclical Post-Filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-Speech Systems
Yi-Chiao Wu, Patrick Lumban Tobing, Kazuki Yasuhara, Noriyuki Matsunaga, Yamato Ohtani, Tomoki Toda

Audio Dequantization for High Fidelity Audio Generation in Flow-Based Neural Vocoder
Hyun-Wook Yoon, Sang-Hoon Lee, Hyeong-Rae Noh, Seong-Whan Lee

StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes
Manish Sharma, Tom Kenter, Rob Clark

An Efficient Subband Linear Prediction for LPCNet-Based Neural Synthesis
Yang Cui, Xi Wang, Lei He, Frank K. Soong

Reverberation Modeling for Source-Filter-Based Neural Vocoder
Yang Ai, Xin Wang, Junichi Yamagishi, Zhen-Hua Ling

Bunched LPCNet: Vocoder for Low-Cost Neural Text-To-Speech Systems
Ravichander Vipperla, Sangjun Park, Kihyun Choo, Samin Ishtiaq, Kyoungbo Min, Sourav Bhattacharya, Abhinav Mehrotra, Alberto Gil C.P. Ramos, Nicholas D. Lane

Neural Text-to-Speech with a Modeling-by-Generation Excitation Vocoder
Eunwoo Song, Min-Jae Hwang, Ryuichi Yamamoto, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim

SpeedySpeech: Efficient Neural Speech Synthesis
Jan Vainer, Ondřej Dušek


ASR Neural Network Architectures and Training II


Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution
Zi-qiang Zhang, Yan Song, Jian-shu Zhang, Ian McLoughlin, Li-Rong Dai

Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models
Ashtosh Sapru, Sri Garimella

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability
Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong

End-to-End ASR with Adaptive Span Self-Attention
Xuankai Chang, Aswin Shanmugam Subramanian, Pengcheng Guo, Shinji Watanabe, Yuya Fujita, Motoi Omachi

Subword Regularization: An Analysis of Scalability and Generalization for End-to-End Automatic Speech Recognition
Egor Lakomkin, Jahn Heymann, Ilya Sklyar, Simon Wiesler

Early Stage LM Integration Using Local and Global Log-Linear Combination
Wilfried Michel, Ralf Schlüter, Hermann Ney

ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu

Emitting Word Timings with End-to-End Models
Tara N. Sainath, Ruoming Pang, David Rybach, Basi García, Trevor Strohman

Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Danni Liu, Gerasimos Spanakis, Jan Niehues


Neural Networks for Language Modeling


Neural Language Modeling with Implicit Cache Pointers
Ke Li, Daniel Povey, Sanjeev Khudanpur

Finnish ASR with Deep Transformer Models
Abhilash Jain, Aku Rouhe, Stig-Arne Grönroos, Mikko Kurimo

Distilling the Knowledge of BERT for Sequence-to-Sequence ASR
Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

Stochastic Convolutional Recurrent Networks for Language Modeling
Jen-Tzung Chien, Yu-Min Huang

Investigation of Large-Margin Softmax in Neural Language Modeling
Jingjing Huo, Yingbo Gao, Weiyue Wang, Ralf Schlüter, Hermann Ney

Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model
Da-Rong Liu, Chunxi Liu, Frank Zhang, Gabriel Synnaeve, Yatharth Saraf, Geoffrey Zweig

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, Tetsunori Kobayashi

Insertion-Based Modeling for End-to-End Automatic Speech Recognition
Yuya Fujita, Shinji Watanabe, Motoi Omachi, Xuankai Chang


Phonetic Event Detection and Segmentation


Voice Activity Detection in the Wild via Weakly Supervised Sound Event Detection
Yefei Chen, Heinrich Dinkel, Mengyue Wu, Kai Yu

Dual Attention in Time and Frequency Domain for Voice Activity Detection
Joohyung Lee, Youngmoon Jung, Hoirin Kim

Polishing the Classical Likelihood Ratio Test by Supervised Learning for Voice Activity Detection
Tianjiao Xu, Hui Zhang, Xueliang Zhang

A Noise Robust Technique for Detecting Vowels in Speech Signals
Avinash Kumar, S. Shahnawazuddin, Waquar Ahmad

End-to-End Domain-Adversarial Voice Activity Detection
Marvin Lavechin, Marie-Philippe Gill, Ruben Bousbib, Hervé Bredin, Leibny Paola Garcia-Perera

VOP Detection in Variable Speech Rate Condition
Ayush Agarwal, Jagabandhu Mishra, S.R. Mahadeva Prasanna

MLNET: An Adaptive Multiple Receptive-Field Attention Neural Network for Voice Activity Detection
Zhenpeng Zheng, Jianzong Wang, Ning Cheng, Jian Luo, Jing Xiao

Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
Felix Kreuk, Joseph Keshet, Yossi Adi

That Sounds Familiar: An Analysis of Phonetic Representations Transfer Across Languages
Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Analyzing Read Aloud Speech by Primary School Pupils: Insights for Research and Development
S. Limonard, Catia Cucchiarini, R.W.N.M. van Hout, Helmer Strik




Learning Techniques for Speaker Recognition II


Dynamic Margin Softmax Loss for Speaker Verification
Dao Zhou, Longbiao Wang, Kong Aik Lee, Yibo Wu, Meng Liu, Jianwu Dang, Jianguo Wei

On Parameter Adaptation in Softmax-Based Cross-Entropy Loss for Improved Convergence Speed and Accuracy in DNN-Based Speaker Recognition
Magdalena Rybicka, Konrad Kowalczyk

Training Speaker Enrollment Models by Network Optimization
Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida

Supervised Domain Adaptation for Text-Independent Speaker Verification Using Limited Data
Seyyed Saeed Sarfjoo, Srikanth Madikeri, Petr Motlicek, Sébastien Marcel

Angular Margin Centroid Loss for Text-Independent Speaker Recognition
Yuheng Wei, Junzhao Du, Hui Liu

Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning
Jiawen Kang, Ruiqi Liu, Lantian Li, Yunqi Cai, Dong Wang, Thomas Fang Zheng

ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck

Length- and Noise-Aware Training Techniques for Short-Utterance Speaker Recognition
Wenda Chen, Jonathan Huang, Tobias Bocklet


Spoken Language Evaluation


Spoken Language ‘Grammatical Error Correction’
Yiting Lu, Mark J.F. Gales, Yu Wang

Mixtures of Deep Neural Experts for Automated Speech Scoring
Sara Papi, Edmondo Trentin, Roberto Gretter, Marco Matassoni, Daniele Falavigna

Targeted Content Feedback in Spoken Language Learning and Assessment
Xinhao Wang, Klaus Zechner, Christopher Hamill

Universal Adversarial Attacks on Spoken Language Assessment Systems
Vyas Raina, Mark J.F. Gales, Kate M. Knill

Ensemble Approaches for Uncertainty in Spoken Language Assessment
Xixin Wu, Kate M. Knill, Mark J.F. Gales, Andrey Malinin

Shadowability Annotation with Fine Granularity on L2 Utterances and its Improvement with Native Listeners’ Script-Shadowing
Zhenchao Lin, Ryo Takashima, Daisuke Saito, Nobuaki Minematsu, Noriko Nakanishi

ASR-Based Evaluation and Feedback for Individualized Reading Practice
Yu Bai, Ferdy Hubers, Catia Cucchiarini, Helmer Strik

Domain Adversarial Neural Networks for Dysarthric Speech Recognition
Dominika Woszczyk, Stavros Petridis, David Millard

Automatic Estimation of Pathological Voice Quality Based on Recurrent Neural Network Using Amplitude and Phase Spectrogram
Shunsuke Hidaka, Yogaku Lee, Kohei Wakamiya, Takashi Nakagawa, Tokihiko Kaburagi


Spoken Dialogue System


Stochastic Curiosity Exploration for Dialogue Systems
Jen-Tzung Chien, Po-Chien Hsu

Conditional Response Augmentation for Dialogue Using Knowledge Distillation
Myeongho Jeong, Seungtaek Choi, Hojae Han, Kyungho Kim, Seung-won Hwang

Prototypical Q Networks for Automatic Conversational Diagnosis and Few-Shot New Disease Adaption
Hongyin Luo, Shang-Wen Li, James Glass

End-to-End Task-Oriented Dialog System Through Template Slot Value Generation
Teakgyu Hong, Oh-Woog Kwon, Young-Kil Kim

Task-Oriented Dialog Generation with Enhanced Entity Representation
Zhenhao He, Jiachun Wang, Jian Chen

End-to-End Speech-to-Dialog-Act Recognition
Viet-Trung Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, Tatsuya Kawahara

Discriminative Transfer Learning for Optimizing ASR and Semantic Labeling in Task-Oriented Spoken Dialog
Yao Qian, Yu Shi, Michael Zeng

Datasets and Benchmarks for Task-Oriented Log Dialogue Ranking Task
Xinnuo Xu, Yizhe Zhang, Lars Liden, Sungjin Lee



Speech Synthesis: Toward End-to-End Synthesis


From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint
Zexin Cai, Chuxiong Zhang, Ming Li

Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?
Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi

Non-Autoregressive End-to-End TTS with Coarse-to-Fine Decoding
Tao Wang, Xuefei Liu, Jianhua Tao, Jiangyan Yi, Ruibo Fu, Zhengqi Wen

Bi-Level Speaker Supervision for One-Shot Speech Synthesis
Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Chunyu Qiang

Naturalness Enhancement with Linguistic Information in End-to-End TTS Using Unsupervised Parallel Encoding
Alex Peiró-Lilja, Mireia Farrús

MoBoAligner: A Neural Alignment Model for Non-Autoregressive TTS with Monotonic Boundary Search
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou

JDI-T: Jointly Trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment
Dan Lim, Won Jang, Gyeonghwan O, Heayoung Park, Bongwan Kim, Jaesam Yoon

End-to-End Text-to-Speech Synthesis with Unaligned Multiple Language Units Based on Attention
Masashi Aso, Shinnosuke Takamichi, Hiroshi Saruwatari

Attention Forcing for Speech Synthesis
Qingyun Dou, Joshua Efiong, Mark J.F. Gales

Testing the Limits of Representation Mixing for Pronunciation Correction in End-to-End Speech Synthesis
Jason Fong, Jason Taylor, Simon King

MultiSpeech: Multi-Speaker Text to Speech with Transformer
Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin


Speech Enhancement, Bandwidth Extension and Hearing Aids


Exploiting Conic Affinity Measures to Design Speech Enhancement Systems Operating in Unseen Noise Conditions
Pavlos Papadopoulos, Shrikanth Narayanan

Adversarial Dictionary Learning for Monaural Speech Enhancement
Yunyun Ji, Longting Xu, Wei-Ping Zhu

Semi-Supervised Self-Produced Speech Enhancement and Suppression Based on Joint Source Modeling of Air- and Body-Conducted Signals Using Variational Autoencoder
Shogo Seki, Moe Takada, Tomoki Toda

Spatial Covariance Matrix Estimation for Reverberant Speech with Application to Speech Enhancement
Ran Weisman, Vladimir Tourbabin, Paul Calamia, Boaz Rafaely

A Cross-Channel Attention-Based Wave-U-Net for Multi-Channel Speech Enhancement
Minh Tri Ho, Jinyoung Lee, Bong-Ki Lee, Dong Hoon Yi, Hong-Goo Kang

TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids
Igor Fedorov, Marko Stamenovic, Carl Jensen, Li-Chia Yang, Ari Mandell, Yiming Gan, Matthew Mattina, Paul N. Whatmough

Intelligibility Enhancement Based on Speech Waveform Modification Using Hearing Impairment
Shu Hikosaka, Shogo Seki, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, Hideki Banno, Tomoki Toda

Speaker and Phoneme-Aware Speech Bandwidth Extension with Residual Dual-Path Network
Nana Hou, Chenglin Xu, Van Tung Pham, Joey Tianyi Zhou, Eng Siong Chng, Haizhou Li

Multi-Task Learning for End-to-End Noise-Robust Bandwidth Extension
Nana Hou, Chenglin Xu, Joey Tianyi Zhou, Eng Siong Chng, Haizhou Li

Phase-Aware Music Super-Resolution Using Generative Adversarial Networks
Shichao Hu, Bin Zhang, Beici Liang, Ethan Zhao, Simon Lui


The Attacker’s Perspective on Automatic Speaker Verification


The Attacker’s Perspective on Automatic Speaker Verification: An Overview
Rohan Kumar Das, Xiaohai Tian, Tomi Kinnunen, Haizhou Li

Extrapolating False Alarm Rates in Automatic Speaker Verification
Alexey Sholokhov, Tomi Kinnunen, Ville Vestman, Kong Aik Lee

Self-Supervised Spoofing Audio Detection Scheme
Ziyue Jiang, Hongcheng Zhu, Li Peng, Wenbing Ding, Yanzhen Ren

Inaudible Adversarial Perturbations for Targeted Attack in Speaker Recognition
Qing Wang, Pengcheng Guo, Lei Xie

x-Vectors Meet Adversarial Attacks: Benchmarking Adversarial Robustness in Speaker Verification
Jesús Villalba, Yuekai Zhang, Najim Dehak

Black-Box Attacks on Spoofing Countermeasures Using Transferability of Adversarial Examples
Yuekai Zhang, Ziyan Jiang, Jesús Villalba, Najim Dehak


Summarization, Semantic Analysis and Classification


Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks
Krishna D. N., Ankita Patil

Abstractive Spoken Document Summarization Using Hierarchical Model with Multi-Stage Attention Diversity Optimization
Potsawee Manakul, Mark J.F. Gales, Linlin Wang

Improved Learning of Word Embeddings with Word Definitions and Semantic Injection
Yichi Zhang, Yinpei Dai, Zhijian Ou, Huixin Wang, Junlan Feng

Wake Word Detection with Alignment-Free Lattice-Free MMI
Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models
Thai Binh Nguyen, Quang Minh Nguyen, Thi Thu Hien Nguyen, Quoc Truong Do, Chi Mai Luong

End-to-End Named Entity Recognition from English Speech
Hemant Yadav, Sreyan Ghosh, Yi Yu, Rajiv Ratn Shah

Semantic Complexity in End-to-End Spoken Language Understanding
Joseph P. McKenna, Samridhi Choudhary, Michael Saxon, Grant P. Strimel, Athanasios Mouchtaris

Analysis of Disfluency in Children’s Speech
Trang Tran, Morgan Tinkler, Gary Yeung, Abeer Alwan, Mari Ostendorf

Representation Based Meta-Learning for Few-Shot Spoken Intent Recognition
Ashish Mittal, Samarth Bharadwaj, Shreya Khare, Saneem Chemmengath, Karthik Sankaranarayanan, Brian Kingsbury

Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation
Rishika Agarwal, Xiaochuan Niu, Pranay Dighe, Srikanth Vishnubhotla, Sameer Badaskar, Devang Naik


Speaker Recognition II


Speaker-Utterance Dual Attention for Speaker and Utterance Verification
Tianchi Liu, Rohan Kumar Das, Maulik Madhavi, Shengmei Shen, Haizhou Li

Adversarial Separation and Adaptation Network for Far-Field Speaker Verification
Lu Yi, Man-Wai Mak

MIRNet: Learning Multiple Identities Representations in Overlapped Speech
Hyewon Han, Soo-Whan Chung, Hong-Goo Kang

Strategies for End-to-End Text-Independent Speaker Verification
Weiwei Lin, Man-Wai Mak, Jen-Tzung Chien

Why Did the x-Vector System Miss a Target Speaker? Impact of Acoustic Mismatch Upon Target Score on VoxCeleb Data
Rosa González Hautamäki, Tomi Kinnunen

Variable Frame Rate-Based Data Augmentation to Handle Speaking-Style Variability for Automatic Speaker Verification
Amber Afshan, Jinxi Guo, Soo Jin Park, Vijay Ravi, Alan McCree, Abeer Alwan

A Machine of Few Words: Interactive Speaker Recognition with Reinforcement Learning
Mathieu Seurin, Florian Strub, Philippe Preux, Olivier Pietquin

Improving On-Device Speaker Verification Using Federated Learning with Privacy
Filip Granqvist, Matt Seigel, Rogier van Dalen, Áine Cahill, Stephen Shum, Matthias Paulik

Neural PLDA Modeling for End-to-End Speaker Verification
Shreyas Ramoji, Prashant Krishnan, Sriram Ganapathy


General Topics in Speech Recognition


State Sequence Pooling Training of Acoustic Models for Keyword Spotting
Kuba Łopatka, Tobias Bocklet

Training Keyword Spotting Models on Non-IID Data with Federated Learning
Andrew Hard, Kurt Partridge, Cameron Nguyen, Niranjan Subrahmanya, Aishanee Shah, Pai Zhu, Ignacio Lopez Moreno, Rajiv Mathews

Class LM and Word Mapping for Contextual Biasing in End-to-End ASR
Rongqing Huang, Ossama Abdel-hamid, Xinwei Li, Gunnar Evermann

Do End-to-End Speech Recognition Models Care About Context?
Lasse Borgholt, Jakob D. Havtorn, Željko Agić, Anders Søgaard, Lars Maaløe, Christian Igel

Utterance Confidence Measure for End-to-End Speech Recognition with Applications to Distributed Speech Recognition Scenarios
Ankur Kumar, Sachin Singh, Dhananjaya Gowda, Abhinav Garg, Shatrughan Singh, Chanwoo Kim

Speaker Code Based Speaker Adaptive Training Using Model Agnostic Meta-Learning
Huaxin Wu, Genshun Wan, Jia Pan

Domain Adaptation Using Class Similarity for Robust Speech Recognition
Han Zhu, Jiangjiang Zhao, Yuling Ren, Li Wang, Pengyuan Zhang

Incremental Machine Speech Chain Towards Enabling Listening While Speaking in Real-Time
Sashi Novitasari, Andros Tjandra, Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura

Context-Dependent Acoustic Modeling Without Explicit Phone Clustering
Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney

Voice Conversion Based Data Augmentation to Improve Children’s Speech Recognition in Limited Data Scenario
S. Shahnawazuddin, Nagaraj Adiga, Kunal Kumar, Aayushi Poddar, Waquar Ahmad


Speech Synthesis: Prosody Modeling


CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
Sri Karlapati, Alexis Moinet, Arnaud Joly, Viacheslav Klimkov, Daniel Sáez-Trigueros, Thomas Drugman

Joint Detection of Sentence Stress and Phrase Boundary for Prosody
Binghuai Lin, Liyuan Wang, Xiaoli Feng, Jinsong Zhang

Transfer Learning of the Expressivity Using FLOW Metric Learning in Multispeaker Text-to-Speech Synthesis
Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet

Speaking Speed Control of End-to-End Speech Synthesis Using Sentence-Level Conditioning
Jae-Sung Bae, Hanbin Bae, Young-Sun Joo, Junmo Lee, Gyeong-Hoon Lee, Hoon-Young Cho

Dynamic Prosody Generation for Speech Synthesis Using Linguistics-Driven Acoustic Embedding Selection
Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba

Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model
Tom Kenter, Manish Sharma, Rob Clark

Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction
Yi Zhao, Haoyu Li, Cheng-I Lai, Jennifer Williams, Erica Cooper, Junichi Yamagishi

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit
Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao

Discriminative Method to Extract Coarse Prosodic Structure and its Application for Statistical Phrase/Accent Command Estimation
Yuma Shirahata, Daisuke Saito, Nobuaki Minematsu

Controllable Neural Text-to-Speech Synthesis Using Intuitive Prosodic Features
Tuomo Raitio, Ramya Rasipuram, Dan Castellani

Controllable Neural Prosody Synthesis
Max Morrison, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore

Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency
Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song

Interactive Text-to-Speech System via Joint Style Analysis
Yang Gao, Weiyi Zheng, Zhaojun Yang, Thilo Köhler, Christian Fuegen, Qing He


Speech in Health II


Squeeze for Sneeze: Compact Neural Networks for Cold and Flu Recognition
Merlin Albes, Zhao Ren, Björn W. Schuller, Nicholas Cummins

Extended Study on the Use of Vocal Tract Variables to Quantify Neuromotor Coordination in Depression
Nadee Seneviratne, James R. Williamson, Adam C. Lammert, Thomas F. Quatieri, Carol Espy-Wilson

Affective Conditioning on Hierarchical Attention Networks Applied to Depression Detection from Transcribed Clinical Interviews
Danai Xezonaki, Georgios Paraskevopoulos, Alexandros Potamianos, Shrikanth Narayanan

Domain Adaptation for Enhancing Speech-Based Depression Detection in Natural Environmental Conditions Using Dilated CNNs
Zhaocheng Huang, Julien Epps, Dale Joachim, Brian Stasak, James R. Williamson, Thomas F. Quatieri

Making a Distinction Between Schizophrenia and Bipolar Disorder Based on Temporal Parameters in Spontaneous Speech
Gábor Gosztolya, Anita Bagi, Szilvia Szalóki, István Szendi, Ildikó Hoffmann

Prediction of Sleepiness Ratings from Voice by Man and Machine
Mark Huckvale, András Beke, Mirei Ikushima

Tongue and Lip Motion Patterns in Alaryngeal Speech
Kristin J. Teplansky, Alan Wisler, Beiming Cao, Wendy Liang, Chad W. Whited, Ted Mau, Jun Wang

Autoencoder Bottleneck Features with Multi-Task Optimisation for Improved Continuous Dysarthric Speech Recognition
Zhengjun Yue, Heidi Christensen, Jon Barker

Raw Speech Waveform Based Classification of Patients with ALS, Parkinson’s Disease and Healthy Controls Using CNN-BLSTM
Jhansi Mallela, Aravind Illa, Yamini Belur, Nalini Atchayaram, Ravi Yadav, Pradeep Reddy, Dipanjan Gope, Prasanta Kumar Ghosh

Assessment of Parkinson’s Disease Medication State Through Automatic Speech Analysis
Anna Pompili, Rubén Solera-Ureña, Alberto Abad, Rita Cardoso, Isabel Guimarães, Margherita Fabbri, Isabel P. Martins, Joaquim Ferreira


Speech and Audio Quality Assessment


Improving Replay Detection System with Channel Consistency DenseNeXt for the ASVspoof 2019 Challenge
Chao Zhang, Junjie Cheng, Yanmei Gu, Huacan Wang, Jun Ma, Shaojun Wang, Jing Xiao

Subjective Quality Evaluation of Speech Signals Transmitted via BPL-PLC Wired System
Przemyslaw Falkowski-Gilski, Grzegorz Debita, Marcin Habrych, Bogdan Miedzinski, Przemyslaw Jedlikowski, Bartosz Polnik, Jan Wandzio, Xin Wang

Investigating the Visual Lombard Effect with Gabor Based Features
Waito Chiu, Yan Xu, Andrew Abel, Chun Lin, Zhengzheng Tu

Exploration of Audio Quality Assessment and Anomaly Localisation Using Attention Models
Qiang Huang, Thomas Hain

Development of a Speech Quality Database Under Uncontrolled Conditions
Alessandro Ragano, Emmanouil Benetos, Andrew Hines

Evaluating the Reliability of Acoustic Speech Embeddings
Robin Algayres, Mohamed Salah Zaiem, Benoît Sagot, Emmanuel Dupoux

Frame-Level Signal-to-Noise Ratio Estimation Using Deep Learning
Hao Li, DeLiang Wang, Xueliang Zhang, Guanglai Gao

A Pyramid Recurrent Network for Predicting Crowdsourced Speech-Quality Ratings of Real-World Signals
Xuan Dong, Donald S. Williamson

Effect of Spectral Complexity Reduction and Number of Instruments on Musical Enjoyment with Cochlear Implants
Avamarie Brueggeman, John H.L. Hansen

Spectrum Correction: Acoustic Scene Classification with Mismatched Recording Devices
Michał Kośmider


Privacy and Security in Speech Communication


Distributed Summation Privacy for Speech Enhancement
Matt O’Connor, W. Bastiaan Kleijn

Perception of Privacy Measured in the Crowd — Paired Comparison on the Effect of Background Noises
Anna Leschanowsky, Sneha Das, Tom Bäckström, Pablo Pérez Zarazaga

Hide and Speak: Towards Deep Neural Networks for Speech Steganography
Felix Kreuk, Yossi Adi, Bhiksha Raj, Rita Singh, Joseph Keshet

Detecting Adversarial Examples for Speech Recognition via Uncertainty Quantification
Sina Däubener, Lea Schönherr, Asja Fischer, Dorothea Kolossa

Privacy Guarantees for De-Identifying Text Transformations
David Ifeoluwa Adelani, Ali Davody, Thomas Kleinbauer, Dietrich Klakow

Detecting Audio Attacks on ASR Systems with Dropout Uncertainty
Tejas Jayashankar, Jonathan Le Roux, Pierre Moulin


Voice Conversion and Adaptation II


Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining
Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

Nonparallel Training of Exemplar-Based Voice Conversion System Using INCA-Based Alignment Technique
Hitoshi Suda, Gaku Kotani, Daisuke Saito

Enhancing Intelligibility of Dysarthric Speech Using Gated Convolutional-Based Voice Conversion System
Chen-Yu Chen, Wei-Zhong Zheng, Syu-Siang Wang, Yu Tsao, Pei-Chun Li, Ying-Hui Lai

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture
Da-Yi Wu, Yen-Hao Chen, Hung-yi Lee

Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion Without Parallel Data
Seung-won Park, Doo-young Kim, Myun-chul Joe

Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis
Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Tao Wang, Chunyu Qiang

ARVC: An Auto-Regressive Voice Conversion System Without Parallel Training Data
Zheng Lian, Zhengqi Wen, Xinyong Zhou, Songbai Pu, Shengkai Zhang, Jianhua Tao

Improved Zero-Shot Voice Conversion Using Explicit Conditioning Signals
Shahan Nercessian

Non-Parallel Voice Conversion with Fewer Labeled Data by Conditional Generative Adversarial Networks
Minchuan Chen, Weijian Hou, Jun Ma, Shaojun Wang, Jing Xiao

Transferring Source Style in Non-Parallel Voice Conversion
Songxiang Liu, Yuewen Cao, Shiyin Kang, Na Hu, Xunying Liu, Dan Su, Dong Yu, Helen Meng

Voice Conversion Using Speech-to-Speech Neuro-Style Transfer
Ehab A. AlBadawy, Siwei Lyu


Multilingual and Code-Switched ASR


Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation
Changhan Wang, Juan Pino, Jiatao Gu

Transliteration Based Data Augmentation for Training Multilingual ASR Acoustic Models in Low Resource Settings
Samuel Thomas, Kartik Audhkhasi, Brian Kingsbury

Multilingual Speech Recognition with Self-Attention Structured Parameterization
Yun Zhu, Parisa Haghani, Anshuman Tripathi, Bhuvana Ramabhadran, Brian Farris, Hainan Xu, Han Lu, Hasim Sak, Isabel Leal, Neeraj Gaur, Pedro J. Moreno, Qian Zhang

Lattice-Free Maximum Mutual Information Training of Multilingual Speech Recognition Systems
Srikanth Madikeri, Banriskhem K. Khonglah, Sibo Tong, Petr Motlicek, Hervé Bourlard, Daniel Povey

Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

Multilingual Speech Recognition Using Language-Specific Phoneme Recognition as Auxiliary Task for Indian Languages
Hardik B. Sailor, Thomas Hain

Style Variation as a Vantage Point for Code-Switching
Khyathi Raghavi Chandu, Alan W. Black

Bi-Encoder Transformer Network for Mandarin-English Code-Switching Speech Recognition Using Mixture of Experts
Yizhou Lu, Mingkun Huang, Hao Li, Jiaqi Guo, Yanmin Qian

Improving Low Resource Code-Switched ASR Using Augmented Code-Switched TTS
Yash Sharma, Basil Abraham, Karan Taneja, Preethi Jyothi

Towards Context-Aware End-to-End Code-Switching Speech Recognition
Zimeng Qiu, Yiyuan Li, Xinjian Li, Florian Metze, William M. Campbell


Speech and Voice Disorders


Increasing the Intelligibility and Naturalness of Alaryngeal Speech Using Voice Conversion and Synthetic Fundamental Frequency
Tuan Dinh, Alexander Kain, Robin Samlan, Beiming Cao, Jun Wang

Automatic Assessment of Dysarthric Severity Level Using Audio-Video Cross-Modal Approach in Deep Learning
Han Tong, Hamid Sharifzadeh, Ian McLoughlin

Staged Knowledge Distillation for End-to-End Dysarthric Speech Recognition and Speech Attribute Transcription
Yuqin Lin, Longbiao Wang, Sheng Li, Jianwu Dang, Chenchen Ding

Dysarthric Speech Recognition Based on Deep Metric Learning
Yuki Takashima, Ryoichi Takashima, Tetsuya Takiguchi, Yasuo Ariki

Automatic Glottis Detection and Segmentation in Stroboscopic Videos Using Convolutional Networks
Divya Degala, Achuth Rao M.V., Rahul Krishnamurthy, Pebbili Gopikishore, Veeramani Priyadharshini, Prakash T.K., Prasanta Kumar Ghosh

Acoustic Feature Extraction with Interpretable Deep Neural Network for Neurodegenerative Related Disorder Classification
Yilin Pan, Bahman Mirheidari, Zehai Tu, Ronan O’Malley, Traci Walker, Annalena Venneri, Markus Reuber, Daniel Blackburn, Heidi Christensen

Coswara — A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis
Neeraj Sharma, Prashant Krishnan, Rohit Kumar, Shreyas Ramoji, Srikanth Raj Chetupalli, Nirmala R., Prasanta Kumar Ghosh, Sriram Ganapathy

Acoustic-Based Articulatory Phenotypes of Amyotrophic Lateral Sclerosis and Parkinson’s Disease: Towards an Interpretable, Hypothesis-Driven Framework of Motor Control
Hannah P. Rowe, Sarah E. Gutz, Marc F. Maffei, Jordan R. Green

Recognising Emotions in Dysarthric Speech Using Typical Speech Data
Lubna Alhinti, Stuart Cunningham, Heidi Christensen

Detecting and Analysing Spontaneous Oral Cancer Speech in the Wild
Bence Mark Halpern, Rob van Son, Michiel van den Brekel, Odette Scharenborg


The Zero Resource Speech Challenge 2020


The Zero Resource Speech Challenge 2020: Discovering Discrete Subword and Word Units
Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux

Vector-Quantized Neural Networks for Acoustic Unit Discovery in the ZeroSpeech 2020 Challenge
Benjamin van Niekerk, Leanne Nortje, Herman Kamper

Exploration of End-to-End Synthesisers for Zero Resource Speech Challenge 2020
Karthik Pandia D.S., Anusha Prakash, Mano Ranjith Kumar M., Hema A. Murthy

Vector Quantized Temporally-Aware Correspondence Sparse Autoencoders for Zero-Resource Acoustic Unit Discovery
Batuhan Gundogdu, Bolaji Yusuf, Mansur Yesilbursa, Murat Saraclar

Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis: ZeroSpeech 2020 Challenge
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

Exploring TTS Without T Using Biologically/Psychologically Motivated Neural Network Modules (ZeroSpeech 2020)
Takashi Morita, Hiroki Koda

Cyclic Spectral Modeling for Unsupervised Unit Discovery into Voice Conversion with Excitation and Waveform Modeling
Patrick Lumban Tobing, Tomoki Hayashi, Yi-Chiao Wu, Kazuhiro Kobayashi, Tomoki Toda

Unsupervised Acoustic Unit Representation Learning for Voice Conversion Using WaveNet Auto-Encoders
Mingjie Chen, Thomas Hain

Unsupervised Discovery of Recurring Speech Patterns Using Probabilistic Adaptive Metrics
Okko Räsänen, María Andrea Cruz Blandón

Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery
Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Najim Dehak

Perceptimatic: A Human Speech Perception Benchmark for Unsupervised Subword Modelling
Juliette Millet, Ewan Dunbar


Neural Signals for Spoken Communication


Decoding Imagined, Heard, and Spoken Speech: Classification and Regression of EEG Using a 14-Channel Dry-Contact Mobile Headset
Jonathan Clayton, Scott Wellington, Cassia Valentini-Botinhao, Oliver Watts

Glottal Closure Instants Detection from EGG Signal by Classification Approach
Gurunath Reddy M., K. Sreenivasa Rao, Partha Pratim Das

Classify Imaginary Mandarin Tones with Cortical EEG Signals
Hua Li, Fei Chen


LM Adaptation, Lexical Units and Punctuation


Augmenting Images for ASR and TTS Through Single-Loop and Dual-Loop Multimodal Chain Framework
Johanes Effendi, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

Punctuation Prediction in Spontaneous Conversations: Can We Mitigate ASR Errors with Retrofitted Word Embeddings?
Łukasz Augustyniak, Piotr Szymański, Mikołaj Morzy, Piotr Żelasko, Adrian Szymczak, Jan Mizgajski, Yishay Carmiel, Najim Dehak

Multimodal Semi-Supervised Learning Framework for Punctuation Prediction in Conversational Speech
Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati, Katrin Kirchhoff

Efficient MDI Adaptation for n-Gram Language Models
Ruizhe Huang, Ke Li, Ashish Arora, Daniel Povey, Sanjeev Khudanpur

Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus
Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar

Language Model Data Augmentation Based on Text Domain Transfer
Atsunori Ogawa, Naohiro Tawara, Marc Delcroix

Contemporary Polish Language Model (Version 2) Using Big Data and Sub-Word Approach
Krzysztof Wołk

Improving Speech Recognition of Compound-Rich Languages
Prabhat Pandey, Volker Leutnant, Simon Wiesler, Jahn Heymann, Daniel Willett

Language Modeling for Speech Analytics in Under-Resourced Languages
Simone Wills, Pieter Uys, Charl van Heerden, Etienne Barnard


Speech in Health I


An Early Study on Intelligent Analysis of Speech Under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety
Jing Han, Kun Qian, Meishu Song, Zijiang Yang, Zhao Ren, Shuo Liu, Juan Liu, Huaiyuan Zheng, Wei Ji, Tomoya Koike, Xiao Li, Zixing Zhang, Yoshiharu Yamamoto, Björn W. Schuller

An Evaluation of the Effect of Anxiety on Speech — Computational Prediction of Anxiety from Sustained Vowels
Alice Baird, Nicholas Cummins, Sebastian Schnieder, Jarek Krajewski, Björn W. Schuller

Hybrid Network Feature Extraction for Depression Assessment from Speech
Ziping Zhao, Qifei Li, Nicholas Cummins, Bin Liu, Haishuai Wang, Jianhua Tao, Björn W. Schuller

Improving Detection of Alzheimer’s Disease Using Automatic Speech Recognition to Identify High-Quality Segments for More Robust Feature Extraction
Yilin Pan, Bahman Mirheidari, Markus Reuber, Annalena Venneri, Daniel Blackburn, Heidi Christensen

Classification of Manifest Huntington Disease Using Vowel Distortion Measures
Amrit Romana, John Bandon, Noelle Carlozzi, Angela Roberts, Emily Mower Provost

Parkinson’s Disease Detection from Speech Using Single Frequency Filtering Cepstral Coefficients
Sudarsana Reddy Kadiri, Rashmi Kethireddy, Paavo Alku

Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer
Sebastião Quintas, Julie Mauclair, Virginie Woisard, Julien Pinquier

Spectral Moment and Duration of Burst of Plosives in Speech of Children with Hearing Impairment and Typically Developing Children — A Comparative Study
Ajish K. Abraham, M. Pushpavathi, N. Sreedevi, A. Navya, C.M. Vikram, S.R. Mahadeva Prasanna

Aphasic Speech Recognition Using a Mixture of Speech Intelligibility Experts
Matthew Perez, Zakaria Aldeneh, Emily Mower Provost

Automatic Discrimination of Apraxia of Speech and Dysarthria Using a Minimalistic Set of Handcrafted Features
Ina Kodrasi, Michaela Pernon, Marina Laganaro, Hervé Bourlard


ASR Neural Network Architectures II — Transformers


Weak-Attention Suppression for Transformer Based Speech Recognition
Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer

Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition
Wenyong Huang, Wenchao Hu, Yu Ting Yeung, Xiao Chen

Improving Transformer-Based Speech Recognition with Unsupervised Pre-Training and Multi-Task Semantic Knowledge Learning
Song Li, Lin Li, Qingyang Hong, Lingling Liu

Transformer-Based Long-Context End-to-End Speech Recognition
Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-Based LVCSR
Xinyuan Zhou, Grandee Lee, Emre Yılmaz, Yanhua Long, Jiaen Liang, Haizhou Li

Universal Speech Transformer
Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition
Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen

Cross Attention with Monotonic Alignment for Speech Transformer
Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

Conformer: Convolution-augmented Transformer for Speech Recognition
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang

Exploring Transformers for Large-Scale Speech Recognition
Liang Lu, Changliang Liu, Jinyu Li, Yifan Gong


Spatial Audio


Sparseness-Aware DOA Estimation with Majorization Minimization
Masahito Togami, Robin Scheibler

Spatial Resolution of Early Reflection for Speech and White Noise
Xiaoli Zhong, Hao Song, Xuejie Liu

Effect of Microphone Position Measurement Error on RIR and its Impact on Speech Intelligibility and Quality
Aditya Raikar, Karan Nathwani, Ashish Panda, Sunil Kumar Kopparapu

Online Blind Reverberation Time Estimation Using CRNNs
Shuwen Deng, Wolfgang Mack, Emanuël A.P. Habets

Single-Channel Blind Direct-to-Reverberation Ratio Estimation Using Masking
Wolfgang Mack, Shuwen Deng, Emanuël A.P. Habets

The Importance of Time-Frequency Averaging for Binaural Speaker Localization in Reverberant Environments
Hanan Beit-On, Vladimir Tourbabin, Boaz Rafaely

Acoustic Signal Enhancement Using Relative Harmonic Coefficients: Spherical Harmonics Domain Approach
Yonggang Hu, Prasanga N. Samarasinghe, Thushara D. Abhayapala

Instantaneous Time Delay Estimation of Broadband Signals
B.H.V.S. Narayana Murthy, J.V. Satyanarayana, Nivedita Chennupati, B. Yegnanarayana

U-Net Based Direct-Path Dominance Test for Robust Direction-of-Arrival Estimation
Hao Wang, Kai Chen, Jing Lu

Sound Event Localization and Detection Based on Multiple DOA Beamforming and Multi-Task Learning
Wei Xue, Ying Tong, Chao Zhang, Guohong Ding, Xiaodong He, Bowen Zhou


Keynote 1

ASR Neural Network Architectures I

Multi-Channel Speech Enhancement

Speech Processing in the Brain

Speech Signal Representation

Speech Synthesis: Neural Waveform Generation I

Automatic Speech Recognition for Non-Native Children’s Speech

Speaker Diarization

Noise Robust and Distant Speech Recognition

Speech in Multimodality

Speech, Language, and Multimodal Resources

Language Recognition

Speech Processing and Analysis

Speech Emotion Recognition I

ASR Neural Network Architectures and Training I

Evaluation of Speech Technology Systems and Methods for Resource Construction and Annotation

Phonetics and Phonology

Topics in ASR I

Large-Scale Evaluation of Short-Duration Speaker Verification

Voice Conversion and Adaptation I

Acoustic Event Detection

Spoken Language Understanding I

DNN Architectures for Speaker Recognition

ASR Model Training and Strategies

Speech Annotation and Speech Assessment

Cross/Multi-Lingual and Code-Switched Speech Recognition

Anti-Spoofing and Liveness Detection

Noise Reduction and Intelligibility

Acoustic Scene Classification

Singing Voice Computing and Processing in Music

Acoustic Model Adaptation for ASR

Singing and Multimodal Synthesis

Intelligibility-Enhancing Speech Modification

Human Speech Production I

Targeted Source Separation

Keynote 2

Speech Translation and Multilingual/Multimodal Learning

Speaker Recognition I

Spoken Language Understanding II

Human Speech Processing

Feature Extraction and Distant ASR

Voice Privacy Challenge

Speech Synthesis: Text Processing, Data and Evaluation

Search for Speech Recognition

Computational Paralinguistics I

Acoustic Phonetics and Prosody

Keynote 3

Tonal Aspects of Acoustic Phonetics and Prosody

Speech Classification

Speech Synthesis Paradigms and Methods I

The INTERSPEECH 2020 Computational Paralinguistics ChallengE (ComParE)

Streaming ASR

Alzheimer’s Dementia Recognition Through Spontaneous Speech

Speaker Recognition Challenges and Applications

Applications of ASR

Speech Emotion Recognition II

Bi- and Multilinguality

Single-Channel Speech Enhancement I

Deep Noise Suppression Challenge

Voice and Hearing Disorders

Spoken Term Detection

The Fearless Steps Challenge Phase-02

Monaural Source Separation

Single-Channel Speech Enhancement II

Topics in ASR II

Neural Signals for Spoken Communication

Training Strategies for ASR

Speech Transmission & Coding

Bioacoustics and Articulation

Speech Synthesis: Multilingual and Cross-Lingual Approaches

Learning Techniques for Speaker Recognition I

Pronunciation

Diarization

Computational Paralinguistics II

Speech Synthesis Paradigms and Methods II

Speaker Embedding

Single-Channel Speech Enhancement III

Multi-Channel Audio and Emotion Recognition

Computational Resource Constrained Speech Recognition

Speech Synthesis: Prosody and Emotion

The Interspeech 2020 Far Field Speaker Verification Challenge

Multimodal Speech Processing

Keynote 4

Speech Synthesis: Neural Waveform Generation II

ASR Neural Network Architectures and Training II

Neural Networks for Language Modeling

Phonetic Event Detection and Segmentation

Human Speech Production II

New Trends in Self-Supervised Speech Processing

Learning Techniques for Speaker Recognition II

Spoken Language Evaluation

Spoken Dialogue System

Dereverberation and Echo Cancellation

Speech Synthesis: Toward End-to-End Synthesis

Speech Enhancement, Bandwidth Extension and Hearing Aids

Speech Emotion Recognition III

Acoustic Phonetics of L1-L2 and Other Interactions

Conversational Systems

The Attacker’s Perspective on Automatic Speaker Verification

Summarization, Semantic Analysis and Classification

Speaker Recognition II

General Topics in Speech Recognition

Speech Synthesis: Prosody Modeling

Language Learning

Speech Enhancement

Speech in Health II

Speech and Audio Quality Assessment

Privacy and Security in Speech Communication

Voice Conversion and Adaptation II

Multilingual and Code-Switched ASR

Speech and Voice Disorders

The Zero Resource Speech Challenge 2020

LM Adaptation, Lexical Units and Punctuation

Speech in Health I

ASR Neural Network Architectures II — Transformers

Spatial Audio