
Interspeech 2021

Brno, Czechia
30 August - 3 September 2021

General Chairs: Hynek Heřmanský, Honza Černocký; Technical Chairs: Lukáš Burget, Lori Lamel, Odette Scharenborg, Petr Motlicek
doi: 10.21437/Interspeech.2021


Speech Synthesis: Toward End-to-End Synthesis II


TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions
Cheng Gong, Longbiao Wang, Ju Zhang, Shaotong Guo, Yuguang Wang, Jianwu Dang

FastPitchFormant: Source-Filter Based Decomposed Modeling for Speech Synthesis
Taejun Bak, Jae-Sung Bae, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho

Sequence-to-Sequence Learning for Deep Gaussian Process Based Speech Synthesis Using Self-Attention GP Layer
Taiki Nakamura, Tomoki Koriyama, Hiroshi Saruwatari

Phonetic and Prosodic Information Estimation from Texts for Genuine Japanese End-to-End Text-to-Speech
Naoto Kakegawa, Sunao Hara, Masanobu Abe, Yusuke Ijima

Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis
Xudong Dai, Cheng Gong, Longbiao Wang, Kaili Zhang

Deliberation-Based Multi-Pass Speech Synthesis
Qingyun Dou, Xixin Wu, Moquan Wan, Yiting Lu, Mark J.F. Gales

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, R.J. Skerry-Ryan, Yonghui Wu

Transformer-Based Acoustic Modeling for Streaming Speech Synthesis
Chunyang Wu, Zhiping Xiu, Yangyang Shi, Ozlem Kalinli, Christian Fuegen, Thilo Koehler, Qing He

PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu

Speed up Training with Variable Length Inputs by Efficient Batching Strategies
Zhenhao Ge, Lakshmish Kaushik, Masanori Omote, Saket Kumar


Speech Enhancement and Intelligibility


Funnel Deep Complex U-Net for Phase-Aware Speech Enhancement
Yuhang Sun, Linju Yang, Huifeng Zhu, Jie Hao

Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement
Qiquan Zhang, Qi Song, Aaron Nicolson, Tian Lan, Haizhou Li

Perceptual Contributions of Vowels and Consonant-Vowel Transitions in Understanding Time-Compressed Mandarin Sentences
Changjie Pan, Feng Yang, Fei Chen

Transfer Learning for Speech Intelligibility Improvement in Noisy Environments
Ritujoy Biswas, Karan Nathwani, Vinayak Abrol

Comparison of Remote Experiments Using Crowdsourcing and Laboratory Experiments on Speech Intelligibility
Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

Know Your Enemy, Know Yourself: A Unified Two-Stage Framework for Speech Enhancement
Wenzhe Liu, Andong Li, Yuxuan Ke, Chengshi Zheng, Xiaodong Li

Speech Enhancement with Weakly Labelled Data from AudioSet
Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang

Improving Perceptual Quality by Phone-Fortified Perceptual Loss Using Wasserstein Distance for Speech Enhancement
Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao

MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement
Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao

A Spectro-Temporal Glimpsing Index (STGI) for Speech Intelligibility Prediction
Amin Edraki, Wai-Yip Chan, Jesper Jensen, Daniel Fogerty

Self-Supervised Learning Based Phone-Fortified Speech Enhancement
Yuanhang Qiu, Ruili Wang, Satwinder Singh, Zhizhong Ma, Feng Hou

Incorporating Embedding Vectors from a Human Mean-Opinion Score Prediction Model for Monaural Speech Enhancement
Khandokar Md. Nayem, Donald S. Williamson

Restoring Degraded Speech via a Modified Diffusion Model
Jianwei Zhang, Suren Jayasuriya, Visar Berisha


Topics in ASR: Robustness, Feature Extraction, and Far-Field ASR


End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition
Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Xuefei Liu, Zhengqi Wen

Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties
Kathleen Siminyu, Xinjian Li, Antonios Anastasopoulos, David R. Mortensen, Michael R. Marlo, Graham Neubig

Speech Acoustic Modelling Using Raw Source and Filter Components
Erfan Loweimi, Zoran Cvetkovic, Peter Bell, Steve Renals

Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture
Masakiyo Fujimoto, Hisashi Kawai

IR-GAN: Room Impulse Response Generator for Far-Field Speech Recognition
Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha

Scaling Sparsemax Based Channel Selection for Speech Recognition with ad-hoc Microphone Arrays
Junqi Chen, Xiao-Lei Zhang

Multi-Channel Transformer Transducer for Speech Recognition
Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo

Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios
Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe

Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition
Guodong Ma, Pengfei Hu, Jian Kang, Shen Huang, Hao Huang

Rethinking Evaluation in ASR: Are Our Models Robust Enough?
Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve

Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition
Max W.Y. Lam, Jun Wang, Chao Weng, Dan Su, Dong Yu


Voice Activity Detection and Keyword Spotting


Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams
Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren

Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection
Ui-Hyun Kim

Noisy Student-Teacher Training for Robust Keyword Spotting
Hyun-Jin Park, Pai Zhu, Ignacio Lopez Moreno, Niranjan Subrahmanya

Multi-Channel VAD for Transcription of Group Discussion
Osamu Ichikawa, Kaito Nakano, Takahiro Nakayama, Hajime Shirouzu

Audio-Visual Information Fusion Using Cross-Modal Teacher-Student Learning for Voice Activity Detection in Realistic Environments
Hengshun Zhou, Jun Du, Hang Chen, Zijun Jing, Shifu Xiong, Chin-Hui Lee

Enrollment-Less Training for Personalized Voice Activity Detection
Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

Voice Activity Detection for Live Speech of Baseball Game Based on Tandem Connection with Speech/Noise Separation Model
Yuto Nonaka, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki

FastICARL: Fast Incremental Classifier and Representation Learning with Efficient Budget Allocation in Audio Sensing Applications
Young D. Kwon, Jagmohan Chauhan, Cecilia Mascolo

End-to-End Transformer-Based Open-Vocabulary Keyword Spotting with Location-Guided Local Attention
Bo Wei, Meirong Yang, Tao Zhang, Xiao Tang, Xing Huang, Kyuhong Kim, Jaeyun Lee, Kiho Cho, Sung-Un Park

Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation
Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

A Lightweight Framework for Online Voice Activity Detection in the Wild
Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu


The INTERSPEECH 2021 Computational Paralinguistics Challenge (ComParE) — COVID-19 Cough, COVID-19 Speech, Escalation & Primates


The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates
Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Leon J.M. Rothkrantz, Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp

Transfer Learning-Based Cough Representations for Automatic Detection of COVID-19
Rubén Solera-Ureña, Catarina Botelho, Francisco Teixeira, Thomas Rolland, Alberto Abad, Isabel Trancoso

The Phonetic Footprint of Covid-19?
P. Klumpp, T. Bocklet, T. Arias-Vergara, J.C. Vásquez-Correa, P.A. Pérez-Toro, S.P. Bayerl, J.R. Orozco-Arroyave, Elmar Nöth

Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021
Edresson Casanova, Arnaldo Candido Jr., Ricardo Corso Fernandes Jr., Marcelo Finger, Lucas Rafael Stefanel Gris, Moacir Antonelli Ponti, Daniel Peixoto Pinto da Silva

Visual Transformers for Primates Classification and Covid Detection
Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien

Deep-Learning-Based Central African Primate Species Classification with MixUp and SpecAugment
Thomas Pellegrini

A Deep and Recurrent Architecture for Primate Vocalization Classification
Robert Müller, Steffen Illium, Claudia Linnhoff-Popien

Introducing a Central African Primate Vocalisation Dataset for Automated Species Classification
Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp, Floor Meewis, Amparo C. Koot, Heysem Kaya

Multi-Attentive Detection of the Spider Monkey Whinny in the (Actual) Wild
Georgios Rizos, Jenna Lawson, Zhuoda Han, Duncan Butler, James Rosindell, Krystian Mikolajczyk, Cristina Banks-Leite, Björn W. Schuller

Identifying Conflict Escalation and Primates by Using Ensemble X-Vectors and Fisher Vector Features
José Vicente Egas-López, Mercedes Vetráb, László Tóth, Gábor Gosztolya

Ensemble-Within-Ensemble Classification for Escalation Prediction from Speech
Oxana Verkholyak, Denis Dresvyanskiy, Anastasia Dvoynikova, Denis Kotov, Elena Ryumina, Alena Velichko, Danila Mamontov, Wolfgang Minker, Alexey Karpov

Analysis by Synthesis: Using an Expressive TTS Model as Feature Extractor for Paralinguistic Speech Classification
Dominik Schiller, Silvan Mertes, Pol van Rijn, Elisabeth André


Acoustic Event Detection and Acoustic Scene Classification


SpecMix: A Mixed Sample Data Augmentation Method for Training with Time-Frequency Domain Features
Gwantae Kim, David K. Han, Hanseok Ko

SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification
Helin Wang, Yuexian Zou, Wenwu Wang

An Effective Mutual Mean Teaching Based Domain Adaptation Method for Sound Event Detection
Xu Zheng, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu

Acoustic Scene Classification Using Kervolution-Based SubSpectralNet
Ritika Nandi, Shashank Shekhar, Manjunath Mulimani

Event Specific Attention for Polyphonic Sound Event Detection
Harshavardhan Sundar, Ming Sun, Chao Wang

AST: Audio Spectrogram Transformer
Yuan Gong, Yu-An Chung, James Glass

Shallow Convolution-Augmented Transformer with Differentiable Neural Computer for Low-Complexity Classification of Variable-Length Acoustic Scene
Soonshin Seo, Donghyun Lee, Ji-Hwan Kim

An Evaluation of Data Augmentation Methods for Sound Scene Geotagging
Helen L. Bear, Veronica Morfi, Emmanouil Benetos

Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers
Chiori Hori, Takaaki Hori, Jonathan Le Roux

Variational Information Bottleneck for Effective Low-Resource Audio Classification
Shijing Si, Jianzong Wang, Huiming Sun, Jianhan Wu, Chuanyao Zhang, Xiaoyang Qu, Ning Cheng, Lei Chen, Jing Xiao

Improving Weakly Supervised Sound Event Detection with Self-Supervised Auxiliary Tasks
Soham Deshmukh, Bhiksha Raj, Rita Singh

Acoustic Event Detection with Classifier Chains
Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi


Diverse Modes of Speech Acquisition and Processing


Segment and Tone Production in Continuous Speech of Hearing and Hearing-Impaired Children
Shu-Chuan Tseng, Yi-Fen Liu

Effect of Carrier Bandwidth on Understanding Mandarin Sentences in Simulated Electric-Acoustic Hearing
Feng Wang, Jing Chen, Fei Chen

A Comparative Study of Different EMG Features for Acoustics-to-EMG Mapping
Manthan Sharma, Navaneetha Gaddam, Tejas Umesh, Aditya Murthy, Prasanta Kumar Ghosh

Image-Based Assessment of Jaw Parameters and Jaw Kinematics for Articulatory Simulation: Preliminary Results
Ajish K. Abraham, V. Sivaramakrishnan, N. Swapna, N. Manohar

An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech
Jianrong Wang, Nan Gu, Mei Yu, Xuewei Li, Qiang Fang, Li Liu

Remote Smartphone-Based Speech Collection: Acceptance and Barriers in Individuals with Major Depressive Disorder
Judith Dineley, Grace Lavelle, Daniel Leightley, Faith Matcham, Sara Siddi, Maria Teresa Peñarrubia-María, Katie M. White, Alina Ivan, Carolin Oetzmann, Sara Simblett, Erin Dawe-Lane, Stuart Bruce, Daniel Stahl, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Amos A. Folarin, Josep Maria Haro, Til Wykes, Richard J.B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Björn W. Schuller, Nicholas Cummins, The RADAR-CNS Consortium

An Automatic, Simple Ultrasound Biofeedback Parameter for Distinguishing Accurate and Misarticulated Rhotic Syllables
Sarah R. Li, Colin T. Annand, Sarah Dugan, Sarah M. Schwab, Kathryn J. Eary, Michael Swearengen, Sarah Stack, Suzanne Boyce, Michael A. Riley, T. Douglas Mast

Silent versus Modal Multi-Speaker Speech Recognition from Ultrasound and Video
Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

RaSSpeR: Radar-Based Silent Speech Recognition
David Ferreira, Samuel Silva, Francisco Curado, António Teixeira

Investigating Speech Reconstruction for Laryngectomees for Silent Speech Interfaces
Beiming Cao, Nordine Sebkhi, Arpan Bhavsar, Omer T. Inan, Robin Samlan, Ted Mau, Jun Wang


Self-Supervision and Semi-Supervision for Neural ASR Training


Improving Streaming Transformer Based ASR Under a Framework of Self-Supervised Learning
Songjun Cao, Yueteng Kang, Yanzhe Fu, Xiaoshuo Xu, Sining Sun, Yike Zhang, Long Ma

wav2vec-C: A Self-Supervised Model for Speech Representation Learning
Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

On the Learning Dynamics of Semi-Supervised Training for ASR
Electra Wallington, Benji Kershenbaum, Ondřej Klejch, Peter Bell

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition
Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori

A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models
Ananya Misra, Dongseong Hwang, Zhouyuan Huo, Shefali Garg, Nikhil Siddhartha, Arun Narayanan, Khe Chai Sim

Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation
Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Heiga Zen, Mohammadreza Ghodsi, Yinghui Huang, Jesse Emond, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno

slimIPL: Language-Model-Free Iterative Pseudo-Labeling
Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert

Phonetically Motivated Self-Supervised Speech Representation Learning
Xianghu Yue, Haizhou Li

Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS
Yan Deng, Rui Zhao, Zhong Meng, Xie Chen, Bing Liu, Jinyu Li, Yifan Gong, Lei He


The First DiCOVA Challenge: Diagnosis of COVID-19 Using Acoustics


DiCOVA Challenge: Dataset, Task, and Baseline System for COVID-19 Diagnosis Using Acoustics
Ananya Muguli, Lancelot Pinto, Nirmala R, Neeraj Sharma, Prashant Krishnan, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli, Sriram Ganapathy, Shreyas Ramoji, Viral Nanda

PANACEA Cough Sound-Based Diagnosis of COVID-19 for the DiCOVA 2021 Challenge
Madhu R. Kamble, Jose A. Gonzalez-Lopez, Teresa Grau, Juan M. Espin, Lorenzo Cascioli, Yiqing Huang, Alejandro Gomez-Alanis, Jose Patino, Roberto Font, Antonio M. Peinado, Angel M. Gomez, Nicholas Evans, Maria A. Zuluaga, Massimiliano Todisco

Recognising Covid-19 from Coughing Using Ensembles of SVMs and LSTMs with Handcrafted and Deep Audio Features
Vincent Karas, Björn W. Schuller

Detecting COVID-19 from Audio Recording of Coughs Using Random Forests and Support Vector Machines
Isabella Södergren, Maryam Pahlavan Nodeh, Prakash Chandra Chhipa, Konstantina Nikolaidou, György Kovács

Diagnosis of COVID-19 Using Auditory Acoustic Cues
Rohan Kumar Das, Maulik Madhavi, Haizhou Li

Classification of COVID-19 from Cough Using Autoregressive Predictive Coding Pretraining and Spectral Data Augmentation
John Harvill, Yash R. Wani, Mark Hasegawa-Johnson, Narendra Ahuja, David Beiser, David Chestek

The DiCOVA 2021 Challenge — An Encoder-Decoder Approach for COVID-19 Recognition from Coughing Audio
Gauri Deshpande, Björn W. Schuller

COVID-19 Detection from Spectral Features on the DiCOVA Dataset
Kotra Venkata Sai Ritwik, Shareef Babu Kalluri, Deepu Vijayasenan

Cough-Based COVID-19 Detection with Contextual Attention Convolutional Neural Networks and Gender Information
Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller

Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis
Swapnil Bhosale, Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu

Investigating Feature Selection and Explainability for COVID-19 Diagnostics from Cough Sounds
Flavio Avila, Amir H. Poorjam, Deepak Mittal, Charles Dognin, Ananya Muguli, Rohit Kumar, Srikanth Raj Chetupalli, Sriram Ganapathy, Maneesh Singh


Robust Speaker Recognition


Unsupervised Bayesian Adaptation of PLDA for Speaker Verification
Bengt J. Borgström

The DKU-Duke-Lenovo System Description for the Fearless Steps Challenge Phase III
Weiqing Wang, Danwei Cai, Jin Wang, Qingjian Lin, Xuyang Wang, Mi Hong, Ming Li

Improved Meta-Learning Training for Speaker Verification
Yafeng Chen, Wu Guo, Bin Gu

Variational Information Bottleneck Based Regularization for Speaker Recognition
Dan Wang, Yuanjie Dong, Yaxing Li, Yunfei Zi, Zhihui Zhang, Xiaoqi Li, Shengwu Xiong

Out of a Hundred Trials, How Many Errors Does Your Speaker Verifier Make?
Niko Brümmer, Luciana Ferrer, Albert Swart

SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System
Roza Chojnacka, Jason Pelecanos, Quan Wang, Ignacio Lopez Moreno

AntVoice Neural Speaker Embedding System for FFSVC 2020
Zhiming Wang, Furong Xu, Kaisheng Yao, Yuan Cheng, Tao Xiong, Huijia Zhu

Gradient Regularization for Noise-Robust Speaker Verification
Jianchen Li, Jiqing Han, Hongwei Song

Deep Feature CycleGANs: Speaker Identity Preserving Non-Parallel Microphone-Telephone Domain Adaptation for Speaker Verification
Saurabh Kataria, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

Scaling Effect of Self-Supervised Speech Models
Jie Pu, Yuguang Yang, Ruirui Li, Oguz Elibol, Jasha Droppo

Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network
Yibo Wu, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang

Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification
Li Zhang, Qing Wang, Kong Aik Lee, Lei Xie, Haizhou Li

Speaker Anonymisation Using the McAdams Coefficient
Jose Patino, Natalia Tomashenko, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans


Source Separation, Dereverberation and Echo Cancellation


Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments
Yiyu Luo, Jing Wang, Liang Xu, Lidong Yang

TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation
Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu

Residual Echo and Noise Cancellation with Feature Attention Module and Multi-Domain Loss Function
Jianjun Gu, Longbiao Cheng, Xingwei Sun, Junfeng Li, Yonghong Yan

MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation
Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, Dong Yu

Personalized PercepNet: Real-Time, Low-Complexity Target Voice Separation and Enhancement
Ritwik Giri, Shrikant Venkataramani, Jean-Marc Valin, Umut Isik, Arvindh Krishnaswamy

Scene-Agnostic Multi-Microphone Speech Dereverberation
Yochai Yemini, Ethan Fetaya, Haggai Maron, Sharon Gannot

Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex
Keitaro Tanaka, Ryosuke Sawata, Shusuke Takahashi

A Deep Learning Approach to Multi-Channel and Multi-Microphone Acoustic Echo Cancellation
Hao Zhang, DeLiang Wang

Joint Online Multichannel Acoustic Echo Cancellation, Speech Dereverberation and Source Separation
Yueyue Na, Ziteng Wang, Zhang Liu, Biao Tian, Qiang Fu

Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition
Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo


Speech Signal Analysis and Representation I


Estimating Articulatory Movements in Speech Production with Transformer Networks
Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

Unsupervised Multi-Target Domain Adaptation for Acoustic Scene Classification
Dongchao Yang, Helin Wang, Yuexian Zou

Speech Decomposition Based on a Hybrid Speech Model and Optimal Segmentation
Alfredo Esquivel Jaramillo, Jesper Kjær Nielsen, Mads Græsbøll Christensen

Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation
Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao

Noise Robust Pitch Stylization Using Minimum Mean Absolute Error Criterion
Chiranjeevi Yarra, Prasanta Kumar Ghosh

An Attribute-Aligned Strategy for Learning Speech Representation
Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee

Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation
Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Torbjørn Svendsen

Unsupervised Training of a DNN-Based Formant Tracker
Jason Lilley, H. Timothy Bunnell

SUPERB: Speech Processing Universal PERformance Benchmark
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

Synchronising Speech Segments with Musical Beats in Mandarin and English Singing
Cong Zhang, Jian Zhu

FRILL: A Non-Semantic Speech Embedding for Mobile Devices
Jacob Peplinski, Joel Shor, Sachin Joglekar, Jake Garrison, Shwetak Patel

Pitch Contour Separation from Overlapping Speech
Hiroki Mori

Do Sound Event Representations Generalize to Other Audio Tasks? A Case Study in Audio Transfer Learning
Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen


Spoken Language Understanding I


Data Augmentation for Spoken Language Understanding via Pretrained Language Models
Baolin Peng, Chenguang Zhu, Michael Zeng, Jianfeng Gao

FANS: Fusing ASR and NLU for On-Device SLU
Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow

Sequential End-to-End Intent and Slot Label Classification and Localization
Yiran Cao, Nihal Potdar, Anderson R. Avila

DEXTER: Deep Encoding of External Knowledge for Named Entity Recognition in Virtual Assistants
Deepak Muralidharan, Joel Ruben Antony Moniz, Weicheng Zhang, Stephen Pulman, Lin Li, Megan Barnes, Jingjing Pan, Jason Williams, Alex Acero

A Context-Aware Hierarchical BERT Fusion Network for Multi-Turn Dialog Act Detection
Ting-Wei Wu, Ruolin Su, Biing-Hwang Juang

Pre-Training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning
Qian Chen, Wen Wang, Qinglin Zhang

Predicting Temporal Performance Drop of Deployed Production Spoken Language Understanding Models
Quynh Do, Judith Gaspers, Daniil Sorokin, Patrick Lehnen

Integrating Dialog History into End-to-End Spoken Language Understanding Systems
Jatin Ganhotra, Samuel Thomas, Hong-Kwang J. Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury

Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking
Ting Han, Chongxuan Huang, Wei Peng

Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding
Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia, Florian Metze, Shinji Watanabe, Alan W. Black


Topics in ASR: Adaptation, Transfer Learning, Children’s Speech, and Low-Resource Settings


Semantic Data Augmentation for End-to-End Mandarin Speech Recognition
Jianwei Sun, Zhiyuan Tang, Hengxin Yin, Wei Wang, Xi Zhao, Shuaijiang Zhao, Xiaoning Lei, Wei Zou, Xiangang Li

Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition
Xun Gong, Yizhou Lu, Zhikai Zhou, Yanmin Qian

Low Resource German ASR with Untranscribed Data Spoken by Non-Native Children — INTERSPEECH 2021 Shared Task SPAPL System
Jinhan Wang, Yunzheng Zhu, Ruchao Fan, Wei Chu, Abeer Alwan

Robust Continuous On-Device Personalization for Automatic Speech Recognition
Khe Chai Sim, Angad Chandorkar, Fan Gao, Mason Chua, Tsendsuren Munkhdalai, Françoise Beaufays

Speaker Normalization Using Joint Variational Autoencoder
Shashi Kumar, Shakti P. Rath, Abhishek Pandey

The TAL System for the INTERSPEECH 2021 Shared Task on Automatic Speech Recognition for Non-Native Children's Speech
Gaopeng Xu, Song Yang, Lu Ma, Chengfei Li, Zhongqin Wu

On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR
Tsz Kin Lam, Mayumi Ohta, Shigehiko Schamoni, Stefan Riezler

Zero-Shot Cross-Lingual Phonetic Recognition with External Language Embedding
Heting Gao, Junrui Ni, Yang Zhang, Kaizhi Qian, Shiyu Chang, Mark Hasegawa-Johnson

Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias Are All You Need
Yan Huang, Guoli Ye, Jinyu Li, Yifan Gong

Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning
Nilaksh Das, Sravan Bodapati, Monica Sunkara, Sundararajan Srinivasan, Duen Horng Chau

Extending Pronunciation Dictionary with Automatically Detected Word Mispronunciations to Improve PAII’s System for Interspeech 2021 Non-Native Child English Close Track ASR Challenge
Wei Chu, Peng Chang, Jing Xiao


Voice Conversion and Adaptation I


CVC: Contrastive Learning for Non-Parallel Voice Conversion
Tingle Li, Yichen Liu, Chenxu Hu, Hang Zhao

A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion
Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda

One-Shot Voice Conversion with Speaker-Agnostic StarGAN
Sefik Emre Eskimez, Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr

Fine-Tuning Pre-Trained Voice Conversion Model for Adding New Target Speakers with Limited Data
Takeshi Koshizuka, Hidefumi Ohmura, Kouichi Katsurada

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion
Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion
Yinghao Aaron Li, Ali Zare, Nima Mesgarani

Normalization Driven Zero-Shot Multi-Speaker Speech Synthesis
Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall

StarGAN-VC+ASR: StarGAN-Based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition
Shoki Sakamoto, Akira Taniguchi, Tadahiro Taniguchi, Hirokazu Kameoka

Two-Pathway Style Embedding for Arbitrary Voice Conversion
Xuexin Xu, Liang Shi, Jinhui Chen, Xunquan Chen, Jie Lian, Pingyuan Lin, Zhihong Zhang, Edwin R. Hancock

Non-Parallel Any-to-Many Voice Conversion by Replacing Speaker Statistics
Yufei Liu, Chengzhu Yu, Wang Shuai, Zhenchuan Yang, Yang Chao, Weibin Zhang

Cross-Lingual Voice Conversion with a Cycle Consistency Loss on Linguistic Representation
Yi Zhou, Xiaohai Tian, Zhizheng Wu, Haizhou Li

Improving Robustness of One-Shot Voice Conversion with Deep Discriminative Speaker Encoder
Hongqiang Du, Lei Xie


Low-Resource Speech Recognition


Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration
Shreya Khare, Ashish Mittal, Anuj Diwan, Sunita Sarawagi, Preethi Jyothi, Samarth Bharadwaj

Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation
Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Odette Scharenborg

Towards Unsupervised Phone and Word Segmentation Using Self-Supervised Vector-Quantized Neural Networks
Herman Kamper, Benjamin van Niekerk

Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-Supervised Speech Representation Learning
Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, Xiangang Li

Multilingual Transfer of Acoustic Word Embeddings Improves When Training on Languages Related to the Target Zero-Resource Language
Christiaan Jacobs, Herman Kamper

Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing
Benjamin van Niekerk, Leanne Nortje, Matthew Baas, Herman Kamper

Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages
Shun Takahashi, Sakriani Sakti, Satoshi Nakamura

Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021
Takashi Maekaku, Xuankai Chang, Yuya Fujita, Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

Identifying Indicators of Vulnerability from Short Speech Segments Using Acoustic and Textual Features
Xia Cui, Amila Gamage, Terry Hanley, Tingting Mu

The Zero Resource Speech Challenge 2021: Spoken Language Modelling
Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, Emmanuel Dupoux

Zero-Shot Federated Learning with New Classes for Audio Classification
Gautham Krishna Gudur, Satheesh Kumar Perepu

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass


Speech Synthesis: Singing, Multimodal, Crosslingual Synthesis


N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement
Gyeong-Hoon Lee, Tae-Woo Kim, Hanbin Bae, Min-Ji Lee, Young-Ik Kim, Hoon-Young Cho

Cross-Lingual Low Resource Speaker Adaptation Using Phonological Features
Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information
Haoyue Zhan, Haitong Zhang, Wenjie Ou, Yue Lin

Cross-Lingual Voice Conversion with Disentangled Universal Linguistic Representations
Zhenchuan Yang, Weibin Zhang, Yufei Liu, Xiaofen Xing

EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder
Zhengchen Liu, Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao

Cross-Lingual Speaker Adaptation Using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis
Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

Incorporating Cross-Speaker Style Transfer for Multi-Language Text-to-Speech
Zengqiang Shang, Zhihua Huang, Haozhe Zhang, Pengyuan Zhang, Yonghong Yan

Investigating Contributions of Speech and Facial Landmarks for Talking Head Generation
Ege Kesim, Engin Erzin

Speech2Video: Cross-Modal Distillation for Speech to Video Generation
Shijing Si, Jianzong Wang, Xiaoyang Qu, Ning Cheng, Wenqi Wei, Xinghua Zhu, Jing Xiao


Speech Coding and Privacy


NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling
Junhyeok Lee, Seungu Han

QISTA-Net-Audio: Audio Super-Resolution via Non-Convex ℓ_q-Norm Minimization
Gang-Xuan Lin, Shih-Wei Hu, Yen-Ju Lu, Yu Tsao, Chun-Shien Lu

X-net: A Joint Scale Down and Scale Up Method for Voice Call
Liang Wen, Lizhong Wang, Xue Wen, Yuxing Zheng, Youngo Park, Kwang Pyo Choi

WSRGlow: A Glow-Based Waveform Generative Model for Audio Super-Resolution
Kexun Zhang, Yi Ren, Changliang Xu, Zhou Zhao

Half-Truth: A Partially Fake Audio Detection Dataset
Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu

Data Quality as Predictor of Voice Anti-Spoofing Generalization
Bhusan Chettri, Rosa González Hautamäki, Md. Sahidullah, Tomi Kinnunen

Coded Speech Enhancement Using Neural Network-Based Vector-Quantized Residual Features
Youngju Cheon, Soojoong Hwang, Sangwook Han, Inseon Jang, Jong Won Shin

Multi-Channel Opus Compression for Far-Field Automatic Speech Recognition with a Fixed Bitrate Budget
Lukas Drude, Jahn Heymann, Andreas Schwarz, Jean-Marc Valin

Effects of Prosodic Variations on Accidental Triggers of a Commercial Voice Assistant
Ingo Siegert

Improving the Expressiveness of Neural Vocoding with Non-Affine Normalizing Flows
Adam Gabryś, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote

Voice Privacy Through x-Vector and CycleGAN-Based Anonymization
Gauri P. Prajapati, Dipesh K. Singh, Preet P. Amin, Hemant A. Patil

A Two-Stage Approach to Speech Bandwidth Extension
Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, Christian Fuegen

Development of a Psychoacoustic Loss Function for the Deep Neural Network (DNN)-Based Speech Coder
Joon Byun, Seungmin Shin, Youngcheol Park, Jongmo Sung, Seungkwon Beack

Protecting Gender and Identity with Disentangled Speech Representations
Dimitrios Stoidis, Andrea Cavallaro


Speech Perception II


Perception of Standard Arabic Synthetic Speech Rate
Yahya Aldholmi, Rawan Aldhafyan, Asma Alqahtani

The Influence of Parallel Processing on Illusory Vowels
Takeshi Kishiyama

Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors
Anupama Chingacham, Vera Demberg, Dietrich Klakow

SpeechAdjuster: A Tool for Investigating Listener Preferences and Speech Intelligibility
Olympia Simantiraki, Martin Cooke

VocalTurk: Exploring Feasibility of Crowdsourced Speaker Identification
Susumu Saito, Yuta Ide, Teppei Nakano, Tetsuji Ogawa

Effects of Aging and Age-Related Hearing Loss on Talker Discrimination
Min Xu, Jing Shao, Lan Wang

Relationships Between Perceptual Distinctiveness, Articulatory Complexity and Functional Load in Speech Communication
Yuqing Zhang, Zhu Li, Bin Wu, Yanlu Xie, Binghuai Lin, Jinsong Zhang

Human Spoofing Detection Performance on Degraded Speech
Camryn Terblanche, Philip Harrison, Amelia J. Gully

Reliable Estimates of Interpretable Cue Effects with Active Learning in Psycholinguistic Research
Marieke Einfeldt, Rita Sevastjanova, Katharina Zahner-Ritter, Ekaterina Kazak, Bettina Braun

Towards the Explainability of Multimodal Speech Emotion Recognition
Puneet Kumar, Vishesh Kaushik, Balasubramanian Raman

Primacy of Mouth over Eyes: Eye Movement Evidence from Audiovisual Mandarin Lexical Tones and Vowels
Biao Zeng, Rui Wang, Guoxing Yu, Christian Dobel

Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance
Takanori Ashihara, Takafumi Moriya, Makio Kashino


Streaming for ASR/RNN Transducers


Super-Human Performance in Online Low-Latency Recognition of Conversational Speech
Thai-Son Nguyen, Sebastian Stüker, Alex Waibel

Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems
Vikas Joshi, Amit Das, Eric Sun, Rupesh R. Mehta, Jinyu Li, Yifan Gong

Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion
Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer

An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Ruoming Pang, David Rybach, Cyril Allauzen, Ehsan Variani, James Qin, Quoc-Nam Le-The, Shuo-Yiin Chang, Bo Li, Anmol Gulati, Jiahui Yu, Chung-Cheng Chiu, Diamantino Caseiro, Wei Li, Qiao Liang, Pat Rondon

Streaming Multi-Talker Speech Recognition with Joint Speaker Identification
Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong

Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture
Takafumi Moriya, Tomohiro Tanaka, Takanori Ashihara, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Ryo Masumura, Marc Delcroix, Taichi Asami

Improving RNN-T ASR Accuracy Using Context Audio
Andreas Schwarz, Ilya Sklyar, Simon Wiesler

HMM-Free Encoder Pre-Training for Streaming RNN Transducer
Lu Huang, Jingyu Sun, Yufeng Tang, Junfeng Hou, Jinkun Chen, Jun Zhang, Zejun Ma

Reducing Exposure Bias in Training Recurrent Neural Network Transducers
Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltán Tüske

Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models
Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao

Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition
Kartik Audhkhasi, Tongzhou Chen, Bhuvana Ramabhadran, Pedro J. Moreno

StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR
Hirofumi Inaguma, Tatsuya Kawahara

Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition
Niko Moritz, Takaaki Hori, Jonathan Le Roux

Multi-Mode Transformer Transducer with Stochastic Future Context
Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu J. Han, Shinji Watanabe


Communication and Interaction, Multimodality


Cross-Modal Learning for Audio-Visual Video Parsing
Jatin Lamba, Abhishek, Jayaprakash Akula, Rishabh Dabral, Preethi Jyothi, Ganesh Ramakrishnan

A Psychology-Driven Computational Analysis of Political Interviews
Darren Cook, Miri Zilka, Simon Maskell, Laurence Alison

Speech Emotion Recognition Based on Attention Weight Correction Using Word-Level Confidence Measure
Jennifer Santoso, Takeshi Yamada, Shoji Makino, Kenkichi Ishizuka, Takekatsu Hiramura

Effects of Voice Type and Task on L2 Learners’ Awareness of Pronunciation Errors
Alif Silpachai, Ivana Rehman, Taylor Anne Barriuso, John Levis, Evgeny Chukharev-Hudilainen, Guanlong Zhao, Ricardo Gutierrez-Osuna

Lexical Entrainment and Intra-Speaker Variability in Cooperative Dialogues
Alla Menshikova, Daniil Kocharov, Tatiana Kachkovskaia

Detecting Alzheimer’s Disease Using Interactional and Acoustic Features from Spontaneous Speech
Shamila Nasreen, Julian Hough, Matthew Purver

Investigating the Interplay Between Affective, Phonatory and Motoric Subsystems in Autism Spectrum Disorder Using a Multimodal Dialogue Agent
Hardik Kothare, Vikram Ramanarayanan, Oliver Roesler, Michael Neumann, Jackson Liscombe, William Burke, Andrew Cornish, Doug Habberstad, Alaa Sakallah, Sara Markuson, Seemran Kansara, Afik Faerman, Yasmine Bensidi-Slimane, Laura Fry, Saige Portera, David Suendermann-Oeft, David Pautler, Carly Demopoulos

Analysis of Eye Gaze Reasons and Gaze Aversions During Three-Party Conversations
Carlos Toshinori Ishi, Taiken Shintani


Language and Lexical Modeling for ASR


Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding
Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

A Light-Weight Contextual Spelling Correction Model for Customizing Transducer-Based Speech Recognition Systems
Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li

Incorporating External POS Tagger for Punctuation Restoration
Ning Shi, Wei Wang, Boxin Wang, Jinfeng Li, Xiangyu Liu, Zhouhan Lin

Phonetically Induced Subwords for End-to-End Speech Recognition
Vasileios Papadourakis, Markus Müller, Jing Liu, Athanasios Mouchtaris, Maurizio Omologo

Revisiting Parity of Human vs. Machine Conversational Speech Transcription
Courtney Mansfield, Sara Ng, Gina-Anne Levow, Richard A. Wright, Mari Ostendorf

Lookup-Table Recurrent Language Models for Long Tail Speech Recognition
W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman

Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems
Jesús Andrés-Ferrer, Dario Albesano, Puming Zhan, Paul Vozila

Token-Level Supervised Contrastive Learning for Punctuation Restoration
Qiushi Huang, Tom Ko, H. Lilian Tang, Xubo Liu, Bo Wu

BART Based Semantic Correction for Mandarin Automatic Speech Recognition System
Yun Zhao, Xuerui Yang, Jinchao Wang, Yongyu Gao, Chao Yan, Yuanfu Zhou

Class-Based Neural Network Language Model for Second-Pass Rescoring in ASR
Lingfeng Dai, Qi Liu, Kai Yu

Improving Customization of Neural Transducers by Mitigating Acoustic Mismatch of Synthesized Audio
Gakuto Kurata, George Saon, Brian Kingsbury, David Haws, Zoltán Tüske

A Discriminative Entity-Aware Language Model for Virtual Assistants
Mandana Saebi, Ernest Pusateri, Aaksha Meghawat, Christophe Van Gysel

Correcting Automated and Manual Speech Transcription Errors Using Warped Language Models
Mahdi Namazifar, John Malik, Li Erran Li, Gokhan Tur, Dilek Hakkani Tür


Novel Neural Network Architectures for ASR


Dynamic Encoder Transducer: A Flexible Solution for Trading Off Accuracy for Latency
Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

Domain-Aware Self-Attention for Multi-Domain Neural Machine Translation
Shiqi Zhang, Yan Liu, Deyi Xiong, Pei Zhang, Boxing Chen

Librispeech Transducer Model with Internal Language Model Prior Correction
Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, Hermann Ney

A Deliberation-Based Joint Acoustic and Text Decoder
Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu

On the Limit of English Conversational Speech Recognition
Zoltán Tüske, George Saon, Brian Kingsbury

Deformable TDNN with Adaptive Receptive Fields for Speech Recognition
Keyu An, Yi Zhang, Zhijian Ou

Transformer-Based End-to-End Speech Recognition with Residual Gaussian-Based Self-Attention
Chengdong Liang, Menglong Xu, Xiao-Lei Zhang

SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts
Zhao You, Shulin Feng, Dan Su, Dong Yu

Online Compressive Transformer for End-to-End Speech Recognition
Chi-Hang Leong, Yu-Han Huang, Jen-Tzung Chien

End to End Transformer-Based Contextual Speech Recognition Based on Pointer Network
Binghuai Lin, Liyuan Wang

A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition
Shigeki Karita, Yotaro Kubo, Michiel Adriaan Unico Bacchiani, Llion Jones

Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers
Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

Transformer-Based ASR Incorporating Time-Reduction Layer and Fine-Tuning with Self-Knowledge Distillation
Md. Akmal Haidar, Chao Xing, Mehdi Rezagholizadeh

Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios
Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer


Speech Localization, Enhancement, and Quality Assessment


Difference in Perceived Speech Signal Quality Assessment Among Monolingual and Bilingual Teenage Students
Przemyslaw Falkowski-Gilski

PILOT: Introducing Transformers for Probabilistic Sound Event Localization
Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

Sound Source Localization with Majorization Minimization
Masahito Togami, Robin Scheibler

NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets
Gabriel Mittag, Babak Naderi, Assmaa Chehadi, Sebastian Möller

Subjective Evaluation of Noise Suppression Algorithms in Crowdsourcing
Babak Naderi, Ross Cutler

Reliable Intensity Vector Selection for Multi-Source Direction-of-Arrival Estimation Using a Single Acoustic Vector Sensor
Jianhua Geng, Sifan Wang, Juan Li, JingWei Li, Xin Lou

MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment
Meng Yu, Chunlei Zhang, Yong Xu, Shi-Xiong Zhang, Dong Yu

CNN-Based Processing of Acoustic and Radio Frequency Signals for Speaker Localization from MAVs
Andrea Toma, Daniele Salvati, Carlo Drioli, Gian Luca Foresti

Assessment of von Mises-Bernoulli Deep Neural Network in Sound Source Localization
Katsutoshi Itoyama, Yoshiya Morimoto, Shungo Masaki, Ryosuke Kojima, Kenji Nishida, Kazuhiro Nakadai

Feature Fusion by Attention Networks for Robust DOA Estimation
Rongliang Liu, Nengheng Zheng, Xi Chen

Far-Field Speaker Localization and Adaptive GLMB Tracking
Shoufeng Lin, Zhaojie Luo

On the Design of Deep Priors for Unsupervised Audio Restoration
Vivek Sivaraman Narayanaswamy, Jayaraman J. Thiagarajan, Andreas Spanias

Cramér-Rao Lower Bound for DOA Estimation with an Array of Directional Microphones in Reverberant Environments
Weiguang Chen, Cheng Xue, Xionghu Zhong


Speech Synthesis: Neural Waveform Generation


GAN Vocoder: Multi-Resolution Discriminator Is All You Need
Jaeseong You, Dalhyun Kim, Gyuhyeon Nam, Geumbyeol Hwang, Gyeongsu Chae

Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis
Jian Cong, Shan Yang, Lei Xie, Dan Su

Unified Source-Filter GAN: Unified Source-Filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN
Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda

Harmonic WaveGAN: GAN-Based Speech Waveform Generation Model with Harmonic Structure Discriminator
Kazuki Mizuta, Tomoki Koriyama, Hiroshi Saruwatari

Fre-GAN: Adversarial Frequency-Consistent Audio Synthesis
Ji-Hoon Kim, Sang-Hoon Lee, Ji-Hyun Lee, Seong-Whan Lee

GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis
Jinhyeok Yang, Jae-Sung Bae, Taejun Bak, Young-Ik Kim, Hoon-Young Cho

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation
Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, Juntae Kim

Continuous Wavelet Vocoder-Based Decomposition of Parametric Speech Waveform Synthesis
Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Csaba Zainkó, Géza Németh

High-Fidelity and Low-Latency Universal Neural Vocoder Based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling
Patrick Lumban Tobing, Tomoki Toda

Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition
Zhengxi Liu, Yanmin Qian

High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-Plus-Noise Model
Min-Jae Hwang, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim


Spoken Machine Translation


SpecRec: An Alternative Solution for Improving End-to-End Speech-to-Text Translation via Spectrogram Reconstruction
Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang

Subtitle Translation as Markup Translation
Colin Cherry, Naveen Arivazhagan, Dirk Padfield, Maxim Krikun

Large-Scale Self- and Semi-Supervised Learning for Speech Translation
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau

CoVoST 2 and Massively Multilingual Speech Translation
Changhan Wang, Anne Wu, Jiatao Gu, Juan Pino

AlloST: Low-Resource Speech Translation Without Source Transcription
Yao-Fei Cheng, Hung-Shin Lee, Hsin-Min Wang

Weakly-Supervised Speech-to-Text Mapping with Visually Connected Non-Parallel Speech-Text Data Using Cyclic Partially-Aligned Transformer
Johanes Effendi, Sakriani Sakti, Satoshi Nakamura

Transcribing Paralinguistic Acoustic Cues to Target Language Text in Transformer-Based Speech-to-Text Translation
Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura

End-to-End Speech Translation via Cross-Modal Progressive Training
Rong Ye, Mingxuan Wang, Lei Li

ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation
Yuka Ko, Katsuhito Sudoh, Sakriani Sakti, Satoshi Nakamura

Towards Simultaneous Machine Interpretation
Alejandro Pérez-González-de-Martos, Javier Iranzo-Sánchez, Adrià Giménez Pastor, Javier Jorge, Joan-Albert Silvestre-Cerdà, Jorge Civera, Albert Sanchis, Alfons Juan

Lexical Modeling of ASR Errors for Robust Speech Translation
Giuseppe Martucci, Mauro Cettolo, Matteo Negri, Marco Turchi

Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation
Piyush Vyas, Anastasia Kuznetsova, Donald S. Williamson

Effects of Feature Scaling and Fusion on Sign Language Translation
Tejaswini Ananthanarayana, Lipisha Chaudhary, Ifeoma Nwogu


Cross/Multi-Lingual and Code-Switched ASR


Bootstrap an End-to-End ASR System by Multilingual Training, Transfer Learning, Text-to-Text Mapping and Synthetic Audio
Manuel Giollo, Deniz Gunceler, Yulan Liu, Daniel Willett

Efficient Weight Factorization for Multilingual Speech Recognition
Ngoc-Quan Pham, Tuan-Nam Nguyen, Sebastian Stüker, Alex Waibel

Unsupervised Cross-Lingual Representation Learning for Speech Recognition
Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli

Language and Speaker-Independent Feature Transformation for End-to-End Multilingual Speech Recognition
Tomoaki Hayakawa, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki

Using Large Self-Supervised Models for Low-Resource Speech Recognition
Krishna D. N, Pinyi Wang, Bruno Bozza

Dual Script E2E Framework for Multilingual and Code-Switching ASR
Mari Ganesh Kumar, Jom Kuriakose, Anand Thyagachandran, Arun Kumar A, Ashish Seth, Lodagala V.S.V. Durga Prasad, Saish Jaiswal, Anusha Prakash, Hema A. Murthy

MUCS 2021: Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages
Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan

Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition
Genta Indra Winata, Guangsen Wang, Caiming Xiong, Steven Hoi

SRI-B End-to-End System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages
Hardik Sailor, Kiran Praveen T, Vikas Agrawal, Abhinav Jain, Abhishek Pandey

Hierarchical Phone Recognition with Compositional Phonetics
Xinjian Li, Juncheng Li, Florian Metze, Alan W. Black

Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR
Shammur Absar Chowdhury, Amir Hussein, Ahmed Abdelali, Ahmed Ali

Differentiable Allophone Graphs for Language-Universal Speech Recognition
Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe


Health and Affect II


Automatic Speech Recognition Systems Errors for Objective Sleepiness Detection Through Voice
Vincent P. Martin, Jean-Luc Rouas, Florian Boyer, Pierre Philip

Robust Laughter Detection in Noisy Environments
Jon Gillick, Wesley Deng, Kimiko Ryokai, David Bamman

Impact of Emotional State on Estimation of Willingness to Buy from Advertising Speech
Mizuki Nagano, Yusuke Ijima, Sadao Hiroya

Stacked Recurrent Neural Networks for Speech-Based Inference of Attachment Condition in School Age Children
Huda Alsofyani, Alessandro Vinciarelli

Language or Paralanguage, This is the Problem: Comparing Depressed and Non-Depressed Speakers Through the Analysis of Gated Multimodal Units
Nujud Aloshban, Anna Esposito, Alessandro Vinciarelli

Emotion Carrier Recognition from Personal Narratives
Aniruddha Tammewar, Alessandra Cervone, Giuseppe Riccardi

Non-Verbal Vocalisation and Laughter Detection Using Sequence-to-Sequence Models and Multi-Label Training
Scott Condron, Georgia Clarke, Anita Klementiev, Daniela Morse-Kopp, Jack Parry, Dimitri Palaz

TDCA-Net: Time-Domain Channel Attention Network for Depression Detection
Cong Cai, Mingyue Niu, Bin Liu, Jianhua Tao, Xuefei Liu

Visual Speech for Obstructive Sleep Apnea Detection
Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso

Analysis of Contextual Voice Changes in Remote Meetings
Hector A. Cordourier Maruri, Sinem Aslan, Georg Stemmer, Nese Alyuz, Lama Nachman

Speech Based Depression Severity Level Classification Using a Multi-Stage Dilated CNN-LSTM Model
Nadee Seneviratne, Carol Espy-Wilson


Neural Network Training Methods for ASR


Multi-Domain Knowledge Distillation via Uncertainty-Matching for End-to-End ASR Models
Ho-Gyeong Kim, Min-Joong Lee, Hoshik Lee, Tae Gyoon Kang, Jihyun Lee, Eunho Yang, Sung Ju Hwang

Learning a Neural Diff for Speech Models
Jonathan Macoskey, Grant P. Strimel, Ariya Rastrow

Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models
Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

Model-Agnostic Fast Adaptive Multi-Objective Balancing Algorithm for Multilingual Automatic Speech Recognition Model Training
Jiabin Xue, Tieran Zheng, Jiqing Han

Towards Lifelong Learning of End-to-End ASR
Heng-Jui Chang, Hung-yi Lee, Lin-shan Lee

Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence
Isabel Leal, Neeraj Gaur, Parisa Haghani, Brian Farris, Pedro J. Moreno, Manasa Prasad, Bhuvana Ramabhadran, Yun Zhu

Regularizing Word Segmentation by Creating Misspellings
Hainan Xu, Kartik Audhkhasi, Yinghui Huang, Jesse Emond, Bhuvana Ramabhadran

Multitask Training with Text Data for End-to-End Speech Recognition
Peidong Wang, Tara N. Sainath, Ron J. Weiss

Emitting Word Timings with HMM-Free End-to-End System in Automatic Speech Recognition
Xianzhao Chen, Hao Ni, Yi He, Kang Wang, Zejun Ma, Zongxia Xie

Scaling Laws for Acoustic Models
Jasha Droppo, Oguz Elibol

Leveraging Non-Target Language Resources to Improve ASR Performance in a Target Language
Jayadev Billa

4-Bit Quantization of LSTM-Based Speech Recognition Models
Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan

Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation
Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition
Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong

Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning
Dongcheng Jiang, Chao Zhang, Philip C. Woodland


Prosodic Features and Structure


How f0 and Phrase Position Affect Papuan Malay Word Identification
Constantijn Kaland, Matthew Gordon

On the Feasibility of the Danish Model of Intonational Transcription: Phonetic Evidence from Jutlandic Danish
Anna Bothe Jespersen, Pavel Šturm, Míša Hejná

An Experiment in Paratone Detection in a Prosodically Annotated EAP Spoken Corpus
Adrien Méli, Nicolas Ballier, Achille Falaise, Alice Henderson

ProsoBeast Prosody Annotation Tool
Branislav Gerazov, Michael Wagner

Assessing the Use of Prosody in Constituency Parsing of Imperfect Transcripts
Trang Tran, Mari Ostendorf

Targeted and Targetless Neutral Tones in Taiwanese Southern Min
Roger Cheng-yen Liu, Feng-fan Hsieh, Yueh-chin Chang

The Interaction of Word Complexity and Word Duration in an Agglutinative Language
Mária Gósy, Kálmán Abari

Taiwan Min Nan (Taiwanese) Checked Tones Sound Change
Ho-hsien Pan, Shao-ren Lyu

In-Group Advantage in the Perception of Emotions: Evidence from Three Varieties of German
Moritz Jakob, Bettina Braun, Katharina Zahner-Ritter

The LF Model in the Frequency Domain for Glottal Airflow Modelling Without Aliasing Distortion
Christer Gobl

Parsing Speech for Grouping and Prominence, and the Typology of Rhythm
Michael Wagner, Alvaro Iturralde Zurita, Sijia Zhang

Prosody of Case Markers in Urdu
Benazir Mumtaz, Massimiliano Canzi, Miriam Butt

Articulatory Characteristics of Icelandic Voiced Fricative Lenition: Gradience, Categoricity, and Speaker/Gesture-Specific Effects
Brynhildur Stefansdottir, Francesco Burroni, Sam Tilsen

Leveraging the Uniformity Framework to Examine Crosslinguistic Similarity for Long-Lag Stops in Spontaneous Cantonese-English Bilingual Speech
Khia A. Johnson


Single-Channel Speech Enhancement


Personalized Speech Enhancement Through Self-Supervised Data Augmentation and Purification
Aswin Sivaraman, Sunwoo Kim, Minje Kim

Speech Denoising with Auditory Models
Mark R. Saddler, Andrew Francl, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott

Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement
Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka

Multi-Stage Progressive Speech Enhancement Network
Xinmeng Xu, Yang Wang, Dongxiang Xu, Yiyuan Peng, Cong Zhang, Jie Jia, Binbin Chen

Single-Channel Speech Enhancement Using Learnable Loss Mixup
Oscar Chang, Dung N. Tran, Kazuhito Koishida

A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement
Xiao-Qi Zhang, Jun Du, Li Chai, Chin-Hui Lee

Whisper Speech Enhancement Using Joint Variational Autoencoder for Improved Speech Recognition
Vikas Agrawal, Shashi Kumar, Shakti P. Rath

DEMUCS-Mobile: On-Device Lightweight Speech Enhancement
Lukas Lee, Youna Ji, Minjae Lee, Min-Seok Choi

Speech Denoising Without Clean Training Data: A Noise2Noise Approach
Madhav Mahesh Kashyap, Anuj Tambwekar, Krishnamoorthy Manohara, S. Natarajan

Improved Speech Enhancement Using a Complex-Domain GAN with Fused Time-Domain and Time-Frequency Domain Constraints
Feng Dang, Pengyuan Zhang, Hangting Chen

Speech Enhancement with Topology-Enhanced Generative Adversarial Networks (GANs)
Xudong Zhang, Liang Zhao, Feng Gu

Learning Speech Structure to Improve Time-Frequency Masks
Suliang Bu, Yunxin Zhao, Shaojun Wang, Mei Han

SE-Conformer: Time-Domain Speech Enhancement Using Conformer
Eesung Kim, Hyeji Seo


Assessment of Pathological Speech and Language II


Speech Intelligibility of Dysarthric Speech: Human Scores and Acoustic-Phonetic Features
Wei Xue, Roeland van Hout, Fleur Boogmans, Mario Ganzeboom, Catia Cucchiarini, Helmer Strik

Analyzing Short Term Dynamic Speech Features for Understanding Behavioral Traits of Children with Autism Spectrum Disorder
Young-Kyung Kim, Rimita Lahiri, Md. Nasir, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth S. Narayanan

Vocalization Recognition of People with Profound Intellectual and Multiple Disabilities (PIMD) Using Machine Learning Algorithms
Waldemar Jęśko

Phonetic Complexity, Speech Accuracy and Intelligibility Assessment of Italian Dysarthric Speech
Barbara Gili Fivela, Vincenzo Sallustio, Silvia Pede, Danilo Patrocinio

Detection of Consonant Errors in Disordered Speech Based on Consonant-Vowel Segment Embedding
Si-Ioi Ng, Cymie Wing-Yee Ng, Jingyu Li, Tan Lee

Assessing Posterior-Based Mispronunciation Detection on Field-Collected Recordings from Child Speech Therapy Sessions
Adam Hair, Guanlong Zhao, Beena Ahmed, Kirrie J. Ballard, Ricardo Gutierrez-Osuna

Identifying Cognitive Impairment Using Sentence Representation Vectors
Bahman Mirheidari, Yilin Pan, Daniel Blackburn, Ronan O’Malley, Heidi Christensen

Parental Spoken Scaffolding and Narrative Skills in Crowd-Sourced Storytelling Samples of Young Children
Zhengjun Yue, Jon Barker, Heidi Christensen, Cristina McKean, Elaine Ashton, Yvonne Wren, Swapnil Gadgil, Rebecca Bright

Uncertainty-Aware COVID-19 Detection from Imbalanced Sound Data
Tong Xia, Jing Han, Lorena Qendro, Ting Dang, Cecilia Mascolo

Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization
Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

Source and Vocal Tract Cues for Speech-Based Classification of Patients with Parkinson’s Disease and Healthy Subjects
Tanuka Bhattacharjee, Jhansi Mallela, Yamini Belur, Nalini Atchayaram, Ravi Yadav, Pradeep Reddy, Dipanjan Gope, Prasanta Kumar Ghosh

CLAC: A Speech Corpus of Healthy English Speakers
R’mani Haulcy, James Glass


Source Separation I


Ultra Fast Speech Separation Model with Teacher Student Learning
Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu

Group Delay Based Re-Weighted Sparse Recovery Algorithms for Robust and High-Resolution Source Separation in DOA Framework
Murtiza Ali, Ashwani Koul, Karan Nathwani

Continuous Speech Separation Using Speaker Inventory for Long Recording
Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen

Crossfire Conditional Generative Adversarial Networks for Singing Voice Extraction
Weitao Yuan, Shengbei Wang, Xiangrui Li, Masashi Unoki, Wenwu Wang

End-to-End Speech Separation Using Orthogonal Representation in Complex and Real Time-Frequency Domain
Kai Wang, Hao Huang, Ying Hu, Zhihua Huang, Sheng Li

Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation
Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi

Stabilizing Label Assignment for Speech Separation by Self-Supervised Pre-Training
Sung-Feng Huang, Shun-Po Chuang, Da-Rong Liu, Yi-Chen Chen, Gene-Ping Yang, Hung-yi Lee

Dual-Path Filter Network: Speaker-Aware Modeling for Speech Separation
Fan-Lin Wang, Yu-Huai Peng, Hung-Shin Lee, Hsin-Min Wang

Investigation of Practical Aspects of Single Channel Speech Separation for ASR
Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li

Implicit Filter-and-Sum Network for End-to-End Multi-Channel Speech Separation
Yi Luo, Nima Mesgarani

Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation
Yong Xu, Zhuohuang Zhang, Meng Yu, Shi-Xiong Zhang, Dong Yu


Multi- and Cross-Lingual ASR, Other Topics in ASR


Cross-Domain Speech Recognition with Unsupervised Character-Level Distribution Matching
Wenxin Hou, Jindong Wang, Xu Tan, Tao Qin, Takahiro Shinozaki

Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone
Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer
Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong

Reducing Streaming ASR Model Delay with Self Alignment
Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak

Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages
Anuj Diwan, Preethi Jyothi

Knowledge Distillation Based Training of Universal ASR Source Models for Cross-Lingual Transfer
Takashi Fukuda, Samuel Thomas

Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End
Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo

Exploring Targeted Universal Adversarial Perturbations to End-to-End ASR Models
Zhiyun Lu, Wei Han, Yu Zhang, Liangliang Cao

Earnings-21: A Practical Benchmark for ASR in the Wild
Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Huang, Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr Żelasko, Miguel Jetté

Improving Multilingual Transformer Transducer Models by Reducing Language Confusions
Eric Sun, Jinyu Li, Zhong Meng, Yu Wu, Jian Xue, Shujie Liu, Yifan Gong

Arabic Code-Switching Speech Recognition Using Monolingual Data
Ahmed Ali, Shammur Absar Chowdhury, Amir Hussein, Yasser Hifny


Source Separation II


Online Blind Audio Source Separation Using Recursive Expectation-Maximization
Aviad Eisenberg, Boaz Schwartz, Sharon Gannot

Empirical Analysis of Generalized Iterative Speech Separation Networks
Yi Luo, Cong Han, Nima Mesgarani

Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation of Arbitrary Numbers of Speakers
Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

Teacher-Student MixIT for Unsupervised and Semi-Supervised Speech Separation
Jisi Zhang, Cătălin Zorilă, Rama Doddipatla, Jon Barker

Few-Shot Learning of New Sound Classes for Target Sound Extraction
Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki

Binaural Speech Separation of Moving Speakers With Preserved Spatial Cues
Cong Han, Yi Luo, Nima Mesgarani

AvaTr: One-Shot Speaker Extraction with Transformers
Shell Xu Hu, Md. Rifat Arefin, Viet-Nhat Nguyen, Alish Dipani, Xaq Pitkow, Andreas Savas Tolias

Vocal Harmony Separation Using Time-Domain Neural Networks
Saurjya Sarkar, Emmanouil Benetos, Mark Sandler

Speaker Verification-Based Evaluation of Single-Channel Speech Separation
Matthew Maciejewski, Shinji Watanabe, Sanjeev Khudanpur

Improved Speech Separation with Time-and-Frequency Cross-Domain Feature Selection
Tian Lan, Yuxin Qian, Yilan Lyu, Refuoe Mokhosi, Wenxin Tai, Qiao Liu

Robust Speaker Extraction Network Based on Iterative Refined Adaptation
Chengyun Deng, Shiqian Ma, Yongtao Sha, Yi Zhang, Hui Zhang, Hui Song, Fei Wang

Neural Speaker Extraction with Speaker-Speech Cross-Attention Network
Wupeng Wang, Chenglin Xu, Meng Ge, Haizhou Li

Deep Audio-Visual Speech Separation Based on Facial Motion
Rémi Rigal, Jacques Chodorowski, Benoît Zerr



Speech Synthesis: Toward End-to-End Synthesis I


Federated Learning with Dynamic Transformer for Text to Speech
Zhenhou Hong, Jianzong Wang, Xiaoyang Qu, Jie Liu, Chendong Zhao, Jing Xiao

LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks
Huu-Kim Nguyen, Kihyuk Jeong, Seyun Um, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Dacheng Yin, Yucheng Zhao, Wenjun Zeng

Diff-TTS: A Denoising Diffusion Model for Text-to-Speech
Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim

Hierarchical Context-Aware Transformers for Non-Autoregressive Text to Speech
Jae-Sung Bae, Taejun Bak, Young-Sun Joo, Hoon-Young Cho

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux

A Learned Conditional Prior for the VAE Acoustic Space of a TTS System
Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo-Trueba, Thomas Drugman

A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning Based on Rényi Divergence Minimization
Dipjyoti Paul, Sankar Mukherjee, Yannis Pantazis, Yannis Stylianou

Relational Data Selection for Data Augmentation of Speaker-Dependent Multi-Band MelGAN Vocoder
Yi-Chiao Wu, Cheng-Hung Hu, Hung-Shin Lee, Yu-Huai Peng, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda

Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech
Hyunseung Chung, Sang-Hoon Lee, Seong-Whan Lee

Triple M: A Practical Text-to-Speech Synthesis System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet
Shilun Lin, Fenglong Xie, Li Meng, Xinhui Li, Li Lu

SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Jr., Anderson da Silva Soares, Sandra Maria Aluisio, Moacir Antonelli Ponti


Tools, Corpora and Resources


Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset
Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass

The Multilingual TEDx Corpus for Speech Recognition and Translation
Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, Matt Post

Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments
David R. Mortensen, Jordan Picone, Xinjian Li, Kathleen Siminyu

AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario
Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, Jingdong Chen

GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio
Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, Zhiyong Yan

Look Who’s Talking: Active Speaker Detection in the Wild
You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

AusKidTalk: An Auditory-Visual Corpus of 3- to 12-Year-Old Australian Children’s Speech
Beena Ahmed, Kirrie J. Ballard, Denis Burnham, Tharmakulasingam Sirojan, Hadi Mehmood, Dominique Estival, Elise Baker, Felicity Cox, Joanne Arciuli, Titia Benders, Katherine Demuth, Barbara Kelly, Chloé Diskin-Holdaway, Mostafa Shahin, Vidhyasaharan Sethu, Julien Epps, Chwee Beng Lee, Eliathamby Ambikairajah

Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson
Per Fallgren, Jens Edlund

Annotation Confidence vs. Training Sample Size: Trade-Off Solution for Partially-Continuous Categorical Emotion Recognition
Elena Ryumina, Oxana Verkholyak, Alexey Karpov

Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization
Gonçal V. Garcés Díaz-Munío, Joan-Albert Silvestre-Cerdà, Javier Jorge, Adrià Giménez Pastor, Javier Iranzo-Sánchez, Pau Baquero-Arnal, Nahuel Roselló, Alejandro Pérez-González-de-Martos, Jorge Civera, Albert Sanchis, Alfons Juan

Towards Automatic Speech to Sign Language Generation
Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu B. Hegde, Vinay Namboodiri, C.V. Jawahar

kosp2e: Korean Speech to English Translation Corpus
Won Ik Cho, Seok Min Kim, Hyunchang Cho, Nam Soo Kim

speechocean762: An Open-Source Non-Native English Speech Corpus for Pronunciation Assessment
Junbo Zhang, Zhiwen Zhang, Yongqing Wang, Zhiyong Yan, Qiong Song, Yukai Huang, Ke Li, Daniel Povey, Yujun Wang


Non-Autoregressive Sequential Modeling for Speech Processing


An Improved Single Step Non-Autoregressive Transformer for Automatic Speech Recognition
Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao, Abeer Alwan

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain
Pengcheng Guo, Xuankai Chang, Shinji Watanabe, Lei Xie

Pushing the Limits of Non-Autoregressive Speech Recognition
Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, William Chan

Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies
Alexander H. Liu, Yu-An Chung, James Glass

Relaxing the Conditional Independence Assumption of CTC-Based ASR by Conditioning on Intermediate Predictions
Jumon Nozaki, Tatsuya Komatsu

Toward Streaming ASR with Non-Autoregressive Insertion-Based Model
Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi

Layer Pruning on Demand with Intermediate CTC
Jaesong Lee, Jingu Kang, Shinji Watanabe

Real-Time End-to-End Monaural Multi-Speaker Speech Recognition
Song Li, Beibei Ouyang, Fuchuan Tong, Dexin Liao, Lin Li, Qingyang Hong

Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models
Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe

TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis
Stanislav Beliaev, Boris Ginsburg

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan

Align-Denoise: Single-Pass Non-Autoregressive Speech Recognition
Nanxin Chen, Piotr Żelasko, Laureano Moro-Velázquez, Jesús Villalba, Najim Dehak

VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis
Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng


The ADReSSo Challenge: Detecting Cognitive Decline Using Speech Only


Detecting Cognitive Decline Using Speech Only: The ADReSSo Challenge
Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, Brian MacWhinney

Influence of the Interviewer on the Automatic Assessment of Alzheimer’s Disease in the Context of the ADReSSo Challenge
P.A. Pérez-Toro, S.P. Bayerl, T. Arias-Vergara, J.C. Vásquez-Correa, P. Klumpp, M. Schuster, Elmar Nöth, J.R. Orozco-Arroyave, K. Riedhammer

WavBERT: Exploiting Semantic and Non-Semantic Speech Using Wav2vec and BERT for Dementia Detection
Youxiang Zhu, Abdelrahman Obyat, Xiaohui Liang, John A. Batsis, Robert M. Roth

Alzheimer Disease Recognition Using Speech-Based Embeddings From Pre-Trained Models
Lara Gauder, Leonardo Pepino, Luciana Ferrer, Pablo Riera

Comparing Acoustic-Based Approaches for Alzheimer’s Disease Detection
Aparna Balagopalan, Jekaterina Novikova

Alzheimer’s Disease Detection from Spontaneous Speech Through Combining Linguistic Complexity and (Dis)Fluency Features with Pretrained Language Models
Yu Qiao, Xuefeng Yin, Daniel Wiechmann, Elma Kerz

Using the Outputs of Different Automatic Speech Recognition Paradigms for Acoustic- and BERT-Based Alzheimer’s Dementia Detection Through Spontaneous Speech
Yilin Pan, Bahman Mirheidari, Jennifer M. Harris, Jennifer C. Thompson, Matthew Jones, Julie S. Snowden, Daniel Blackburn, Heidi Christensen

Tackling the ADRESSO Challenge 2021: The MUET-RMIT System for Alzheimer’s Dementia Recognition from Spontaneous Speech
Zafi Sherhan Syed, Muhammad Shehram Shah Syed, Margaret Lech, Elena Pirogova

Alzheimer’s Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs
Morteza Rohanian, Julian Hough, Matthew Purver

Automatic Detection and Assessment of Alzheimer Disease Using Speech and Language Technologies in Low-Resource Scenarios
Raghavendra Pappagari, Jaejin Cho, Sonal Joshi, Laureano Moro-Velázquez, Piotr Żelasko, Jesús Villalba, Najim Dehak

Automatic Detection of Alzheimer’s Disease Using Spontaneous Speech Only
Jun Chen, Jieping Ye, Fengyi Tang, Jiayu Zhou

Modular Multi-Modal Attention Network for Alzheimer’s Disease Detection Using Patient Audio and Language Data
Ning Wang, Yupeng Cao, Shuai Hao, Zongru Shao, K.P. Subbalakshmi


Non-Native Speech


Cross-Linguistic Perception of the Japanese Singleton/Geminate Contrast: Korean, Mandarin and Mongolian Compared
Kimiko Tsukada, Yurong, Joo-Yeon Kim, Jeong-Im Han, John Hajek

Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention
Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek

Testing Acoustic Voice Quality Classification Across Languages and Speech Styles
Bettina Braun, Nicole Dehé, Marieke Einfeldt, Daniela Wochner, Katharina Zahner-Ritter

Acquisition of Prosodic Focus Marking by Three- to Six-Year-Old Children Learning Mandarin Chinese
Qianyutong Zhang, Kexin Lyu, Zening Chen, Ping Tang

Adaptive Listening Difficulty Detection for L2 Learners Through Moderating ASR Resources
Maryam Sadat Mirzaei, Kourosh Meshgi

F0 Patterns of L2 English Speech by Mandarin Chinese Learners
Hongwei Ding, Binghuai Lin, Liyuan Wang

A Neural Network-Based Noise Compensation Method for Pronunciation Assessment
Binghuai Lin, Liyuan Wang

Phonetic Distance and Surprisal in Multilingual Priming: Evidence from Slavic
Jacek Kudera, Philip Georgis, Bernd Möbius, Tania Avgustinova, Dietrich Klakow

A Preliminary Study on Discourse Prosody Encoding in L1 and L2 English Spontaneous Narratives
Yuqing Zhang, Zhu Li, Binghuai Lin, Jinsong Zhang

Transformer Based End-to-End Mispronunciation Detection and Diagnosis
Minglin Wu, Kun Li, Wai-Kim Leung, Helen Meng

L1 Identification from L2 Speech Using Neural Spectrogram Analysis
Calbert Graham


Phonetics II


Leveraging Real-Time MRI for Illuminating Linguistic Velum Action
Miran Oh, Dani Byrd, Shrikanth S. Narayanan

Segmental Alignment of English Syllables with Singleton and Cluster Onsets
Zirui Liu, Yi Xu

Exploration of Welsh English Pre-Aspiration: How Wide-Spread is it?
Míša Hejná

Revisiting Recall Effects of Filler Particles in German and English
Beeke Muhlack, Mikey Elmers, Heiner Drenhaus, Jürgen Trouvain, Marjolein van Os, Raphael Werner, Margarita Ryzhova, Bernd Möbius

How Reliable Are Phonetic Data Collected Remotely? Comparison of Recording Devices and Environments on Acoustic Measurements
Chunyu Ge, Yixuan Xiong, Peggy Mok

A Cross-Dialectal Comparison of Apical Vowels in Beijing Mandarin, Northeastern Mandarin and Southwestern Mandarin: An EMA and Ultrasound Study
Jing Huang, Feng-fan Hsieh, Yueh-chin Chang

Dissecting the Aero-Acoustic Parameters of Open Articulatory Transitions
Mark Gibson, Oihane Muxika, Marianne Pouplier

Quantifying Vocal Tract Shape Variation and its Acoustic Impact: A Geometric Morphometric Approach
Amelia J. Gully

Speech Perception and Loanword Adaptations: The Case of Copy-Vowel Epenthesis
Adriana Guevara-Rukoz, Shi Yu, Sharon Peperkamp

Speakers Coarticulate Less When Facing Real and Imagined Communicative Difficulties: An Analysis of Read and Spontaneous Speech from the LUCID Corpus
Zhe-chen Guo, Rajka Smiljanic

Developmental Changes of Vowel Acoustics in Adolescents
Einar Meister, Lya Meister

Context and Co-Text Influence on the Accuracy Production of Italian L2 Non-Native Sounds
Sonia d'Apolito, Barbara Gili Fivela

A New Vowel Normalization for Sociophonetics
Wilbert Heeringa, Hans Van de Velde

The Pacific Expansion: Optimizing Phonetic Transcription of Archival Corpora
Rosey Billington, Hywel Stoakes, Nick Thieberger


Search/Decoding Techniques and Confidence Measures for ASR


FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization
Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

LT-LM: A Novel Non-Autoregressive Language Model for Single-Shot Lattice Rescoring
Anton Mitrofanov, Mariya Korenevskaya, Ivan Podluzhny, Yuri Khokhlov, Aleksandr Laptev, Andrei Andrusenko, Aleksei Ilin, Maxim Korenevsky, Ivan Medennikov, Aleksei Romanenko

A Hybrid Seq-2-Seq ASR Design for On-Device and Server Applications
Cyril Allauzen, Ehsan Variani, Michael Riley, David Rybach, Hao Zhang

VAD-Free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
Hirofumi Inaguma, Tatsuya Kawahara

WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit
Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, Xin Lei

Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition
Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima

Deep Neural Network Calibration for E2E Speech Recognition System
Mun-Hak Lee, Joon-Hyuk Chang

Residual Energy-Based Models for End-to-End Speech Recognition
Qiujia Li, Yu Zhang, Bo Li, Liangliang Cao, Philip C. Woodland

Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction
David Qiu, Yanzhang He, Qiujia Li, Yu Zhang, Liangliang Cao, Ian McGraw

Insights on Neural Representations for End-to-End Speech Recognition
Anna Ollerenshaw, Md. Asif Jalal, Thomas Hain

Sequence-Level Confidence Classifier for ASR Utterance Accuracy and Application to Acoustic Models
Amber Afshan, Kshitiz Kumar, Jian Wu



Speech Type Classification and Diagnosis


An Agent for Competing with Humans in a Deceptive Game Based on Vocal Cues
Noa Mansbach, Evgeny Hershkovitch Neiterman, Amos Azaria

A Multi-Branch Deep Learning Network for Automated Detection of COVID-19
Ahmed Fakhry, Xinyi Jiang, Jaclyn Xiao, Gunvant Chaudhari, Asriel Han

RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform
Youxuan Ma, Zongze Ren, Shugong Xu

Fake Audio Detection in Resource-Constrained Settings Using Microfeatures
Hira Dhamyal, Ayesha Ali, Ihsan Ayyub Qazi, Agha Ali Raza

Coughing-Based Recognition of Covid-19 with Spatial Attentive ConvLSTM Recurrent Neural Networks
Tianhao Yan, Hao Meng, Emilia Parada-Cabaleiro, Shuo Liu, Meishu Song, Björn W. Schuller

Knowledge Distillation for Singing Voice Detection
Soumava Paul, Gurunath Reddy M, K. Sreenivasa Rao, Partha Pratim Das

Age Estimation with Speech-Age Model for Heterogeneous Speech Datasets
Ryu Takeda, Kazunori Komatani

Open-Set Audio Classification with Limited Training Resources Based on Augmentation Enhanced Variational Auto-Encoder GAN with Detection-Classification Joint Training
Kah Kuan Teh, Huy Dat Tran

Deep Spectral-Cepstral Fusion for Shouted and Normal Speech Classification
Takahiro Fukumori

Automatic Detection of Shouted Speech Segments in Indian News Debates
Shikha Baghel, Mrinmoy Bhattacharjee, S.R. Mahadeva Prasanna, Prithwijit Guha

Generalized Spoofing Detection Inspired from Audio Generation Artifacts
Yang Gao, Tyler Vuong, Mahsa Elyasi, Gaurav Bharaj, Rita Singh

Overlapped Speech Detection Based on Spectral and Spatial Feature Fusion
Weiguang Chen, Van Tung Pham, Eng Siong Chng, Xionghu Zhong


Spoken Term Detection & Voice Search


Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study
Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius, Dietrich Klakow

Paraphrase Label Alignment for Voice Application Retrieval in Spoken Language Understanding
Zheng Gao, Radhika Arava, Qian Hu, Xibin Gao, Thahir Mohamed, Wei Xiao, Mohamed AbdelHady

Personalized Keyphrase Detection Using Speaker and Environment Information
Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ding Zhao, Yiteng Huang, Arun Narayanan, Ian McGraw

Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation
Vineet Garg, Wonil Chang, Siddharth Sigtia, Saurabh Adya, Pramod Simha, Pranay Dighe, Chandra Dhir

Few-Shot Keyword Spotting in Any Language
Mark Mazumder, Colby Banbury, Josh Meyer, Pete Warden, Vijay Janapa Reddi

Text Anchor Based Metric Learning for Small-Footprint Keyword Spotting
Li Wang, Rongzhi Gu, Nuo Chen, Yuexian Zou

A Meta-Learning Approach for User-Defined Spoken Term Classification with Varying Classes and Examples
Yangbin Chen, Tom Ko, Jianping Wang

Auxiliary Sequence Labeling Tasks for Disfluency Detection
Dongyub Lee, Byeongil Ko, Myeong Cheol Shin, Taesun Whang, Daniel Lee, Eunhwa Kim, Eunggyun Kim, Jaechoon Jo

Energy-Friendly Keyword Spotting System Using Add-Based Convolution
Hang Zhou, Wenchao Hu, Yu Ting Yeung, Xiao Chen

The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results
Yan Jia, Xingming Wang, Xiaoyi Qin, Yinping Zhang, Xuyang Wang, Junjie Wang, Dong Zhang, Ming Li

Auto-KWS 2021 Challenge: Task, Datasets, and Baselines
Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-yi Lee, Lei Xie

Keyword Transformer: A Self-Attention Model for Keyword Spotting
Axel Berg, Mark O’Connor, Miguel Tairum Cruz

Teaching Keyword Spotters to Spot New Keywords with Limited Examples
Abhijeet Awasthi, Kevin Kilgour, Hassan Rom


Voice Anti-Spoofing and Countermeasure


A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection
Xin Wang, Junichi Yamagishi

An Initial Investigation for Detecting Partially Spoofed Audio
Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas Evans

Siamese Network with wav2vec Feature for Spoofing Speech Detection
Yang Xie, Zhenchuan Zhang, Yingchun Yang

Cross-Database Replay Detection in Terminal-Dependent Speaker Verification
Xingliang Cheng, Mingxing Xu, Thomas Fang Zheng

The Effect of Silence and Dual-Band Fusion in Anti-Spoofing System
Yuxiang Zhang, Wenchao Wang, Pengyuan Zhang

Pairing Weak with Strong: Twin Models for Defending Against Adversarial Attack on Speaker Verification
Zhiyuan Peng, Xu Li, Tan Lee

Attention-Based Convolutional Neural Network for ASV Spoofing Detection
Hefei Ling, Leichao Huang, Junrui Huang, Baiyan Zhang, Ping Li

Voting for the Right Answer: Adversarial Defense for Speaker Verification
Haibin Wu, Yang Zhang, Zhiyong Wu, Dong Wang, Hung-yi Lee

Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing
Tomi Kinnunen, Andreas Nautsch, Md. Sahidullah, Nicholas Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee

Representation Learning to Classify and Detect Adversarial Attacks Against Speaker and Speech Recognition Systems
Jesús Villalba, Sonal Joshi, Piotr Żelasko, Najim Dehak

An Empirical Study on Channel Effects for Synthetic Voice Spoofing Countermeasure Systems
You Zhang, Ge Zhu, Fei Jiang, Zhiyao Duan

Channel-Wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks
Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng

Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection
Wanying Ge, Michele Panariello, Jose Patino, Massimiliano Todisco, Nicholas Evans



Applications in Transcription, Education and Learning


Weakly-Supervised Word-Level Pronunciation Error Detection in Non-Native English Speech
Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek

End-to-End Speaker-Attributed ASR with Transformer
Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

Understanding Medical Conversations: Rich Transcription, Confidence Scores & Information Extraction
Hagen Soltau, Mingqiu Wang, Izhak Shafran, Laurent El Shafey

Phone-Level Pronunciation Scoring for Spanish Speakers Learning English Using a GOP-DNN System
Jazmín Vidal, Cyntia Bonomi, Marcelo Sancinetti, Luciana Ferrer

Explore wav2vec 2.0 for Mispronunciation Detection
Xiaoshuo Xu, Yueteng Kang, Songjun Cao, Binghuai Lin, Long Ma

Lexical Density Analysis of Word Productions in Japanese English Using Acoustic Word Embeddings
Shintaro Ando, Nobuaki Minematsu, Daisuke Saito

Deep Feature Transfer Learning for Automatic Pronunciation Assessment
Binghuai Lin, Liyuan Wang

Multilingual Speech Evaluation: Case Studies on English, Malay and Tamil
Huayun Zhang, Ke Shi, Nancy F. Chen

A Study on Fine-Tuning wav2vec2.0 Model for the Task of Mispronunciation Detection and Diagnosis
Linkai Peng, Kaiqi Fu, Binghuai Lin, Dengfeng Ke, Jinsong Zhang

The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech
Yu Qiao, Wei Zhou, Elma Kerz, Ralf Schlüter

End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning
Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi, Naoki Makishima

“You don’t understand me!”: Comparing ASR Results for L1 and L2 Speakers of Swedish
Ronald Cumbal, Birger Moell, José Lopes, Olov Engwall

NeMo Inverse Text Normalization: From Development to Production
Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg

Improvement of Automatic English Pronunciation Assessment with Small Number of Utterances Using Sentence Speakability
Satsuki Naijo, Akinori Ito, Takashi Nose



Resource-Constrained ASR


Compressing 1D Time-Channel Separable Convolutions Using Sparse Random Ternary Matrices
Gonçalo Mordido, Matthijs Van keirsbilck, Alexander Keller

Weakly Supervised Construction of ASR Systems from Massive Video Data
Mengli Cheng, Chengyu Wang, Jun Huang, Xiaobo Wang

Broadcasted Residual Learning for Efficient Keyword Spotting
Byeonggeun Kim, Simyung Chang, Jinkyu Lee, Dooyong Sung

CoDERT: Distilling Encoder Representations with Co-Learning for Transducer-Based Speech Recognition
Rupak Vignesh Swaminathan, Brian King, Grant P. Strimel, Jasha Droppo, Athanasios Mouchtaris

Extremely Low Footprint End-to-End ASR System for Smart Device
Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin

Dissecting User-Perceived Latency of On-Device E2E Speech Recognition
Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

Amortized Neural Networks for Low-Latency Speech Recognition
Jonathan Macoskey, Grant P. Strimel, Jinru Su, Ariya Rastrow

Tied & Reduced RNN-T Decoder
Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, Yanzhang He

PQK: Model Compression via Pruning, Quantization, and Knowledge Distillation
Jangho Kim, Simyung Chang, Nojun Kwak

Collaborative Training of Acoustic Encoders for Speech Recognition
Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra

Efficient Conformer with Prob-Sparse Attention Mechanism for End-to-End Speech Recognition
Xiong Wang, Sining Sun, Lei Xie, Long Ma

The Energy and Carbon Footprint of Training End-to-End Speech Recognizers
Titouan Parcollet, Mirco Ravanelli



Speech Synthesis: Speaking Style and Emotion


STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech
Keon Lee, Kyumin Park, Daeyoung Kim

Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability
Rui Liu, Berrak Sisman, Haizhou Li

Emotional Prosody Control for Speech Generation
Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi

Controllable Context-Aware Conversational Speech Synthesis
Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su

Expressive Text-to-Speech Using Style Tag
Minchan Kim, Sung Jun Cheon, Byoung Jin Choi, Jong Jin Kim, Nam Soo Kim

Adaptive Text to Speech for Spontaneous Style
Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan Shen, Wei-Qiang Zhang, Tie-Yan Liu

Towards Multi-Scale Style Control for Expressive Speech Synthesis
Xiang Li, Changhe Song, Jingbei Li, Zhiyong Wu, Jia Jia, Helen Meng

Cross-Speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis
Shifeng Pan, Lei He

Fine-Grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement
Daxin Tan, Tan Lee

Improving Performance of Seen and Unseen Speech Style Transfer in End-to-End Neural TTS
Xiaochun An, Frank K. Soong, Lei Xie

Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture
Slava Shechtman, Raul Fernandez, Alexander Sorin, David Haws


Spoken Language Understanding II


Intent Detection and Slot Filling for Vietnamese
Mai Hoang Dao, Thinh Hung Truong, Dat Quoc Nguyen

Augmenting Slot Values and Contexts for Spoken Language Understanding with Pretrained Models
Haitao Lin, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong

The Impact of Intent Distribution Mismatch on Semi-Supervised Spoken Language Understanding
Judith Gaspers, Quynh Do, Daniil Sorokin, Patrick Lehnen

Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification
Yidi Jiang, Bidisha Sharma, Maulik Madhavi, Haizhou Li

Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-Trained DNN-HMM-Based Acoustic-Phonetic Model
Nick J.C. Wang, Lu Wang, Yandan Sun, Haimei Kang, Dejun Zhang

Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs
Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang J. Kuo, Samuel Thomas, Edmilson Morais

End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining
Xianwei Zhang, Liang He

Factorization-Aware Training of Transformers for Natural Language Understanding on the Edge
Hamidreza Saghir, Samridhi Choudhary, Sepehr Eghbali, Clement Chung

End-to-End Spoken Language Understanding for Generalized Voice Assistants
Michael Saxon, Samridhi Choudhary, Joseph P. McKenna, Athanasios Mouchtaris

Bi-Directional Joint Neural Networks for Intent Classification and Slot Filling
Soyeon Caren Han, Siqu Long, Huichun Li, Henry Weld, Josiah Poon



Speech Recognition of Atypical Speech


Automatic Speech Recognition of Disordered Speech: Personalized Models Outperforming Human Listeners on Short Phrases
Jordan R. Green, Robert L. MacDonald, Pan-Pan Jiang, Julie Cattiau, Rus Heywood, Richard Cave, Katie Seaver, Marilyn A. Ladewig, Jimmy Tobin, Michael P. Brenner, Philip C. Nelson, Katrin Tomanek

Investigating the Utility of Multimodal Conversational Technology and Audiovisual Analytic Measures for the Assessment and Monitoring of Amyotrophic Lateral Sclerosis at Scale
Michael Neumann, Oliver Roesler, Jackson Liscombe, Hardik Kothare, David Suendermann-Oeft, David Pautler, Indu Navar, Aria Anvar, Jochen Kumm, Raquel Norel, Ernest Fraenkel, Alexander V. Sherman, James D. Berry, Gary L. Pattee, Jun Wang, Jordan R. Green, Vikram Ramanarayanan

Handling Acoustic Variation in Dysarthric Speech Recognition Systems Through Model Combination
Enno Hermann, Mathew Magimai-Doss

Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition
Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen Meng

Speaking with a KN95 Face Mask: ASR Performance and Speaker Compensation
Sarah E. Gutz, Hannah P. Rowe, Jordan R. Green

Adversarial Data Augmentation for Disordered Speech Recognition
Zengrui Jin, Mengzhe Geng, Xurong Xie, Jianwei Yu, Shansong Liu, Xunying Liu, Helen Meng

Variational Auto-Encoder Based Variability Encoding for Dysarthric Speech Recognition
Xurong Xie, Rukiye Ruzi, Xunying Liu, Lan Wang

Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion
Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng

Bayesian Parametric and Architectural Domain Adaptation of LF-MMI Trained TDNNs for Elderly and Dysarthric Speech Recognition
Jiajun Deng, Fabian Ritter Gutierrez, Shoukang Hu, Mengzhe Geng, Xurong Xie, Zi Ye, Shansong Liu, Jianwei Yu, Xunying Liu, Helen Meng

A Voice-Activated Switch for Persons with Motor and Speech Impairments: Isolated-Vowel Spotting Using Neural Networks
Shanqing Cai, Lisie Lillianfeld, Katie Seaver, Jordan R. Green, Michael P. Brenner, Philip C. Nelson, D. Sculley

Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech
Zhehuai Chen, Bhuvana Ramabhadran, Fadi Biadsy, Xia Zhang, Youzheng Chen, Liyang Jiang, Fang Chu, Rohan Doshi, Pedro J. Moreno

Disordered Speech Data Collection: Lessons Learned at 1 Million Utterances from Project Euphonia
Robert L. MacDonald, Pan-Pan Jiang, Julie Cattiau, Rus Heywood, Richard Cave, Katie Seaver, Marilyn A. Ladewig, Jimmy Tobin, Michael P. Brenner, Philip C. Nelson, Jordan R. Green, Katrin Tomanek

Automatic Severity Classification of Korean Dysarthric Speech Using Phoneme-Level Pronunciation Features
Eun Jung Yeo, Sunhee Kim, Minhwa Chung

Comparing Supervised Models and Learned Speech Representations for Classifying Intelligibility of Disordered Speech on Selected Phrases
Subhashini Venugopalan, Joel Shor, Manoj Plakal, Jimmy Tobin, Katrin Tomanek, Jordan R. Green, Michael P. Brenner

Analysis and Tuning of a Voice Assistant System for Dysfluent Speech
Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis Georgiou, Sachin Kajarekar, Jeffrey Bigham


Speech Synthesis: Other Topics

Disordered Speech

Speech Signal Analysis and Representation II

Feature, Embedding and Neural Architecture for Speaker Recognition

Speech Synthesis: Toward End-to-End Synthesis II

Speech Enhancement and Intelligibility

Spoken Dialogue Systems I

Topics in ASR: Robustness, Feature Extraction, and Far-Field ASR

Voice Activity Detection and Keyword Spotting

Voice and Voicing

The INTERSPEECH 2021 Computational Paralinguistics Challenge (ComParE) — COVID-19 Cough, COVID-19 Speech, Escalation & Primates

Survey Talk 1: Heidi Christensen

Embedding and Network Architecture for Speaker Recognition

Speech Perception I

Acoustic Event Detection and Acoustic Scene Classification

Diverse Modes of Speech Acquisition and Processing

Multi-Channel Speech Enhancement and Hearing Aids

Self-Supervision and Semi-Supervision for Neural ASR Training

Spoken Language Processing I

Voice Conversion and Adaptation II

Privacy-Preserving Machine Learning for Audio & Speech Processing

The First DiCOVA Challenge: Diagnosis of COVID-19 Using Acoustics

Show and Tell 1

Keynote 1: Hermann Ney

ASR Technologies and Systems

Phonation and Voicing

Health and Affect I

Robust Speaker Recognition

Source Separation, Dereverberation and Echo Cancellation

Speech Signal Analysis and Representation I

Spoken Language Understanding I

Topics in ASR: Adaptation, Transfer Learning, Children’s Speech, and Low-Resource Settings

Voice Conversion and Adaptation I

Voice Quality Characterization for Clinical Voice Assessment: Voice Production, Acoustics, and Auditory Perception

Miscellaneous Topics in ASR

Phonetics I

Target Speaker Detection, Localization and Separation

Language and Accent Recognition

Low-Resource Speech Recognition

Speech Synthesis: Singing, Multimodal, Crosslingual Synthesis

Speech Coding and Privacy

Speech Perception II

Streaming for ASR/RNN Transducers

ConferencingSpeech 2021 Challenge: Far-Field Multi-Channel Speech Enhancement for Video Conferencing

Survey Talk 2: Sriram Ganapathy

Keynote 2: Pascale Fung

Language Modeling and Text-Based Innovations for ASR

Speaker, Language, and Privacy

Assessment of Pathological Speech and Language I

Communication and Interaction, Multimodality

Language and Lexical Modeling for ASR

Novel Neural Network Architectures for ASR

Speech Localization, Enhancement, and Quality Assessment

Speech Synthesis: Neural Waveform Generation

Spoken Machine Translation

SdSV Challenge 2021: Analysis and Exploration of New Ideas on Short-Duration Speaker Verification

Show and Tell 2

Graph and End-to-End Learning for Speaker Recognition

Spoken Language Processing II

Speech and Audio Analysis

Cross/Multi-Lingual and Code-Switched ASR

Health and Affect II

Neural Network Training Methods for ASR

Prosodic Features and Structure

Single-Channel Speech Enhancement

Speech Synthesis: Tools, Data, Evaluation

INTERSPEECH 2021 Deep Noise Suppression Challenge

Neural Network Training Methods and Architectures for ASR

Emotion and Sentiment Analysis I

Linguistic Components in End-to-End ASR

Assessment of Pathological Speech and Language II

Multimodal Systems

Source Separation I

Speaker Diarization I

Speech Synthesis: Prosody Modeling I

Speech Production II

Spoken Dialogue Systems II

Oriental Language Recognition

Automatic Speech Recognition in Air Traffic Management

Show and Tell 3

Survey Talk 3: Karen Livescu

Keynote 3: Mounya Elhilali

Speech Production I

Speech Enhancement and Coding

Emotion and Sentiment Analysis II

Multi- and Cross-Lingual ASR, Other Topics in ASR

Source Separation II

Speaker Diarization II

Speech Synthesis: Toward End-to-End Synthesis I

Tools, Corpora and Resources

Non-Autoregressive Sequential Modeling for Speech Processing

The ADReSSo Challenge: Detecting Cognitive Decline Using Speech Only

Robust and Far-Field ASR

Speech Synthesis: Prosody Modeling II

Source Separation III

Non-Native Speech

Phonetics II

Search/Decoding Techniques and Confidence Measures for ASR

Speech Synthesis: Linguistic Processing, Paradigms and Other Topics

Speech Type Classification and Diagnosis

Spoken Term Detection & Voice Search

Voice Anti-Spoofing and Countermeasure

OpenASR20 and Low Resource ASR Development

Survey Talk 4: Alejandrina Cristia

Keynote 4: Tomáš Mikolov

Voice Activity Detection

Keyword Search and Spoken Language Processing

Applications in Transcription, Education and Learning

Emotion and Sentiment Analysis III

Resource-Constrained ASR

Speaker Recognition: Applications

Speech Synthesis: Speaking Style and Emotion

Spoken Language Understanding II

INTERSPEECH 2021 Acoustic Echo Cancellation Challenge

Speech Recognition of Atypical Speech

Show and Tell 4