ISCA Archive Interspeech 2020 Sessions Website

Interspeech 2020

Shanghai, China
25-29 October 2020

General Chair: Helen Meng, General Co-Chairs: Bo Xu and Thomas Zheng
doi: 10.21437/Interspeech.2020

ASR Neural Network Architectures I


On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition
Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu

SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition
Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin

Contextual RNN-T for Open Domain ASR
Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf

ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition
Jing Pan, Joshua Shapiro, Jeremy Wohlwend, Kyu J. Han, Tao Lei, Tao Ma

Compressing LSTM Networks with Hierarchical Coarse-Grain Sparsity
Deepak Kadetotad, Jian Meng, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo

BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example
Timo Lohrenz, Tim Fingscheidt

Relative Positional Encoding for Speech Recognition and Direct Translation
Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stüker, Jan Niehues, Alex Waibel

Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers
Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka

Implicit Transfer of Privileged Acoustic Information in a Generalized Knowledge Distillation Framework
Takashi Fukuda, Samuel Thomas

Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition
Jinhwan Park, Wonyong Sung


Multi-Channel Speech Enhancement


Deep Neural Network-Based Generalized Sidelobe Canceller for Robust Multi-Channel Speech Recognition
Guanjun Li, Shan Liang, Shuai Nie, Wenju Liu, Zhanlei Yang, Longshuai Xiao

Neural Spatio-Temporal Beamformer for Target Speech Separation
Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Chao Weng, Jianming Liu, Dong Yu

Online Directional Speech Enhancement Using Geometrically Constrained Independent Vector Analysis
Li Li, Kazuhito Koishida, Shoji Makino

End-to-End Multi-Look Keyword Spotting
Meng Yu, Xuan Ji, Bo Wu, Dan Su, Dong Yu

Differential Beamforming for Uniform Circular Array with Directional Microphones
Weilong Huang, Jinwei Feng

Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement
Jun Qi, Hu Hu, Yannan Wang, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

An End-to-End Architecture of Online Multi-Channel Speech Separation
Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Edward Lin, Yi Luo, Lei Xie

Mentoring-Reverse Mentoring for Unsupervised Multi-Channel Speech Source Separation
Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi

Computationally Efficient and Versatile Framework for Joint Optimization of Blind Speech Separation and Dereverberation
Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Shoko Araki

A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-Channel Speech Recognition in the CHiME-6 Challenge
Yan-Hui Tu, Jun Du, Lei Sun, Feng Ma, Jia Pan, Chin-Hui Lee


Speech, Language, and Multimodal Resources


ATCSpeech: A Multilingual Pilot-Controller Speech Corpus from Real Air Traffic Control Environment
Bo Yang, Xianlong Tan, Zhengmao Chen, Bing Wang, Min Ruan, Dan Li, Zhongping Yang, Xiping Wu, Yi Lin

Developing an Open-Source Corpus of Yoruba Speech
Alexander Gutkin, Işın Demirşahin, Oddur Kjartansson, Clara Rivera, Kọ́lá Túbọ̀sún

ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers
Jung-Woo Ha, Kihyun Nam, Jingu Kang, Sang-Woo Lee, Sohee Yang, Hyunhoon Jung, Hyeji Kim, Eunmi Kim, Soojin Kim, Hyun Ah Kim, Kyoungtae Doh, Chan Kyu Lee, Nako Sung, Sunghun Kim

LAIX Corpus of Chinese Learner English: Towards a Benchmark for L2 English ASR
Yanhong Wang, Huan Luan, Jiahong Yuan, Bin Wang, Hui Lin

Design and Development of a Human-Machine Dialog Corpus for the Automated Assessment of Conversational English Proficiency
Vikram Ramanarayanan

CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment
Si-Ioi Ng, Cymie Wing-Yee Ng, Jiarui Wang, Tan Lee, Kathy Yuet-Sheung Lee, Michael Chi-Fai Tong

FinChat: Corpus and Evaluation Setup for Finnish Chat Conversations on Everyday Topics
Katri Leino, Juho Leinonen, Mittul Singh, Sami Virpioja, Mikko Kurimo

DiPCo — Dinner Party Corpus
Maarten Van Segbroeck, Ahmed Zaid, Ksenia Kutsenko, Cirenia Huerta, Tinh Nguyen, Xuewen Luo, Björn Hoffmeister, Jan Trmal, Maurizio Omologo, Roland Maas

Learning to Detect Bipolar Disorder and Borderline Personality Disorder with Language and Speech in Non-Clinical Interviews
Bo Wang, Yue Wu, Niall Taylor, Terry Lyons, Maria Liakata, Alejo J. Nevado-Holgado, Kate E.A. Saunders

FT Speech: Danish Parliament Speech Corpus
Andreas Kirkedal, Marija Stepanović, Barbara Plank


Spoken Language Understanding I


End-to-End Neural Transformer Based Spoken Language Understanding
Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann

Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding
Chen Liu, Su Zhu, Zijian Zhao, Ruisheng Cao, Lu Chen, Kai Yu

Speech to Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces
Milind Rao, Anirudh Raju, Pranav Dheram, Bach Bui, Ariya Rastrow

Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning
Pavel Denisov, Ngoc Thang Vu

Context Dependent RNNLM for Automatic Transcription of Conversations
Srikanth Raj Chetupalli, Sriram Ganapathy

Improving End-to-End Speech-to-Intent Classification with Reptile
Yusheng Tian, Philip John Gorinski

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation
Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim

Towards an ASR Error Robust Spoken Language Understanding System
Weitong Ruan, Yaroslav Nechaev, Luoxin Chen, Chengwei Su, Imre Kiss

End-to-End Spoken Language Understanding Without Full Transcripts
Hong-Kwang J. Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, Luis Lastras

Are Neural Open-Domain Dialog Systems Robust to Speech Recognition Errors in the Dialog History? An Empirical Study
Karthik Gopalakrishnan, Behnam Hedayatnia, Longshaokan Wang, Yang Liu, Dilek Hakkani-Tür


Acoustic Scene Classification


Neural Architecture Search on Acoustic Scene Classification
Jixiang Li, Chuming Liang, Bo Zhang, Zhao Wang, Fei Xiang, Xiangxiang Chu

Acoustic Scene Classification Using Audio Tagging
Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Seung-bin Kim, Ha-Jin Yu

ATReSN-Net: Capturing Attentive Temporal Relations in Semantic Neighborhood for Acoustic Scene Classification
Liwen Zhang, Jiqing Han, Ziqiang Shi

Environment Sound Classification Using Multiple Feature Channels and Attention Based Deep Convolutional Neural Network
Jivitesh Sharma, Ole-Christoffer Granmo, Morten Goodwin

Acoustic Scene Analysis with Multi-Head Attention Networks
Weimin Wang, Weiran Wang, Ming Sun, Chao Wang

Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification
Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Chin-Hui Lee

An Acoustic Segment Model Based Segment Unit Selection Approach to Acoustic Scene Classification with Partial Utterances
Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Xue Bai, Jun Du, Chin-Hui Lee

Attention-Driven Projections for Soundscape Classification
Dhanunjaya Varma Devalraju, Muralikrishna H., Padmanabhan Rajan, Dileep Aroor Dinesh

Computer Audition for Continuous Rainforest Occupancy Monitoring: The Case of Bornean Gibbons’ Call Detection
Panagiotis Tzirakis, Alexander Shiarella, Robert Ewers, Björn W. Schuller

Deep Learning Based Open Set Acoustic Scene Classification
Zuzanna Kwiatkowska, Beniamin Kalinowski, Michał Kośmider, Krzysztof Rykaczewski


Speaker Recognition I


Improved RawNet with Feature Map Scaling for Text-Independent Speaker Verification Using Raw Waveforms
Jee-weon Jung, Seung-bin Kim, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu

Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Youngmoon Jung, Seong Min Kye, Yeunju Choi, Myunghun Jung, Hoirin Kim

An Adaptive X-Vector Model for Text-Independent Speaker Verification
Bin Gu, Wu Guo, Fenglin Ding, Zhen-Hua Ling, Jun Du

Shouted Speech Compensation for Speaker Verification Robust to Vocal Effort Conditions
Santi Prieto, Alfonso Ortega, Iván López-Espejo, Eduardo Lleida

Sum-Product Networks for Robust Automatic Speaker Identification
Aaron Nicolson, Kuldip K. Paliwal

Segment Aggregation for Short Utterances Speaker Verification Using Raw Waveforms
Seung-bin Kim, Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu

Siamese X-Vector Reconstruction for Domain Adapted Speaker Recognition
Shai Rozenberg, Hagai Aronowitz, Ron Hoory

Speaker Re-Identification with Speaker Dependent Speech Enhancement
Yanpei Shi, Qiang Huang, Thomas Hain

Blind Speech Signal Quality Estimation for Speaker Verification Systems
Galina Lavrentyeva, Marina Volkova, Anastasia Avdeeva, Sergey Novoselov, Artem Gorlanov, Tseren Andzhukaev, Artem Ivanov, Alexander Kozlov

Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification
Xu Li, Na Li, Jinghua Zhong, Xixin Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng


Speech Synthesis Paradigms and Methods I


Using Cyclic Noise as the Source Signal for Neural Source-Filter-Based Speech Waveform Model
Xin Wang, Junichi Yamagishi

Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization
Jen-Yu Liu, Yu-Hua Chen, Yin-Cheng Yeh, Yi-Hsuan Yang

Complex-Valued Variational Autoencoder: A Novel Deep Generative Model for Direct Representation of Complex Spectra
Toru Nakashika

Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding
Seungwoo Choi, Seungju Han, Dongyoung Kim, Sungjoo Ha

Reformer-TTS: Neural Speech Synthesis with Reformer Network
Hyeong Rae Ihm, Joun Yeop Lee, Byoung Jin Choi, Sung Jun Cheon, Nam Soo Kim

CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-Spectrogram Conversion
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency
Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis

DurIAN: Duration Informed Attention Network for Speech Synthesis
Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu

Multi-Speaker Text-to-Speech Synthesis Using Deep Gaussian Processes
Kentaro Mitsui, Tomoki Koriyama, Hiroshi Saruwatari

A Hybrid HMM-Waveglow Based Text-to-Speech Synthesizer Using Histogram Equalization for Low Resource Indian Languages
Mano Ranjith Kumar M., Sudhanshu Srivastava, Anusha Prakash, Hema A. Murthy


The INTERSPEECH 2020 Computational Paralinguistics ChallengE (ComParE)


The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks
Björn W. Schuller, Anton Batliner, Christian Bergler, Eva-Maria Messner, Antonia Hamilton, Shahin Amiriparian, Alice Baird, Georgios Rizos, Maximilian Schmitt, Lukas Stappen, Harald Baumeister, Alexis Deighton MacIntyre, Simone Hantke

Learning Higher Representations from Pre-Trained Deep Models with Data Augmentation for the COMPARE 2020 Challenge Mask Task
Tomoya Koike, Kun Qian, Björn W. Schuller, Yoshiharu Yamamoto

Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms
Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien

Surgical Mask Detection with Deep Recurrent Phonetic Models
Philipp Klumpp, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Paula Andrea Pérez-Toro, Florian Hönig, Elmar Nöth, Juan Rafael Orozco-Arroyave

Phonetic, Frame Clustering and Intelligibility Analyses for the INTERSPEECH 2020 ComParE Challenge
Claude Montacié, Marie-José Caraty

Exploring Text and Audio Embeddings for Multi-Dimension Elderly Emotion Recognition
Mariana Julião, Alberto Abad, Helena Moniz

Ensembling End-to-End Deep Models for Computational Paralinguistics Tasks: ComParE 2020 Mask and Breathing Sub-Challenges
Maxim Markitantov, Denis Dresvyanskiy, Danila Mamontov, Heysem Kaya, Wolfgang Minker, Alexey Karpov

Analyzing Breath Signals for the Interspeech 2020 ComParE Challenge
John Mendonça, Francisco Teixeira, Isabel Trancoso, Alberto Abad

Deep Attentive End-to-End Continuous Breath Sensing from Speech
Alexis Deighton MacIntyre, Georgios Rizos, Anton Batliner, Alice Baird, Shahin Amiriparian, Antonia Hamilton, Björn W. Schuller

Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion
Jeno Szep, Salim Hariri

Exploration of Acoustic and Lexical Cues for the INTERSPEECH 2020 Computational Paralinguistic Challenge
Ziqing Yang, Zifan An, Zehao Fan, Chengye Jing, Houwei Cao

Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition
Gizem Soğancıoğlu, Oxana Verkholyak, Heysem Kaya, Dmitrii Fedotov, Tobias Cadée, Albert Ali Salah, Alexey Karpov

Are you Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs
Nicolae-Cătălin Ristea, Radu Tudor Ionescu


Alzheimer’s Dementia Recognition Through Spontaneous Speech


Tackling the ADReSS Challenge: A Multimodal Approach to the Automated Recognition of Alzheimer’s Dementia
Matej Martinc, Senja Pollak

Disfluencies and Fine-Tuning Pre-Trained Language Models for Detection of Alzheimer’s Disease
Jiahong Yuan, Yuchen Bian, Xingyu Cai, Jiaji Huang, Zheng Ye, Kenneth Church

To BERT or not to BERT: Comparing Speech and Language-Based Approaches for Alzheimer’s Disease Detection
Aparna Balagopalan, Benjamin Eyre, Frank Rudzicz, Jekaterina Novikova

Alzheimer’s Dementia Recognition Through Spontaneous Speech: The ADReSS Challenge
Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, Brian MacWhinney

Using State of the Art Speaker Recognition and Natural Language Processing Technologies to Detect Alzheimer’s Disease and Assess its Severity
Raghavendra Pappagari, Jaejin Cho, Laureano Moro-Velázquez, Najim Dehak

A Comparison of Acoustic and Linguistics Methodologies for Alzheimer’s Dementia Recognition
Nicholas Cummins, Yilin Pan, Zhao Ren, Julian Fritsch, Venkata Srikanth Nallanthighal, Heidi Christensen, Daniel Blackburn, Björn W. Schuller, Mathew Magimai-Doss, Helmer Strik, Aki Härmä

Multi-Modal Fusion with Gating Using Audio, Lexical and Disfluency Features for Alzheimer’s Dementia Recognition from Spontaneous Speech
Morteza Rohanian, Julian Hough, Matthew Purver

Comparing Natural Language Processing Techniques for Alzheimer’s Dementia Prediction in Spontaneous Speech
Thomas Searle, Zina Ibrahim, Richard Dobson

Multiscale System for Alzheimer’s Dementia Recognition Through Spontaneous Speech
Erik Edwards, Charles Dognin, Bajibabu Bollepalli, Maneesh Singh

The INESC-ID Multi-Modal System for the ADReSS 2020 Challenge
Anna Pompili, Thomas Rolland, Alberto Abad

Exploring MMSE Score Prediction Using Verbal and Non-Verbal Cues
Shahla Farzana, Natalie Parde

Multimodal Inductive Transfer Learning for Detection of Alzheimer’s Dementia and its Severity
Utkarsh Sarawgi, Wazeer Zulfikar, Nouran Soliman, Pattie Maes

Exploiting Multi-Modal Features from Pre-Trained Networks for Alzheimer’s Dementia Recognition
Junghyun Koo, Jie Hwan Lee, Jaewoo Pyo, Yujin Jo, Kyogu Lee

Automated Screening for Alzheimer’s Dementia Through Spontaneous Speech
Muhammad Shehram Shah Syed, Zafi Sherhan Syed, Margaret Lech, Elena Pirogova



Voice and Hearing Disorders


The Implication of Sound Level on Spatial Selective Auditory Attention for Cochlear Implant Users: Behavioral and Electrophysiological Measurement
Sara Akbarzadeh, Sungmin Lee, Chin-Tuan Tan

Enhancing the Interaural Time Difference of Bilateral Cochlear Implants with the Temporal Limits Encoder
Yangyang Wan, Huali Zhou, Qinglin Meng, Nengheng Zheng

Speech Clarity Improvement by Vocal Self-Training Using a Hearing Impairment Simulator and its Correlation with an Auditory Modulation Index
Toshio Irino, Soichi Higashiyama, Hanako Yoshigi

Investigation of Phase Distortion on Perceived Speech Quality for Hearing-Impaired Listeners
Zhuohuang Zhang, Donald S. Williamson, Yi Shen

EEG-Based Short-Time Auditory Attention Detection Using Multi-Task Deep Learning
Zhuo Zhang, Gaoyan Zhang, Jianwu Dang, Shuang Wu, Di Zhou, Longbiao Wang

Towards Interpreting Deep Learning Models to Understand Loss of Speech Intelligibility in Speech Disorders — Step 1: CNN Model-Based Phone Classification
Sondes Abderrazek, Corinne Fredouille, Alain Ghio, Muriel Lalain, Christine Meunier, Virginie Woisard

Improving Cognitive Impairment Classification by Generative Neural Network-Based Feature Augmentation
Bahman Mirheidari, Daniel Blackburn, Ronan O’Malley, Annalena Venneri, Traci Walker, Markus Reuber, Heidi Christensen

UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech
Meredith Moore, Piyush Papreja, Michael Saxon, Visar Berisha, Sethuraman Panchanathan

Towards Automatic Assessment of Voice Disorders: A Clinical Approach
Purva Barche, Krishna Gurugubelli, Anil Kumar Vuppala

BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages
Abhishek Shivkumar, Jack Weston, Raphael Lenain, Emil Fristed



Training Strategies for ASR


Semi-Supervised ASR by End-to-End Self-Training
Yang Chen, Weiran Wang, Chao Wang

Improved Training Strategies for End-to-End Speech Recognition in Digital Voice Assistants
Hitesh Tulsiani, Ashtosh Sapru, Harish Arsikere, Surabhi Punjabi, Sri Garimella

Serialized Output Training for End-to-End Overlapped Speech Recognition
Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

Semi-Supervised Learning with Data Augmentation for End-to-End ASR
Felix Weninger, Franco Mana, Roberto Gemello, Jesús Andrés-Ferrer, Puming Zhan

Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition
Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, Roland Maas

A New Training Pipeline for an Improved Neural Transducer
Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney

Improved Noisy Student Training for Automatic Speech Recognition
Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, Quoc V. Le

Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition
Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition
Dhananjaya Gowda, Ankur Kumar, Kwangyoun Kim, Hejung Yang, Abhinav Garg, Sachin Singh, Jiyeon Kim, Mehul Kumar, Sichen Jin, Shatrughan Singh, Chanwoo Kim

SCADA: Stochastic, Consistent and Adversarial Data Augmentation to Improve ASR
Gary Wang, Andrew Rosenberg, Zhehuai Chen, Yu Zhang, Bhuvana Ramabhadran, Pedro J. Moreno


Bioacoustics and Articulation


Transfer Learning of Articulatory Information Through Phone Information
Abdolreza Sabzi Shahrebabaki, Negar Olfati, Sabato Marco Siniscalchi, Giampiero Salvi, Torbjørn Svendsen

Sequence-to-Sequence Articulatory Inversion Through Time Convolution of Sub-Band Frequency Signals
Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Giampiero Salvi, Torbjørn Svendsen

Discriminative Singular Spectrum Analysis for Bioacoustic Classification
Bernardo B. Gatto, Eulanda M. dos Santos, Juan G. Colonna, Naoya Sogi, Lincon S. Souza, Kazuhiro Fukui

Speech Rate Task-Specific Representation Learning from Acoustic-Articulatory Data
Renuka Mannem, Hima Jyothi R., Aravind Illa, Prasanta Kumar Ghosh

Dysarthria Detection and Severity Assessment Using Rhythm-Based Metrics
Abner Hernandez, Eun Jung Yeo, Sunhee Kim, Minhwa Chung

LungRN+NL: An Improved Adventitious Lung Sound Classification Using Non-Local Block ResNet Neural Network with Mixup Data Augmentation
Yi Ma, Xinzi Xu, Yongfu Li

Attention and Encoder-Decoder Based Models for Transforming Articulatory Movements at Different Speaking Rates
Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

Adventitious Respiratory Classification Using Attentive Residual Neural Networks
Zijiang Yang, Shuo Liu, Meishu Song, Emilia Parada-Cabaleiro, Björn W. Schuller

Surfboard: Audio Feature Extraction for Modern Machine Learning
Raphael Lenain, Jack Weston, Abhishek Shivkumar, Emil Fristed

Whisper Activity Detection Using CNN-LSTM Based Attention Pooling Network Trained for a Speaker Identification Task
Abinay Reddy Naini, Malla Satyapriya, Prasanta Kumar Ghosh


Speech Synthesis: Multilingual and Cross-Lingual Approaches


Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion
Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma

Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment
Zhaoyu Liu, Brian Mak

Dynamic Soft Windowing and Language Dependent Style Token for Code-Switching End-to-End Speech Synthesis
Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Chunyu Qiang, Tao Wang

Phonological Features for 0-Shot Multilingual Speech Synthesis
Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S. Ram Mohan, Lorenzo Foglianti, Raphael Lenain, Jiameng Gao

Cross-Lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space
Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

Tone Learning in Low-Resource Bilingual TTS
Ruolan Liu, Xue Wen, Chunhui Lu, Xiao Chen

On Improving Code Mixed Speech Synthesis with Mixlingual Grapheme-to-Phoneme Model
Shubham Bansal, Arijit Mukherjee, Sandeepkumar Satpal, Rupeshkumar Mehta

Generic Indic Text-to-Speech Synthesisers with Rapid Adaptation in an End-to-End Framework
Anusha Prakash, Hema A. Murthy

Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling
Marcel de Korte, Jaebok Kim, Esther Klabbers

One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech
Tomáš Nekvinda, Ondřej Dušek


Computational Resource Constrained Speech Recognition


Accurate Detection of Wake Word Start and End Using a CNN
Christin Jose, Yuriy Mishchenko, Thibaud Sénéchal, Anish Shah, Alex Escott, Shiv Naga Prasad Vitaladevuni

Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering
Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir

MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition
Somshubra Majumdar, Boris Ginsburg

Iterative Compression of End-to-End ASR Model Using AutoML
Abhinav Mehrotra, Łukasz Dudziak, Jinsu Yeo, Young-yoon Lee, Ravichander Vipperla, Mohamed S. Abdelfattah, Sourav Bhattacharya, Samin Ishtiaq, Alberto Gil C.P. Ramos, SangJeong Lee, Daehyun Kim, Nicholas D. Lane

Quantization Aware Training with Absolute-Cosine Regularization for Automatic Speech Recognition
Hieu Duy Nguyen, Anastasios Alexandridis, Athanasios Mouchtaris

Streaming On-Device End-to-End ASR System for Privacy-Sensitive Voice-Typing
Abhinav Garg, Gowtham P. Vadisetti, Dhananjaya Gowda, Sichen Jin, Aditya Jayasimha, Youngho Han, Jiyeon Kim, Junmo Park, Kwangyoun Kim, Sooyeon Kim, Young-yoon Lee, Kyungbo Min, Chanwoo Kim

Scaling Up Online Speech Recognition Using ConvNets
Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition
Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang

Rescore in a Flash: Compact, Cache Efficient Hashing Data Structures for n-Gram Language Models
Grant P. Strimel, Ariya Rastrow, Gautam Tiwari, Adrien Piérard, Jon Webb


Speech Synthesis: Prosody and Emotion


Multi-Speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network
Ravi Shankar, Hsi-Wei Hsieh, Nicolas Charon, Archana Venkataraman

Non-Parallel Emotion Conversion Using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator
Ravi Shankar, Jacob Sager, Archana Venkataraman

Laughter Synthesis: Combining Seq2seq Modeling with Transfer Learning
Noé Tits, Kevin El Haddad, Thierry Dutoit

Nonparallel Emotional Speech Conversion Using VAE-GAN
Yuexin Cao, Zhengchen Liu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao

Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS
Alexander Sorin, Slava Shechtman, Ron Hoory

Converting Anyone’s Emotion: Towards Speaker-Independent Emotional Voice Conversion
Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li

Controlling the Strength of Emotions in Speech-Like Emotional Sound Generated by WaveNet
Kento Matsumoto, Sunao Hara, Masanobu Abe

Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation
Guangyan Zhang, Ying Qin, Tan Lee

Simultaneous Conversion of Speaker Identity and Emotion Based on Multiple-Domain Adaptive RBM
Takuya Kishida, Shin Tsukamoto, Toru Nakashika

Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis
Fengyu Yang, Shan Yang, Qinghua Wu, Yujun Wang, Lei Xie

Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis
Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

GAN-Based Data Generation for Speech Emotion Recognition
Sefik Emre Eskimez, Dimitrios Dimitriadis, Robert Gmyr, Kenichi Kumanati

The Phonetic Bases of Vocal Expressed Emotion: Natural versus Acted
Hira Dhamyal, Shahan Ali Memon, Bhiksha Raj, Rita Singh


Multimodal Speech Processing


FaceFilter: Audio-Visual Speech Separation Using Still Images
Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang

Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision
Soo-Whan Chung, Hong-Goo Kang, Joon Son Chung

Fusion Architectures for Word-Based Audiovisual Speech Recognition
Michael Wand, Jürgen Schmidhuber

Audio-Visual Multi-Channel Recognition of Overlapped Speech
Jianwei Yu, Bo Wu, Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Dong Yu, Xunying Liu, Helen Meng

TMT: A Transformer-Based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-Aware Dialog
Wubo Li, Dongwei Jiang, Wei Zou, Xiangang Li

Should We Hard-Code the Recurrence Concept or Learn It Instead? Exploring the Transformer Architecture for Audio-Visual Speech Recognition
George Sterpu, Christian Saam, Naomi Harte

Resource-Adaptive Deep Learning for Visual Speech Recognition
Alexandros Koumparoulis, Gerasimos Potamianos, Samuel Thomas, Edmilson da Silva Morais

Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks
Masood S. Mortazavi

Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion
Hong Liu, Zhan Chen, Bing Yang

Caption Alignment for Low Resource Audio-Visual Data
Vighnesh Reddy Konda, Mayur Warialani, Rakesh Prasanth Achari, Varad Bhatnagar, Jayaprakash Akula, Preethi Jyothi, Ganesh Ramakrishnan, Gholamreza Haffari, Pankaj Singh


Speech Synthesis: Neural Waveform Generation II


Vocoder-Based Speech Synthesis from Silent Videos
Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen

Quasi-Periodic Parallel WaveGAN Vocoder: A Non-Autoregressive Pitch-Dependent Dilated Convolution Model for Parametric Speech Generation
Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

A Cyclical Post-Filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-Speech Systems
Yi-Chiao Wu, Patrick Lumban Tobing, Kazuki Yasuhara, Noriyuki Matsunaga, Yamato Ohtani, Tomoki Toda

Audio Dequantization for High Fidelity Audio Generation in Flow-Based Neural Vocoder
Hyun-Wook Yoon, Sang-Hoon Lee, Hyeong-Rae Noh, Seong-Whan Lee

StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes
Manish Sharma, Tom Kenter, Rob Clark

An Efficient Subband Linear Prediction for LPCNet-Based Neural Synthesis
Yang Cui, Xi Wang, Lei He, Frank K. Soong

Reverberation Modeling for Source-Filter-Based Neural Vocoder
Yang Ai, Xin Wang, Junichi Yamagishi, Zhen-Hua Ling

Bunched LPCNet: Vocoder for Low-Cost Neural Text-To-Speech Systems
Ravichander Vipperla, Sangjun Park, Kihyun Choo, Samin Ishtiaq, Kyoungbo Min, Sourav Bhattacharya, Abhinav Mehrotra, Alberto Gil C.P. Ramos, Nicholas D. Lane

Neural Text-to-Speech with a Modeling-by-Generation Excitation Vocoder
Eunwoo Song, Min-Jae Hwang, Ryuichi Yamamoto, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim

SpeedySpeech: Efficient Neural Speech Synthesis
Jan Vainer, Ondřej Dušek


Speech Synthesis: Toward End-to-End Synthesis


From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint
Zexin Cai, Chuxiong Zhang, Ming Li

Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?
Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi

Non-Autoregressive End-to-End TTS with Coarse-to-Fine Decoding
Tao Wang, Xuefei Liu, Jianhua Tao, Jiangyan Yi, Ruibo Fu, Zhengqi Wen

Bi-Level Speaker Supervision for One-Shot Speech Synthesis
Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Chunyu Qiang

Naturalness Enhancement with Linguistic Information in End-to-End TTS Using Unsupervised Parallel Encoding
Alex Peiró-Lilja, Mireia Farrús

MoBoAligner: A Neural Alignment Model for Non-Autoregressive TTS with Monotonic Boundary Search
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou

JDI-T: Jointly Trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment
Dan Lim, Won Jang, Gyeonghwan O, Heayoung Park, Bongwan Kim, Jaesam Yoon

End-to-End Text-to-Speech Synthesis with Unaligned Multiple Language Units Based on Attention
Masashi Aso, Shinnosuke Takamichi, Hiroshi Saruwatari

Attention Forcing for Speech Synthesis
Qingyun Dou, Joshua Efiong, Mark J.F. Gales

Testing the Limits of Representation Mixing for Pronunciation Correction in End-to-End Speech Synthesis
Jason Fong, Jason Taylor, Simon King

MultiSpeech: Multi-Speaker Text to Speech with Transformer
Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin


Speech Enhancement, Bandwidth Extension and Hearing Aids


Exploiting Conic Affinity Measures to Design Speech Enhancement Systems Operating in Unseen Noise Conditions
Pavlos Papadopoulos, Shrikanth Narayanan

Adversarial Dictionary Learning for Monaural Speech Enhancement
Yunyun Ji, Longting Xu, Wei-Ping Zhu

Semi-Supervised Self-Produced Speech Enhancement and Suppression Based on Joint Source Modeling of Air- and Body-Conducted Signals Using Variational Autoencoder
Shogo Seki, Moe Takada, Tomoki Toda

Spatial Covariance Matrix Estimation for Reverberant Speech with Application to Speech Enhancement
Ran Weisman, Vladimir Tourbabin, Paul Calamia, Boaz Rafaely

A Cross-Channel Attention-Based Wave-U-Net for Multi-Channel Speech Enhancement
Minh Tri Ho, Jinyoung Lee, Bong-Ki Lee, Dong Hoon Yi, Hong-Goo Kang

TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids
Igor Fedorov, Marko Stamenovic, Carl Jensen, Li-Chia Yang, Ari Mandell, Yiming Gan, Matthew Mattina, Paul N. Whatmough

Intelligibility Enhancement Based on Speech Waveform Modification Using Hearing Impairment
Shu Hikosaka, Shogo Seki, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, Hideki Banno, Tomoki Toda

Speaker and Phoneme-Aware Speech Bandwidth Extension with Residual Dual-Path Network
Nana Hou, Chenglin Xu, Van Tung Pham, Joey Tianyi Zhou, Eng Siong Chng, Haizhou Li

Multi-Task Learning for End-to-End Noise-Robust Bandwidth Extension
Nana Hou, Chenglin Xu, Joey Tianyi Zhou, Eng Siong Chng, Haizhou Li

Phase-Aware Music Super-Resolution Using Generative Adversarial Networks
Shichao Hu, Bin Zhang, Beici Liang, Ethan Zhao, Simon Lui






Summarization, Semantic Analysis and Classification


Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks
Krishna D. N., Ankita Patil

Abstractive Spoken Document Summarization Using Hierarchical Model with Multi-Stage Attention Diversity Optimization
Potsawee Manakul, Mark J.F. Gales, Linlin Wang

Improved Learning of Word Embeddings with Word Definitions and Semantic Injection
Yichi Zhang, Yinpei Dai, Zhijian Ou, Huixin Wang, Junlan Feng

Wake Word Detection with Alignment-Free Lattice-Free MMI
Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models
Thai Binh Nguyen, Quang Minh Nguyen, Thi Thu Hien Nguyen, Quoc Truong Do, Chi Mai Luong

End-to-End Named Entity Recognition from English Speech
Hemant Yadav, Sreyan Ghosh, Yi Yu, Rajiv Ratn Shah

Semantic Complexity in End-to-End Spoken Language Understanding
Joseph P. McKenna, Samridhi Choudhary, Michael Saxon, Grant P. Strimel, Athanasios Mouchtaris

Analysis of Disfluency in Children’s Speech
Trang Tran, Morgan Tinkler, Gary Yeung, Abeer Alwan, Mari Ostendorf

Representation Based Meta-Learning for Few-Shot Spoken Intent Recognition
Ashish Mittal, Samarth Bharadwaj, Shreya Khare, Saneem Chemmengath, Karthik Sankaranarayanan, Brian Kingsbury

Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation
Rishika Agarwal, Xiaochuan Niu, Pranay Dighe, Srikanth Vishnubhotla, Sameer Badaskar, Devang Naik



General Topics in Speech Recognition


State Sequence Pooling Training of Acoustic Models for Keyword Spotting
Kuba Łopatka, Tobias Bocklet

Training Keyword Spotting Models on Non-IID Data with Federated Learning
Andrew Hard, Kurt Partridge, Cameron Nguyen, Niranjan Subrahmanya, Aishanee Shah, Pai Zhu, Ignacio Lopez Moreno, Rajiv Mathews

Class LM and Word Mapping for Contextual Biasing in End-to-End ASR
Rongqing Huang, Ossama Abdel-hamid, Xinwei Li, Gunnar Evermann

Do End-to-End Speech Recognition Models Care About Context?
Lasse Borgholt, Jakob D. Havtorn, Željko Agić, Anders Søgaard, Lars Maaløe, Christian Igel

Utterance Confidence Measure for End-to-End Speech Recognition with Applications to Distributed Speech Recognition Scenarios
Ankur Kumar, Sachin Singh, Dhananjaya Gowda, Abhinav Garg, Shatrughan Singh, Chanwoo Kim

Speaker Code Based Speaker Adaptive Training Using Model Agnostic Meta-Learning
Huaxin Wu, Genshun Wan, Jia Pan

Domain Adaptation Using Class Similarity for Robust Speech Recognition
Han Zhu, Jiangjiang Zhao, Yuling Ren, Li Wang, Pengyuan Zhang

Incremental Machine Speech Chain Towards Enabling Listening While Speaking in Real-Time
Sashi Novitasari, Andros Tjandra, Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura

Context-Dependent Acoustic Modeling Without Explicit Phone Clustering
Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney

Voice Conversion Based Data Augmentation to Improve Children’s Speech Recognition in Limited Data Scenario
S. Shahnawazuddin, Nagaraj Adiga, Kunal Kumar, Aayushi Poddar, Waquar Ahmad


Speech Synthesis: Prosody Modeling


CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
Sri Karlapati, Alexis Moinet, Arnaud Joly, Viacheslav Klimkov, Daniel Sáez-Trigueros, Thomas Drugman

Joint Detection of Sentence Stress and Phrase Boundary for Prosody
Binghuai Lin, Liyuan Wang, Xiaoli Feng, Jinsong Zhang

Transfer Learning of the Expressivity Using FLOW Metric Learning in Multispeaker Text-to-Speech Synthesis
Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet

Speaking Speed Control of End-to-End Speech Synthesis Using Sentence-Level Conditioning
Jae-Sung Bae, Hanbin Bae, Young-Sun Joo, Junmo Lee, Gyeong-Hoon Lee, Hoon-Young Cho

Dynamic Prosody Generation for Speech Synthesis Using Linguistics-Driven Acoustic Embedding Selection
Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba

Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model
Tom Kenter, Manish Sharma, Rob Clark

Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction
Yi Zhao, Haoyu Li, Cheng-I Lai, Jennifer Williams, Erica Cooper, Junichi Yamagishi

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit
Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao

Discriminative Method to Extract Coarse Prosodic Structure and its Application for Statistical Phrase/Accent Command Estimation
Yuma Shirahata, Daisuke Saito, Nobuaki Minematsu

Controllable Neural Text-to-Speech Synthesis Using Intuitive Prosodic Features
Tuomo Raitio, Ramya Rasipuram, Dan Castellani

Controllable Neural Prosody Synthesis
Max Morrison, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore

Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency
Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song

Interactive Text-to-Speech System via Joint Style Analysis
Yang Gao, Weiyi Zheng, Zhaojun Yang, Thilo Köhler, Christian Fuegen, Qing He




Speech in Health II


Squeeze for Sneeze: Compact Neural Networks for Cold and Flu Recognition
Merlin Albes, Zhao Ren, Björn W. Schuller, Nicholas Cummins

Extended Study on the Use of Vocal Tract Variables to Quantify Neuromotor Coordination in Depression
Nadee Seneviratne, James R. Williamson, Adam C. Lammert, Thomas F. Quatieri, Carol Espy-Wilson

Affective Conditioning on Hierarchical Attention Networks Applied to Depression Detection from Transcribed Clinical Interviews
Danai Xezonaki, Georgios Paraskevopoulos, Alexandros Potamianos, Shrikanth Narayanan

Domain Adaptation for Enhancing Speech-Based Depression Detection in Natural Environmental Conditions Using Dilated CNNs
Zhaocheng Huang, Julien Epps, Dale Joachim, Brian Stasak, James R. Williamson, Thomas F. Quatieri

Making a Distinction Between Schizophrenia and Bipolar Disorder Based on Temporal Parameters in Spontaneous Speech
Gábor Gosztolya, Anita Bagi, Szilvia Szalóki, István Szendi, Ildikó Hoffmann

Prediction of Sleepiness Ratings from Voice by Man and Machine
Mark Huckvale, András Beke, Mirei Ikushima

Tongue and Lip Motion Patterns in Alaryngeal Speech
Kristin J. Teplansky, Alan Wisler, Beiming Cao, Wendy Liang, Chad W. Whited, Ted Mau, Jun Wang

Autoencoder Bottleneck Features with Multi-Task Optimisation for Improved Continuous Dysarthric Speech Recognition
Zhengjun Yue, Heidi Christensen, Jon Barker

Raw Speech Waveform Based Classification of Patients with ALS, Parkinson’s Disease and Healthy Controls Using CNN-BLSTM
Jhansi Mallela, Aravind Illa, Yamini Belur, Nalini Atchayaram, Ravi Yadav, Pradeep Reddy, Dipanjan Gope, Prasanta Kumar Ghosh

Assessment of Parkinson’s Disease Medication State Through Automatic Speech Analysis
Anna Pompili, Rubén Solera-Ureña, Alberto Abad, Rita Cardoso, Isabel Guimarães, Margherita Fabbri, Isabel P. Martins, Joaquim Ferreira




Voice Conversion and Adaptation II


Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining
Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

Nonparallel Training of Exemplar-Based Voice Conversion System Using INCA-Based Alignment Technique
Hitoshi Suda, Gaku Kotani, Daisuke Saito

Enhancing Intelligibility of Dysarthric Speech Using Gated Convolutional-Based Voice Conversion System
Chen-Yu Chen, Wei-Zhong Zheng, Syu-Siang Wang, Yu Tsao, Pei-Chun Li, Ying-Hui Lai

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture
Da-Yi Wu, Yen-Hao Chen, Hung-yi Lee

Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion Without Parallel Data
Seung-won Park, Doo-young Kim, Myun-chul Joe

Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis
Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Tao Wang, Chunyu Qiang

ARVC: An Auto-Regressive Voice Conversion System Without Parallel Training Data
Zheng Lian, Zhengqi Wen, Xinyong Zhou, Songbai Pu, Shengkai Zhang, Jianhua Tao

Improved Zero-Shot Voice Conversion Using Explicit Conditioning Signals
Shahan Nercessian

Non-Parallel Voice Conversion with Fewer Labeled Data by Conditional Generative Adversarial Networks
Minchuan Chen, Weijian Hou, Jun Ma, Shaojun Wang, Jing Xiao

Transferring Source Style in Non-Parallel Voice Conversion
Songxiang Liu, Yuewen Cao, Shiyin Kang, Na Hu, Xunying Liu, Dan Su, Dong Yu, Helen Meng

Voice Conversion Using Speech-to-Speech Neuro-Style Transfer
Ehab A. AlBadawy, Siwei Lyu


Multilingual and Code-Switched ASR


Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation
Changhan Wang, Juan Pino, Jiatao Gu

Transliteration Based Data Augmentation for Training Multilingual ASR Acoustic Models in Low Resource Settings
Samuel Thomas, Kartik Audhkhasi, Brian Kingsbury

Multilingual Speech Recognition with Self-Attention Structured Parameterization
Yun Zhu, Parisa Haghani, Anshuman Tripathi, Bhuvana Ramabhadran, Brian Farris, Hainan Xu, Han Lu, Hasim Sak, Isabel Leal, Neeraj Gaur, Pedro J. Moreno, Qian Zhang

Lattice-Free Maximum Mutual Information Training of Multilingual Speech Recognition Systems
Srikanth Madikeri, Banriskhem K. Khonglah, Sibo Tong, Petr Motlicek, Hervé Bourlard, Daniel Povey

Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

Multilingual Speech Recognition Using Language-Specific Phoneme Recognition as Auxiliary Task for Indian Languages
Hardik B. Sailor, Thomas Hain

Style Variation as a Vantage Point for Code-Switching
Khyathi Raghavi Chandu, Alan W. Black

Bi-Encoder Transformer Network for Mandarin-English Code-Switching Speech Recognition Using Mixture of Experts
Yizhou Lu, Mingkun Huang, Hao Li, Jiaqi Guo, Yanmin Qian

Improving Low Resource Code-Switched ASR Using Augmented Code-Switched TTS
Yash Sharma, Basil Abraham, Karan Taneja, Preethi Jyothi

Towards Context-Aware End-to-End Code-Switching Speech Recognition
Zimeng Qiu, Yiyuan Li, Xinjian Li, Florian Metze, William M. Campbell


Speech and Voice Disorders


Increasing the Intelligibility and Naturalness of Alaryngeal Speech Using Voice Conversion and Synthetic Fundamental Frequency
Tuan Dinh, Alexander Kain, Robin Samlan, Beiming Cao, Jun Wang

Automatic Assessment of Dysarthric Severity Level Using Audio-Video Cross-Modal Approach in Deep Learning
Han Tong, Hamid Sharifzadeh, Ian McLoughlin

Staged Knowledge Distillation for End-to-End Dysarthric Speech Recognition and Speech Attribute Transcription
Yuqin Lin, Longbiao Wang, Sheng Li, Jianwu Dang, Chenchen Ding

Dysarthric Speech Recognition Based on Deep Metric Learning
Yuki Takashima, Ryoichi Takashima, Tetsuya Takiguchi, Yasuo Ariki

Automatic Glottis Detection and Segmentation in Stroboscopic Videos Using Convolutional Networks
Divya Degala, Achuth Rao M.V., Rahul Krishnamurthy, Pebbili Gopikishore, Veeramani Priyadharshini, Prakash T.K., Prasanta Kumar Ghosh

Acoustic Feature Extraction with Interpretable Deep Neural Network for Neurodegenerative Related Disorder Classification
Yilin Pan, Bahman Mirheidari, Zehai Tu, Ronan O’Malley, Traci Walker, Annalena Venneri, Markus Reuber, Daniel Blackburn, Heidi Christensen

Coswara — A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis
Neeraj Sharma, Prashant Krishnan, Rohit Kumar, Shreyas Ramoji, Srikanth Raj Chetupalli, Nirmala R., Prasanta Kumar Ghosh, Sriram Ganapathy

Acoustic-Based Articulatory Phenotypes of Amyotrophic Lateral Sclerosis and Parkinson’s Disease: Towards an Interpretable, Hypothesis-Driven Framework of Motor Control
Hannah P. Rowe, Sarah E. Gutz, Marc F. Maffei, Jordan R. Green

Recognising Emotions in Dysarthric Speech Using Typical Speech Data
Lubna Alhinti, Stuart Cunningham, Heidi Christensen

Detecting and Analysing Spontaneous Oral Cancer Speech in the Wild
Bence Mark Halpern, Rob van Son, Michiel van den Brekel, Odette Scharenborg


The Zero Resource Speech Challenge 2020


The Zero Resource Speech Challenge 2020: Discovering Discrete Subword and Word Units
Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux

Vector-Quantized Neural Networks for Acoustic Unit Discovery in the ZeroSpeech 2020 Challenge
Benjamin van Niekerk, Leanne Nortje, Herman Kamper

Exploration of End-to-End Synthesisers for Zero Resource Speech Challenge 2020
Karthik Pandia D.S., Anusha Prakash, Mano Ranjith Kumar M., Hema A. Murthy

Vector Quantized Temporally-Aware Correspondence Sparse Autoencoders for Zero-Resource Acoustic Unit Discovery
Batuhan Gundogdu, Bolaji Yusuf, Mansur Yesilbursa, Murat Saraclar

Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis: ZeroSpeech 2020 Challenge
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

Exploring TTS Without T Using Biologically/Psychologically Motivated Neural Network Modules (ZeroSpeech 2020)
Takashi Morita, Hiroki Koda

Cyclic Spectral Modeling for Unsupervised Unit Discovery into Voice Conversion with Excitation and Waveform Modeling
Patrick Lumban Tobing, Tomoki Hayashi, Yi-Chiao Wu, Kazuhiro Kobayashi, Tomoki Toda

Unsupervised Acoustic Unit Representation Learning for Voice Conversion Using WaveNet Auto-Encoders
Mingjie Chen, Thomas Hain

Unsupervised Discovery of Recurring Speech Patterns Using Probabilistic Adaptive Metrics
Okko Räsänen, María Andrea Cruz Blandón

Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery
Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Najim Dehak

Perceptimatic: A Human Speech Perception Benchmark for Unsupervised Subword Modelling
Juliette Millet, Ewan Dunbar


Neural Signals for Spoken Communication


Decoding Imagined, Heard, and Spoken Speech: Classification and Regression of EEG Using a 14-Channel Dry-Contact Mobile Headset
Jonathan Clayton, Scott Wellington, Cassia Valentini-Botinhao, Oliver Watts

Glottal Closure Instants Detection from EGG Signal by Classification Approach
Gurunath Reddy M., K. Sreenivasa Rao, Partha Pratim Das

Classify Imaginary Mandarin Tones with Cortical EEG Signals
Hua Li, Fei Chen



Speech in Health I


An Early Study on Intelligent Analysis of Speech Under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety
Jing Han, Kun Qian, Meishu Song, Zijiang Yang, Zhao Ren, Shuo Liu, Juan Liu, Huaiyuan Zheng, Wei Ji, Tomoya Koike, Xiao Li, Zixing Zhang, Yoshiharu Yamamoto, Björn W. Schuller

An Evaluation of the Effect of Anxiety on Speech — Computational Prediction of Anxiety from Sustained Vowels
Alice Baird, Nicholas Cummins, Sebastian Schnieder, Jarek Krajewski, Björn W. Schuller

Hybrid Network Feature Extraction for Depression Assessment from Speech
Ziping Zhao, Qifei Li, Nicholas Cummins, Bin Liu, Haishuai Wang, Jianhua Tao, Björn W. Schuller

Improving Detection of Alzheimer’s Disease Using Automatic Speech Recognition to Identify High-Quality Segments for More Robust Feature Extraction
Yilin Pan, Bahman Mirheidari, Markus Reuber, Annalena Venneri, Daniel Blackburn, Heidi Christensen

Classification of Manifest Huntington Disease Using Vowel Distortion Measures
Amrit Romana, John Bandon, Noelle Carlozzi, Angela Roberts, Emily Mower Provost

Parkinson’s Disease Detection from Speech Using Single Frequency Filtering Cepstral Coefficients
Sudarsana Reddy Kadiri, Rashmi Kethireddy, Paavo Alku

Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer
Sebastião Quintas, Julie Mauclair, Virginie Woisard, Julien Pinquier

Spectral Moment and Duration of Burst of Plosives in Speech of Children with Hearing Impairment and Typically Developing Children — A Comparative Study
Ajish K. Abraham, M. Pushpavathi, N. Sreedevi, A. Navya, C.M. Vikram, S.R. Mahadeva Prasanna

Aphasic Speech Recognition Using a Mixture of Speech Intelligibility Experts
Matthew Perez, Zakaria Aldeneh, Emily Mower Provost

Automatic Discrimination of Apraxia of Speech and Dysarthria Using a Minimalistic Set of Handcrafted Features
Ina Kodrasi, Michaela Pernon, Marina Laganaro, Hervé Bourlard


ASR Neural Network Architectures II — Transformers


Weak-Attention Suppression for Transformer Based Speech Recognition
Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer

Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition
Wenyong Huang, Wenchao Hu, Yu Ting Yeung, Xiao Chen

Improving Transformer-Based Speech Recognition with Unsupervised Pre-Training and Multi-Task Semantic Knowledge Learning
Song Li, Lin Li, Qingyang Hong, Lingling Liu

Transformer-Based Long-Context End-to-End Speech Recognition
Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-Based LVCSR
Xinyuan Zhou, Grandee Lee, Emre Yılmaz, Yanhua Long, Jiaen Liang, Haizhou Li

Universal Speech Transformer
Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition
Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen

Cross Attention with Monotonic Alignment for Speech Transformer
Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

Conformer: Convolution-augmented Transformer for Speech Recognition
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang

Exploring Transformers for Large-Scale Speech Recognition
Liang Lu, Changliang Liu, Jinyu Li, Yifan Gong


