
Interspeech 2022

Incheon, Korea
18-22 September 2022

Chairs: Hanseok Ko and John H. L. Hansen
doi: 10.21437/Interspeech.2022






Dereverberation, Noise Reduction, and Speaker Extraction


Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion
Tuan Vu Ho, Maori Kobayashi, Masato Akagi

Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement
Tuan Vu Ho, Quoc Huy Nguyen, Masato Akagi, Masashi Unoki

iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement
Minseung Kim, Hyungchan Song, Sein Cheong, Jong Won Shin

Boosting Self-Supervised Embeddings for Speech Enhancement
Kuo-Hsuan Hung, Szu-wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao, Chii-Wann Lin

Monoaural Speech Enhancement Using a Nested U-Net with Two-Level Skip Connections
Seorim Hwang, Youngcheol Park, Sungwook Park

CycleGAN-based Unpaired Speech Dereverberation
Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey

Attentive Training: A New Training Framework for Talker-independent Speaker Extraction
Ashutosh Pandey, DeLiang Wang

Improved Modulation-Domain Loss for Neural-Network-based Speech Enhancement
Tyler Vuong, Richard Stern

Perceptual Characteristics Based Multi-objective Model for Speech Enhancement
Chiang-Jen Peng, Yun-Ju Chan, Yih-Liang Shen, Cheng Yu, Yu Tsao, Tai-Shih Chi

Listen only to me! How well can target speech extraction handle false alarms?
Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolikova, Hiroshi Sato, Tomohiro Nakatani

Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction
Hao Shi, Longbiao Wang, Sheng Li, Jianwu Dang, Tatsuya Kawahara

Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments
Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann



Embedding and Network Architecture for Speaker Recognition


Reliability criterion based on learning-phase entropy for speaker recognition with neural network
Pierre-Michel Bousquet, Mickael Rouvier, Jean-Francois Bonastre

Attentive Feature Fusion for Robust Speaker Verification
Bei Liu, Zhengyang Chen, Yanmin Qian

Dual Path Embedding Learning for Speaker Verification with Triplet Attention
Bei Liu, Zhengyang Chen, Yanmin Qian

DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design
Bei Liu, Zhengyang Chen, Shuai Wang, Haoyu Wang, Bing Han, Yanmin Qian

Adaptive Rectangle Loss for Speaker Verification
Li Ruida, Fang Shuo, Ma Chenguang, Li Liang

MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification
Yang Zhang, Zhiqiang Lv, Haibin Wu, Shanshan Zhang, Pengfei Hu, Zhiyong Wu, Hung-yi Lee, Helen Meng

Enroll-Aware Attentive Statistics Pooling for Target Speaker Verification
Leying Zhang, Zhengyang Chen, Yanmin Qian

Transport-Oriented Feature Aggregation for Speaker Embedding Learning
Yusheng Tian, Jingyu Li, Tan Lee

Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning
Mufan Sang, John H.L. Hansen

CS-CTCSCONV1D: Small footprint speaker verification with channel split time-channel-time separable 1-dimensional convolution
Linjun Cai, Yuhong Yang, Xufeng Chen, Weiping Tu, Hongyang Chen

Reliable Visualization for Deep Speaker Recognition
Pengqi Li, Lantian Li, Askar Hamdulla, Dong Wang

Unifying Cosine and PLDA Back-ends for Speaker Verification
Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee, Guanglu Wan

CTFALite: Lightweight Channel-specific Temporal and Frequency Attention Mechanism for Enhancing the Speaker Embedding Extractor
Yuheng Wei, Junzhao Du, Hui Liu, Qian Wang


Speech Representation II


SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech
Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, Lan Du

VoiceLab: Software for Fully Reproducible Automated Voice Analysis
David Feinberg

TRILLsson: Distilled Universal Paralinguistic Speech Representations
Joel Shor, Subhashini Venugopalan

Global Signal-to-noise Ratio Estimation Based on Multi-subband Processing Using Convolutional Neural Network
Nan Li, Meng Ge, Longbiao Wang, Masashi Unoki, Sheng Li, Jianwu Dang

A Sparsity-promoting Dictionary Model for Variational Autoencoders
Mostafa Sadeghi, Paul Magron

Deep Transductive Transfer Regression Network for Cross-Corpus Speech Emotion Recognition
Yan Zhao, Jincen Wang, Ru Ye, Yuan Zong, Wenming Zheng, Li Zhao

Audio Anti-spoofing Using Simple Attention Module and Joint Optimization Based on Additive Angular Margin Loss and Meta-learning
John H.L. Hansen, Zhenyu Wang

PEAF: Learnable Power Efficient Analog Acoustic Features for Audio Recognition
Boris Bergsma, Minhao Yang, Milos Cernak

Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load
Gasser Elbanna, Alice Biryukov, Neil Scheidwasser-Clow, Lara Orlandic, Pablo Mainar, Mikolaj Kegler, Pierre Beckmann, Milos Cernak

Generative Data Augmentation Guided by Triplet Loss for Speech Emotion Recognition
Shijun Wang, Hamed Hemati, Jón Guðnason, Damian Borth

Learning neural audio features without supervision
Sarthak Yadav, Neil Zeghidour

Densely-connected Convolutional Recurrent Network for Fundamental Frequency Estimation in Noisy Speech
Yixuan Zhang, Heming Wang, DeLiang Wang

Predicting label distribution improves non-intrusive speech quality estimation
Abu Zaher Md Faridee, Hannes Gamper

Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models
Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka

Dataset Pruning for Resource-constrained Spoofed Audio Detection
Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza


Speech Synthesis: Linguistic Processing, Paradigms and Other Topics II


EdiTTS: Score-based Editing for Controllable Text-to-Speech
Jaesung Tae, Hyeongju Kim, Taesu Kim

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information
Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

SpeechPainter: Text-conditioned Speech Inpainting
Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi

A polyphone BERT for Polyphone Disambiguation in Mandarin Chinese
Song Zhang, Ken Zheng, Xiaoxu Zhu, Baoxiang Li

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge
Mutian He, Jingzhou Yang, Lei He, Frank Soong

ByT5 model for massively multilingual grapheme-to-phoneme conversion
Jian Zhu, Cong Zhang, David Jurgens

DocLayoutTTS: Dataset and Baselines for Layout-informed Document-level Neural Speech Synthesis
Puneet Mathur, Franck Dernoncourt, Quan Hung Tran, Jiuxiang Gu, Ani Nenkova, Vlad Morariu, Rajiv Jain, Dinesh Manocha

Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech
Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao

Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition
Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

An Efficient and High Fidelity Vietnamese Streaming End-to-End Speech Synthesis
Tho Nguyen Duc Tran, The Chuong Chu, Vu Hoang, Trung Huu Bui, Hung Quoc Truong

Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks
Cassia Valentini-Botinhao, Manuel Sam Ribeiro, Oliver Watts, Korin Richmond, Gustav Eje Henter

An Automatic Soundtracking System for Text-to-Speech Audiobooks
Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin

Environment Aware Text-to-Speech Synthesis
Daxin Tan, Guangyan Zhang, Tan Lee

SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation
Artem Ploujnikov, Mirco Ravanelli

Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization
Evelina Bakhturina, Yang Zhang, Boris Ginsburg

Prosodic alignment for off-screen automatic dubbing
Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote

A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis
Qibing Bai, Tom Ko, Yu Zhang

CAUSE: Crossmodal Action Unit Sequence Estimation from Speech
Hirokazu Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka

Visualising Model Training via Vowel Space for Text-To-Speech Systems
Binu Nisal Abeysinghe, Jesin James, Catherine Watson, Felix Marattukalam


Other Topics in Speech Recognition


Binary Early-Exit Network for Adaptive Inference on Low-Resource Devices
Aaqib Saeed

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings
Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data
Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura

Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation
Yi-Kai Zhang, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan

Federated Domain Adaptation for ASR with Full Self-Supervision
Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, Frank Seide

Augmented Adversarial Self-Supervised Learning for Early-Stage Alzheimer's Speech Detection
Longfei Yang, Wenqing Wei, Sheng Li, Jiyi Li, Takahiro Shinozaki

Extending RNN-T-based speech recognition systems with emotion and language classification
Zvi Kons, Hagai Aronowitz, Edmilson Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon

Thutmose Tagger: Single-pass neural model for Inverse Text Normalization
Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg

Leveraging Prosody for Punctuation Prediction of Spontaneous Speech
Yeonjin Cho, Sara Ng, Trang Tran, Mari Ostendorf

A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings
Fan Yu, Zhihao Du, ShiLiang Zhang, Yuxiao Lin, Lei Xie








Speech Processing & Measurement


Relationship between the acoustic time intervals and tongue movements of German diphthongs
Arne-Lukas Fietkau, Simon Stone, Peter Birkholz

Development of allophonic realization until adolescence: A production study of the affricate-fricative variation of /z/ among Japanese children
Sanae Matsui, Kyoji Iwamoto, Reiko Mazuka

Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition
Chung-Soo Ahn, Chamara Kasun, Sunil Sivadas, Jagath Rajapakse

Low-Level Physiological Implications of End-to-End Learning for Speech Recognition
Louise Coppieters de Gibson, Philip N. Garner

Idiosyncratic lingual articulation of American English /æ/ and /ɑ/ using network analysis
Carolina Lins Machado, Volker Dellwo, Lei He

Method for improving the word intelligibility of presented speech using bone-conduction headphones
Teruki Toya, Wenyu Zhu, Maori Kobayashi, Kenichi Nakamura, Masashi Unoki

Three-dimensional finite-difference time-domain acoustic analysis of simplified vocal tract shapes
Debasish Mohapatra, Mario Fleischer, Victor Zappi, Peter Birkholz, Sidney Fels

Speech imitation skills predict automatic phonetic convergence: a GMM-UBM study on L2
Dorina de Jong, Aldo Pastore, Noël Nguyen, Alessandro D'Ausilio

Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE
Marc-Antoine Georges, Jean-Luc Schwartz, Thomas Hueber

Deep Speech Synthesis from Articulatory Representations
Peter Wu, Shinji Watanabe, Louis Goldstein, Alan W Black, Gopala Krishna Anumanchipalli

Orofacial somatosensory inputs in speech perceptual training modulate speech production
Monica Ashokumar, Jean-Luc Schwartz, Takayuki Ito


Speech Synthesis: Acoustic Modeling and Neural Waveform Generation I


Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus
Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Sunghwan Ahn, Joun Yeop Lee, Nam Soo Kim

DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning
Takaaki Saeki, Kentaro Tachibana, Ryuichi Yamamoto

MSR-NV: Neural Vocoder Using Multiple Sampling Rates
Kentaro Mitsui, Kei Sawada

SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping
Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani

Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge
Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V. Porov, Konstantin Osipov, June Sig Sung

Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech
Jaesung Bae, Jinhyeok Yang, Taejun Bak, Young-Sun Joo

End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation
Krishna Subramani, Jean-Marc Valin, Umut Isik, Paris Smaragdis, Arvindh Krishnaswamy

EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models
Perry Lam, Huayun Zhang, Nancy Chen, Berrak Sisman

Fine-grained Noise Control for Multispeaker Speech Synthesis
Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

WavThruVec: Latent speech representation as intermediate features for neural speech synthesis
Hubert Siuzdak, Piotr Dura, Pol van Rijn, Nori Jacoby

Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU
Ivan Vovk, Tasnima Sadekova, Vladimir Gogoryan, Vadim Popov, Mikhail Kudinov, Jiansheng Wei

Simple and Effective Unsupervised Speech Synthesis
Alexander H. Liu, Cheng-I Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass

Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation
Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda



Spatial Audio


Training Data Generation with DOA-based Selecting and Remixing for Unsupervised Training of Deep Separation Models
Hokuto Munakata, Ryu Takeda, Kazunori Komatani

Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output
Hangting Chen, Yi Yang, Feng Dang, Pengyuan Zhang

Joint Estimation of Direction-of-Arrival and Distance for Arrays with Directional Sensors based on Sparse Bayesian Learning
Feifei Xiong, Pengyu Wang, Zhongfu Ye, Jinwei Feng

How to Listen? Rethinking Visual Sound Localization
Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan Pablo Bello

Small Footprint Neural Networks for Acoustic Direction of Arrival Estimation
Zhiheng Ouyang, Miao Wang, Wei-Ping Zhu

Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation
Xiaoyu Wang, Xiangyu Kong, Xiulian Peng, Yan Lu

MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources
Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang

Iterative Sound Source Localization for Unknown Number of Sources
Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang

Distance-Based Sound Separation
Katharine Patterson, Kevin Wilson, Scott Wisdom, John R. Hershey

VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang

TRUNet: Transformer-Recurrent-U Network for Multi-channel Reverberant Sound Source Separation
Ali Aroudi, Stefan Uhlich, Marc Ferras Font


Single-channel Speech Enhancement II


PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement
Xiaofeng Ge, Jiangyu Han, Yanhua Long, Haixin Guan

Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement
Zhuangqi Chen, Pingjian Zhang

Cross-Layer Similarity Knowledge Distillation for Speech Enhancement
Jiaming Cheng, Ruiyu Liang, Yue Xie, Li Zhao, Björn Schuller, Jie Jia, Yiyuan Peng

Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation
Feifei Xiong, Weiguang Chen, Pengyu Wang, Xiaofei Li, Jinwei Feng

CMGAN: Conformer-based Metric GAN for Speech Enhancement
Ruizhe Cao, Sherif Abdulatif, Bin Yang

Model Compression by Iterative Pruning with Knowledge Distillation and Its Application to Speech Enhancement
Zeyuan Wei, Li Hao, Xueliang Zhang

Single-channel speech enhancement using Graph Fourier Transform
Chenhui Zhang, Xiang Pan

Joint Optimization of the Module and Sign of the Spectral Real Part Based on CRN for Speech Denoising
Zilu Guo, Xu Xu, Zhongfu Ye

Attentive Recurrent Network for Low-Latency Active Noise Control
Hao Zhang, Ashutosh Pandey, DeLiang Wang

Memory-Efficient Multi-Step Speech Enhancement with Neural ODE
Jen-Hung Huang, Chung-Hsien Wu

GLD-Net: Improving Monaural Speech Enhancement by Learning Global and Local Dependency Features with GLD Block
Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Jianjun Hao

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention
Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Dejun Li

Speech Enhancement with Fullband-Subband Cross-Attention Network
Jun Chen, Wei Rao, Zilin Wang, Zhiyong Wu, Yannan Wang, Tao Yu, Shidong Shang, Helen Meng

OSSEM: one-shot speaker adaptive speech enhancement using meta learning
Cheng Yu, Szu-wei Fu, Tsun-An Hsieh, Yu Tsao, Mirco Ravanelli

Efficient Speech Enhancement with Neural Homomorphic Synthesis
Wenbin Jiang, Tao Liu, Kai Yu

Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation
Manthan Thakker, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang

Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations
Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura


Novel Models and Training Methods for ASR II


FedNST: Federated Noisy Student Training for Automatic Speech Recognition
Haaris Mehmood, Agnieszka Dobrowolska, Karthikeyan Saravanan, Mete Ozay

SCaLa: Supervised Contrastive Learning for End-to-End Speech Recognition
Li Fu, Xiaoxiao Li, Runyu Wang, Lu Fan, Zhengchen Zhang, Meng Chen, Youzheng Wu, Xiaodong He

NAS-SCAE: Searching Compact Attention-based Encoders For End-to-end Automatic Speech Recognition
Yukun Liu, Ta Li, Pengyuan Zhang, Yonghong Yan

Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR
Kun Wei, Yike Zhang, Sining Sun, Lei Xie, Long Ma

PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition
Guodong Ma, Pengfei Hu, Nurmemet Yolwas, Shen Huang, Hao Huang

Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition
Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno

Improving Rare Word Recognition with LM-aware MWER Training
Wang Weiran, Tongzhou Chen, Tara Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

Improving the Training Recipe for a Robust Conformer-based Hybrid Model
Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Ralf Schlüter, Hermann Ney

CTC Variations Through New WFST Topologies
Aleksandr Laptev, Somshubra Majumdar, Boris Ginsburg

Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition
Martin Sustek, Samik Sadhu, Hynek Hermansky

Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition
Chenfeng Miao, Kun Zou, Ziyang Zhuang, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao

On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training
Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker

From Undercomplete to Sparse Overcomplete Autoencoders to Improve LF-MMI based Speech Recognition
Selen Hande Kabil, Herve Bourlard

Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks
Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya

Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR
Takashi Maekaku, Yuya Fujita, Yifan Peng, Shinji Watanabe


Spoken Dialogue Systems and Multimodality


Reducing Offensive Replies in Open Domain Dialogue Systems
Naokazu Uchida, Takeshi Homma, Makoto Iwayama, Yasuhiro Sogawa

Induce Spoken Dialog Intents via Deep Unsupervised Context Contrastive Clustering
Ting-Wei Wu, Biing Juang

Dialogue Acts Aided Important Utterance Detection Based on Multiparty and Multimodal Information
Fumio Nihei, Ryo Ishii, Yukiko Nakano, Kyosuke Nishida, Ryo Masumura, Atsushi Fukayama, Takao Nakamura

Contextual Acoustic Barge-In Classification for Spoken Dialog Systems
Dhanush Bekal, Sundararajan Srinivasan, Srikanth Ronanki, Sravan Bodapati, Katrin Kirchhoff

Calibrate and Refine! A Novel and Agile Framework for ASR Error Robust Intent Detection
Peilin Zhou, Dading Chong, Helin Wang, Qingcheng Zeng

ASR-Robust Natural Language Understanding on ASR-GLUE dataset
Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng

From Disfluency Detection to Intent Detection and Slot Filling
Mai Hoang Dao, Thinh Truong, Dat Quoc Nguyen

Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis
Hengshun Zhou, Jun Du, Gongzhen Zou, Zhaoxu Nian, Chin-Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe, Odette Scharenborg, Jingdong Chen, Shifu Xiong, Jian-Qing Gao

Extending Compositional Attention Networks for Social Reasoning in Videos
Christina Sartzetaki, Georgios Paraskevopoulos, Alexandros Potamianos

TopicKS: Topic-driven Knowledge Selection for Knowledge-grounded Dialogue Generation
Shiquan Wang, Yuke Si, Xiao Wei, Longbiao Wang, Zhiqiang Zhuang, Xiaowang Zhang, Jianwu Dang

Bottom-up discovery of structure and variation in response tokens (‘backchannels’) across diverse languages
Andreas Liesenfeld, Mark Dingemanse

Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding
Yi Zhu, Zexun Wang, Hang Liu, Peiying Wang, Mingchao Feng, Meng Chen, Xiaodong He

Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with Autism
Keiko Ochi, Nobutaka Ono, Keiho Owada, Kuroda Miho, Shigeki Sagayama, Hidenori Yamasue









Phonetics I


Gradual Improvements Observed in Learners' Perception and Production of L2 Sounds Through Continuing Shadowing Practices on a Daily Basis
Takuya Kunihara, Chuanbo Zhu, Nobuaki Minematsu, Noriko Nakanishi

Spoofed speech from the perspective of a forensic phonetician
Christin Kirchhübel, Georgina Brown

Investigating Prosodic Variation in British English Varieties using ProPer
Hae-Sung Jeon, Stephen Nichols

Perceived prominence and downstep in Japanese
Hyun Kyung Hwang, Manami Hirayama, Takaomi Kato

The discrimination of [zi]-[dʑi] by Japanese listeners and the prospective phonologization of /zi/
Andrea Alicehajic, Silke Hamann

Glottal inverse filtering based on articulatory synthesis and deep learning
Ingo Langheinrich, Simon Stone, Xinyu Zhang, Peter Birkholz

Investigating phonetic convergence of laughter in conversation
Bogdan Ludusan, Marin Schröer, Petra Wagner

Telling self-defining memories: An acoustic study of natural emotional speech productions
Veronique Delvaux, Audrey Lavallée, Fanny Degouis, Xavier Saloppe, Jean-Louis Nandrino, Thierry Pham

Voicing neutralization in Romanian fricatives across different speech styles
Laura Spinu, Ioana Vasilescu, Lori Lamel, Jason Lilley

Nasal Coda Loss in the Chengdu Dialect of Mandarin: Evidence from RT-MRI
Sishi Liao, Phil Hoole, Conceição Cunha, Esther Kunay, Aletheia Cui, Lia Saki Bučar Shigemori, Felicitas Kleber, Dirk Voit, Jens Frahm, Jonathan Harrington

ema2wav: doing articulation by Praat
Philipp Buech, Simon Roessig, Lena Pagel, Doris Muecke, Anne Hermes




Speaker Embedding and Diarization


PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification
Siqi Zheng, Hongbin Suo, Qian Chen

Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings
Xiaoyi Qin, Na Li, Weng Chao, Dan Su, Ming Li

Online Target Speaker Voice Activity Detection for Speaker Diarization
Weiqing Wang, Ming Li, Qingjian Lin

Probabilistic Spherical Discriminant Analysis: An Alternative to PLDA for length-normalized embeddings
Niko Brummer, Albert Swart, Ladislav Mosner, Anna Silnova, Oldrich Plchot, Themos Stafylakis, Lukas Burget

Deep speaker embedding with frame-constrained training strategy for speaker verification
Bin Gu

Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization
Yifan Chen, Yifan Guo, Qingxuan Li, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

End-to-End Audio-Visual Neural Speaker Diarization
Mao-Kui He, Jun Du, Chin-Hui Lee

Online Speaker Diarization with Core Samples Selection
Yanyan Yue, Jun Du, Mao-Kui He, YuTing Yeung, Renyu Wang

Robust End-to-end Speaker Diarization with Generic Neural Clustering
Chenyu Yang, Yu Wang

MSDWild: Multi-modal Speaker Diarization Dataset in the Wild
Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Yanmin Qian, Kai Yu

Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning Free
Md Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, Rosie Jones

Utterance-by-utterance overlap-aware neural diarization with Graph-PIT
Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, Reinhold Haeb-Umbach

Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting
Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Qingyang Hong


Acoustic Event Detection and Classification


Selective Pseudo-labeling and Class-wise Discriminative Fusion for Sound Event Detection
Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang

An End-to-End Macaque Voiceprint Verification Method Based on Channel Fusion Mechanism
Peng Liu, Songbin Li, Jigang Tang

Human Sound Classification based on Feature Fusion Method with Air and Bone Conducted Signal
Liang Xu, Jing Wang, Lizhong Wang, Sijun Bi, Jianqian Zhang, Qiuyue Ma

RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection
Dongchao Yang, Helin Wang, Zhongjie Ye, Yuexian Zou, WenWu Wang

Temporal Self Attention-Based Residual Network for Environmental Sound Classification
Achyut Tripathi, Konark Paul

AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification
Juncheng Li, Shuhui Qu, Po-Yao Huang, Florian Metze

Improving Target Sound Extraction with Timestamp Information
Helin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou

A Multi-grained based Attention Network for Semi-supervised Sound Event Detection
Ying Hu, Xiujuan Zhu, Yunlong Li, Hao Huang, Liang He

Temporal coding with magnitude-phase regularization for sound event detection
Sangwook Park, Sandeep Reddy Kothinti, Mounya Elhilali

RCT: Random consistency training for semi-supervised sound event detection
Nian Shao, Erfan Loweimi, Xiaofei Li

Audio Pyramid Transformer with Domain Adaption for Weakly Supervised Sound Event Detection and Audio Classification
Yifei Xin, Dongchao Yang, Yuexian Zou

Active Few-Shot Learning for Sound Event Detection
Yu Wang, Mark Cartwright, Juan Pablo Bello

Uncertainty Calibration for Deep Audio Classifiers
Tong Ye, Shijing Si, Jianzong Wang, Ning Cheng, Jing Xiao

Event-related data conditioning for acoustic event classification
Yuanbo Hou, Dick Botteldooren


Speech Synthesis: Acoustic Modeling and Neural Waveform Generation II


A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTS
Haohan Guo, Hui Lu, Xixin Wu, Helen Meng

RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion
Dacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao, Chong Luo

FlowVocoder: A small Footprint Neural Vocoder based Normalizing Flow for Speech Synthesis
Manh Luong, Viet Anh Tran

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders
Yanqing Liu, Ruiqing Xue, Lei He, Xu Tan, Sheng Zhao

AdaVocoder: Adaptive Vocoder for Custom Voice
Xin Yuan, Robin Feng, Mingming Ye, Cheng Tuo, Minhang Zhang

RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses
Shengyuan Xu, Wenxiao Zhao, Jing Guo

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature
Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu

Improving GAN-based vocoder for fast and high-quality speech synthesis
He Mengnan, Tingwei Guo, Zhenxing Lu, Zhang Ruixiong, Gong Caixia

SoftSpeech: Unsupervised Duration Model in FastSpeech 2
Yuan-Hao Yi, Lei He, Shifeng Pan, Xi Wang, Yuchao Zhang

A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS
Haohan Guo, Feng-Long Xie, Frank Soong, Xixin Wu, Helen Meng

SiD-WaveFlow: A Low-Resource Vocoder Independent of Prior Knowledge
Yuhan Li, Ying Shen, Dongqing Wang, Lin Zhang

Text-to-speech synthesis using spectral modeling based on non-negative autoencoder
Takeru Gorai, Daisuke Saito, Nobuaki Minematsu

Joint Modeling of Multi-Sample and Subband Signals for Fast Neural Vocoding on CPU
Hiroki Kanagawa, Yusuke Ijima, Hiroyuki Toda

MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual Blocks
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki

A compact transformer-based GAN vocoder
Chenfeng Miao, Ting Chen, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao

Diffusion Generative Vocoder for Fullband Speech Synthesis Based on Weak Third-order SDE Solver
Hideyuki Tachibana, Muneyoshi Inahara, Mocho Go, Yotaro Katayama, Yotaro Watanabe


ASR: Architecture and Search


On Adaptive Weight Interpolation of the Hybrid Autoregressive Transducer
Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran

Learning to rank with BERT-based confidence models in ASR rescoring
Ting-Wei Wu, I-Fan Chen, Ankur Gandhe

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury

WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit
Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, Jianwei Niu

Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR
Yufei Liu, Rao Ma, Haihua Xu, Yi He, Zejun Ma, Weibin Zhang

Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies
Zehan Li, Haoran Miao, Keqi Deng, Gaofeng Cheng, Sanli Tian, Ta Li, Yonghong Yan

Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition
Ye Bai, Jie Li, Wenjing Han, Hao Ni, Kaituo Xu, Zhuo Zhang, Cheng Yi, Xiaorui Wang

CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer
Zhanheng Yang, Sining Sun, Jin Li, Xiaoming Zhang, Xiong Wang, Long Ma, Lei Xie

LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT
Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li

Multi-stage Progressive Compression of Conformer Transducer for On-device Speech Recognition
Jash Rathod, Nauman Dawalatabad, Shatrughan Singh, Dhananjaya Gowda

Streaming Align-Refine for Non-autoregressive Deliberation
Weiran Wang, Ke Hu, Tara Sainath

Federated Pruning: Improving Neural Network Efficiency with Federated Learning
Rongmei Lin, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Francoise Beaufays

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
Shaojin Ding, Weiran Wang, Ding Zhao, Tara Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

4-bit Conformer with Native Quantization Aware Training for Speech Recognition
Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov

Self-Distillation Based on High-level Information Supervision for Compressing End-to-End ASR Model
Qiang Xu, Tongtong Song, Longbiao Wang, Hao Shi, Yuqin Lin, Yongjie Lv, Meng Ge, Qiang Yu, Jianwu Dang


Spoken Language Processing II


Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation
Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka

A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation
Linh The Nguyen, Nguyen Luong Tran, Long Doan, Manh Luong, Dat Quoc Nguyen

Investigating Parameter Sharing in Multilingual Speech Translation
Qian Wang, Chen Wang, Jiajun Zhang

Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset
Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, Lei Xie, Yonghong Yan

TALCS: An open-source Mandarin-English code-switching corpus and a speech recognition baseline
Chengfei Li, Shuhao Deng, Yaoping Wang, Guangjing Wang, Yaguang Gong, Changbin Chen, Jinfeng Bai

Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation
Keqi Deng, Shinji Watanabe, Jiatong Shi, Siddhant Arora

BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese
Nguyen Luong Tran, Duong Le, Dat Quoc Nguyen

Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task
Maxim Markitantov, Elena Ryumina, Dmitry Ryumin, Alexey Karpov

Bayesian Transformer Using Disentangled Mask Attention
Jen-Tzung Chien, Yu-Han Huang

Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis
Hang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe, Odette Scharenborg, Jingdong Chen, Baocai Yin, Jia Pan

From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation
Danni Liu, Changhan Wang, Hongyu Gong, Xutai Ma, Yun Tang, Juan Pino

Isochrony-Aware Neural Machine Translation for Automatic Dubbing
Derek Tam, Surafel M. Lakew, Yogesh Virkar, Prashant Mathur, Marcello Federico

Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation
Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Qibing Bai, Yu Zhang


Multimodal Speech Emotion Recognition and Paralinguistics


Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry Recognition
Hai-tao Xu, Jie Zhang, Li-rong Dai

Towards Automated Dialog Personalization using MBTI Personality Indicators
Daniel Fernau, Stefan Hillmann, Nils Feldhus, Tim Polzehl

Word-wise Sparse Attention for Multimodal Sentiment Analysis
Fan Qian, Hongwei Song, Jiqing Han

Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model
Tarun Gupta, Tuan Duc Truong, Tran The Anh, Eng Siong Chng

Exploring Multi-task Learning Based Gender Recognition and Age Estimation for Class-imbalanced Data
Weiqiao Zheng, Ping Yang, Rongfeng Lai, Kongyang Zhu, Tao Zhang, Junpeng Zhang, Hongcheng Fu

Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition
Jie Wei, Guanyu Hu, Xinyu Yang, Anh Tuan Luu, Yizhuo Dong

Impact of Background Noise and Contribution of Visual Information in Emotion Identification by Native Mandarin Speakers
Minyue Zhang, Hongwei Ding

Exploiting Fine-tuning of Self-supervised Learning Models for Improving Bi-modal Sentiment Analysis and Emotion Recognition
Wei Yang, Satoru Fukayama, Panikos Heracleous, Jun Ogata

Characterizing Therapist's Speaking Style in Relation to Empathy in Psychotherapy
Dehua Tao, Tan Lee, Harold Chui, Sarah Luk

Hierarchical Attention Network for Evaluating Therapist Empathy in Counseling Session
Dehua Tao, Tan Lee, Harold Chui, Sarah Luk

Context-aware Multimodal Fusion for Emotion Recognition
Jinchao Li, Shuai Wang, Yang Chao, Xunying Liu, Helen Meng

Unsupervised Instance Discriminative Learning for Depression Detection from Speech Signals
Jinhan Wang, Vijay Ravi, Jonathan Flint, Abeer Alwan

How do our eyebrows respond to masks and whispering? The case of Persians
Nasim Mahdinazhad Sardhaei, Marzena Zygis, Hamid Sharifzadeh

State & Trait Measurement from Nonverbal Vocalizations: A Multi-Task Joint Learning Approach
Alice Baird, Panagiotis Tzirakis, Jeff Brooks, Lauren Kim, Michael Opara, Chris Gregory, Jacob Metrick, Garrett Boseck, Dacher Keltner, Alan Cowen

Confidence Measure for Automatic Age Estimation From Speech
Amruta Saraf, Ganesh Sivaraman, Elie Khoury


Neural Transducers, Streaming ASR and Novel ASR Models


Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization
Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan

Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition
Guangzhi Sun, Chao Zhang, Phil Woodland

Bring dialogue-context into RNN-T for streaming ASR
Junfeng Hou, Jinkun Chen, Wanyu Li, Yufeng Tang, Jun Zhang, Zejun Ma

Conformer with dual-mode chunked attention for joint online and offline ASR
Felix Weninger, Marco Gaudesi, Md Akmal Haidar, Nicola Ferri, Jesús Andrés-Ferrer, Puming Zhan

Efficient Training of Neural Transducer for Speech Recognition
Wei Zhou, Wilfried Michel, Ralf Schlüter, Hermann Ney

Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
Zhifu Gao, ShiLiang Zhang, Ian McLoughlin, Zhijie Yan

Pruned RNN-T for fast, memory-efficient ASR training
Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, Daniel Povey

Deep Sparse Conformer for Speech Recognition
Xianchao Wu

Chain-based Discriminative Autoencoders for Speech Recognition
Hung-Shin Lee, Pin-Tuan Huang, Yao-Fei Cheng, Hsin-Min Wang

Streaming parallel transducer beam search with fast slow cascaded encoders
Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael Seltzer

Self-regularised Minimum Latency Training for Streaming Transformer-based Speech Recognition
Mohan Li, Rama Sanand Doddipatla, Catalin Zorila

On the Prediction Network Architecture in RNN-T for ASR
Dario Albesano, Jesús Andrés-Ferrer, Nicola Ferri, Puming Zhan

Minimum latency training of sequence transducers for streaming end-to-end speech recognition
Yusuke Shinohara, Shinji Watanabe

CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR
Keyu An, Huahuan Zheng, Zhijian Ou, Hongyu Xiang, Ke Ding, Guanglu Wan

Attention Enhanced Citrinet for Speech Recognition
Xianchao Wu



Atypical Speech Analysis and Detection


Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification
Parvaneh Janbakhshi, Ina Kodrasi

Automated Detection of Wilson’s Disease Based on Improved Mel-frequency Cepstral Coefficients with Signal Decomposition
Zhenglin Zhang, Li-Zhuang Yang, Xun Wang, Hai Li

The effect of backward noise on lexical tone discrimination in Mandarin-speaking amusics
Zixia Fan, Jing Shao, Weigong Pan, Min Xu, Lan Wang

Automatic Selection of Discriminative Features for Dementia Detection in Cantonese-Speaking People
Xiaoquan Ke, Man-Wai Mak, Helen M. Meng

Automated Voice Pathology Discrimination from Continuous Speech Benefits from Analysis by Phonetic Context
Zhuoya Liu, Mark Huckvale, Julian McGlashan

Multi-Type Outer Product-Based Fusion of Respiratory Sounds for Detecting COVID-19
Adria Mallol-Ragolta, Helena Cuesta, Emilia Gomez, Björn Schuller

Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics
Xueshuai Zhang, Jiakun Shen, Jun Zhou, Pengyuan Zhang, Yonghong Yan, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shaoxing Zhang, Aijun Sun

Comparing 1-dimensional and 2-dimensional spectral feature representations in voice pathology detection using machine learning and deep learning classifiers
Farhad Javanmardi, Sudarsana Reddy Kadiri, Manila Kodali, Paavo Alku

Zero-Shot Cross-lingual Aphasia Detection using Automatic Speech Recognition
Gerasimos Chatzoudis, Manos Plitsis, Spyridoula Stamouli, Athanasia–Lida Dimou, Nassos Katsamanis, Vassilis Katsouros

Domain-aware Intermediate Pretraining for Dementia Detection with Limited Data
Youxiang Zhu, Xiaohui Liang, John A. Batsis, Robert M. Roth

Comparison of 5 methods for the evaluation of intelligibility in mild to moderate French dysarthric speech
Cécile Fougeron, Nicolas Audibert, Ina Kodrasi, Parvaneh Janbakhshi, Michaela Pernon, Nathalie Leveque, Stephanie Borel, Marina Laganaro, Herve Bourlard, Frederic Assal


Speech Synthesis: Tools, Data, and Evaluation


Automatic Evaluation of Speaker Similarity
Kamil Deja, Ariadna Sanchez, Julian Roth, Marius Cotescu

Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)
Ziyao Zhang, Alessio Falai, Ariadna Sanchez, Orazio Angelini, Kayoko Yanagisawa

J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis
Shinnosuke Takamichi, Wataru Nakata, Naoko Tanji, Hiroshi Saruwatari

REYD – The First Yiddish Text-to-Speech Dataset and System
Jacob Webber, Samuel K. Lo, Isaac L. Bleaman

Data-augmented cross-lingual synthesis in a teacher-student framework
Marcel de Korte, Jaebok Kim, Aki Kunikoshi, Adaeze Adigwe, Esther Klabbers

Production characteristics of obstruents in WaveNet and older TTS systems
Ayushi Pandey, Sébastien Le Maguer, Julie Carson-Berndsen, Naomi Harte

Back to the Future: Extending the Blizzard Challenge 2013
Sébastien Le Maguer, Simon King, Naomi Harte

BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus
Josh Meyer, David Adelani, Edresson Casanova, Alp Öktem, Daniel Whitenack, Julian Weber, Salomon Kabongo Kabenamualu, Elizabeth Salesky, Iroro Orife, Colin Leong, Perez Ogayo, Chris Chinenye Emezue, Jonathan Mukiibi, Salomey Osei, Apelete Agbolo, Victor Akinode, Bernard Opoku, Olanrewaju Samuel, Jesujoba Alabi, Shamsuddeen Hassan Muhammad

SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis
Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis



Speech and Language in Health: From Remote Monitoring to Medical Conversations II


What can Speech and Language Tell us About the Working Alliance in Psychotherapy
Sebastian Peter Bayerl, Gabriel Roccabruna, Shammur Absar Chowdhury, Tommaso Ciulli, Morena Danieli, Korbinian Riedhammer, Giuseppe Riccardi

TB or not TB? Acoustic cough analysis for tuberculosis classification
Geoffrey T. Frost, Grant Theron, Thomas Niesler

Are reported accuracies in the clinical speech machine learning literature overoptimistic?
Visar Berisha, Chelsea Krantsevich, Gabriela Stegmann, Shira Hahn, Julie Liss

Automatic Detection of Expressed Emotion from Five-Minute Speech Samples: Challenges and Opportunities
Bahman Mirheidari, Andre Bittar, Nicholas Cummins, Johnny Downs, Helen L. Fisher, Heidi Christensen

Automatic cognitive assessment: Combining sparse datasets with disparate cognitive scores
Bahman Mirheidari, Daniel Blackburn, Heidi Christensen

Exploring Semi-supervised Learning for Audio-based COVID-19 Detection using FixMatch
Ting Dang, Thomas Quinnell, Cecilia Mascolo

Analyzing the impact of SARS-CoV-2 variants on respiratory sound signals
Debarpan Bhattacharya, Debottam Dutta, Neeraj Sharma, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K K, Sadhana Gonuguntla, Murali Alagesan

Automated Evaluation of Standardized Dementia Screening Tests
Franziska Braun, Markus Förstel, Bastian Oppermann, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Korbinian Riedhammer

Alzheimer's Detection from English to Spanish Using Acoustic and Linguistic Embeddings
Paula Andrea Pérez-Toro, Philipp Klumpp, Abner Hernandez, Tomas Arias, Patricia Lillo, Andrea Slachevsky, Adolfo Martín García, Maria Schuster, Andreas K. Maier, Elmar Noeth, Juan Rafael Orozco-Arroyave

Extract and Abstract with BART for Clinical Notes from Doctor-Patient Conversations
Jing Su, Longxiang Zhang, Hamid Reza Hassanzadeh, Thomas Schaaf

Dyadic Interaction Assessment from Free-living Audio for Depression Severity Assessment
Bishal Lamichhane, Nidal Moukaddam, Ankit B. Patel, Ashutosh Sabharwal

COVID-19 detection based on respiratory sensing from speech
Venkata Srikanth Nallanthighal, Aki Harma, Helmer Strik



Voice Conversion and Adaptation III


Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers
Liumeng Xue, Shan Yang, Na Hu, Dan Su, Lei Xie

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion
SiCheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu, Aolan Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng

FlowCPCVC: A Contrastive Predictive Coding Supervised Flow Framework for Any-to-Any Voice Conversion
Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang

Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion
Yi Lei, Shan Yang, Jian Cong, Lei Xie, Dan Su

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
Yihan Wu, Xu Tan, Bohan Li, Lei He, Sheng Zhao, Ruihua Song, Tao Qin, Tie-Yan Liu

Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
Yixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng

Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion
Haoquan Yang, Liqun Deng, Yu Ting Yeung, Nianzu Zheng, Yong Xu

Accent Conversion using Pre-trained Model and Synthesized Data from Voice Conversion
Tuan Nam Nguyen, Ngoc-Quan Pham, Alexander Waibel

VoiceMe: Personalized voice generation in TTS
Pol van Rijn, Silvan Mertes, Dominik Schiller, Piotr Dura, Hubert Siuzdak, Peter M. C. Harrison, Elisabeth André, Nori Jacoby

DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion
Ruibin Yuan, Yuxuan Wu, Jacob Li, Jaxter Kim

Towards Improved Zero-shot Voice Conversion with Conditional DSVAE
Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu

Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion
Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li


Novel Models and Training Methods for ASR III


Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition
Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong

A Complementary Joint Training Approach Using Unpaired Speech and Text
Yeqian Du, Jie Zhang, Qiu-shi Zhu, Lirong Dai, MingHui Wu, Xin Fang, ZhouWang Yang

Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition
Xun Gong, Zhikai Zhou, Yanmin Qian

Confidence Score Based Conformer Speaker Adaptation for Speech Recognition
Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Mengzhe Geng, Guinan Li, Xunying Liu, Helen Meng

Decoupled Federated Learning for ASR with Non-IID Data
Han Zhu, Jindong Wang, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

Knowledge Distillation For CTC-based Speech Recognition Via Consistent Acoustic Representation Learning
Sanli Tian, Keqi Deng, Zehan Li, Lingxuan Ye, Gaofeng Cheng, Ta Li, Yonghong Yan

Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing
Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata

Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training
Chengyi Wang, Yiming Wang, Yu Wu, Sanyuan Chen, Jinyu Li, Shujie Liu, Furu Wei

Speech Pre-training with Acoustic Piece
Shuo Ren, Shujie Liu, Yu Wu, Long Zhou, Furu Wei

Censer: Curriculum Semi-supervised Learning for Speech Recognition Based on Self-supervised Pre-training
Bowen Zhang, Songjun Cao, Xiaoming Zhang, Yike Zhang, Long Ma, Takahiro Shinozaki

Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, Lirong Dai, Jinyu Li, Yao Qian, Furu Wei

PISA: PoIncaré Saliency-Aware Interpolative Augmentation
Ramit Sawhney, Megh Thakkar, Vishwa Shah, Puneet Mathur, Vasu Sharma, Dinesh Manocha

Online Continual Learning of End-to-End Speech Recognition Models
Muqiao Yang, Ian Lane, Shinji Watanabe

Streaming Target-Speaker ASR with Neural Transducer
Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki

SPLICEOUT: A Simple and Efficient Audio Augmentation Method
Arjit Jain, Pranay Reddy Samala, Deepak Mittal, Preethi Jyothi, Maneesh Singh


Spoken Language Modeling and Understanding


Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems
Vishal Sunder, Eric Fosler-Lussier, Samuel Thomas, Hong-Kwang Kuo, Brian Kingsbury

Japanese ASR-Robust Pre-trained Language Model with Pseudo-Error Sentences Generated by Grapheme-Phoneme Conversion
Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Sen Yoshida

Improving Spoken Language Understanding with Cross-Modal Contrastive Learning
Jingjing Dong, Jiayi Fu, Peng Zhou, Hao Li, Xiaorui Wang

Low-bit Shift Network for End-to-End Spoken Language Understanding
Anderson R. Avila, Khalil Bibi, Rui Heng Yang, Xinlin Li, Chao Xing, Xiao Chen

Meta Auxiliary Learning for Low-resource Spoken Language Understanding
Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

Adversarial Knowledge Distillation For Robust Spoken Language Understanding
Ye Wang, Baishun Ling, Yanmeng Wang, Junhao Xue, Shaojun Wang, Jing Xiao

Incorporating Dual-Aware with Hierarchical Interactive Memory Networks for Task-Oriented Dialogue
Yangyang Ou, Peng Zhang, Jing Zhang, Hui Gao, Xing Ma

Pay More Attention to History: A Context Modeling Strategy for Conversational Text-to-SQL
Yuntao Li, Hanchu Zhang, Yutian Li, Sirui Wang, Wei Wu, Yan Zhang

Small Changes Make Big Differences: Improving Multi-turn Response Selection in Dialogue Systems via Fine-Grained Contrastive Learning
Yuntao Li, Can Xu, Huang Hu, Lei Sha, Yan Zhang, Daxin Jiang

Toward Low-Cost End-to-End Spoken Language Understanding
Marco Dinarelli, Marco Naguib, François Portet

A Multi-Task BERT Model for Schema-Guided Dialogue State Tracking
Eleftherios Kapelonis, Efthymios Georgiou, Alexandros Potamianos

WavPrompt: Towards Few-Shot Spoken Language Understanding with Frozen Language Models
Heting Gao, Junrui Ni, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

Analysis of praising skills focusing on utterance contents
Asahi Ogushi, Toshiki Onishi, Yohei Tahara, Ryo Ishii, Atsushi Fukayama, Takao Nakamura, Akihiro Miyata

Speech2Slot: A Limited Generation Framework with Boundary Detection for Slot Filling from Speech
Pengwei Wang, Yinpei Su, Xiaohuan Zhou, Xin Ye, Liangchen Wei, Ming Liu, Yuan You, Feijun Jiang


Single-channel and multi-channel Speech Enhancement


tPLCnet: Real-time Deep Packet Loss Concealment in the Time Domain Using a Short Temporal Context
Nils L. Westhausen, Bernd T. Meyer

On the Role of Spatial, Spectral, and Temporal Processing for DNN-based Non-linear Multi-channel Speech Enhancement
Kristina Tesch, Nils-Hendrik Mohrmann, Timo Gerkmann

DDS: A new device-degraded speech dataset for speech enhancement
Haoyu Li, Junichi Yamagishi

Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments
Yicheng Du, Aditya Arie Nugraha, Kouhei Sekiguchi, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii

Refining DNN-based Mask Estimation using CGMM-based EM Algorithm for Multi-channel Noise Reduction
Julitta Bartolewska, Stanisław Kacprzak, Konrad Kowalczyk

Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain
Simon Welker, Julius Richter, Timo Gerkmann

Enhancing Embeddings for Speech Classification in Noisy Conditions
Mohamed Nabih Ali, Alessio Brutti, Daniele Falavigna

Deep Audio Waveform Prior
Arnon Turetzky, Tzvi Michelson, Yossi Adi, Shmuel Peleg

Convolutive Weighted Multichannel Wiener Filter Front-end for Distant Automatic Speech Recognition in Reverberant Multispeaker Scenarios
Mieszko Fras, Marcin Witkowski, Konrad Kowalczyk

Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes
Danilo de Oliveira, Tal Peer, Timo Gerkmann

Improving Speech Enhancement through Fine-Grained Speech Characteristics
Muqiao Yang, Joseph Konan, David Bick, Anurag Kumar, Shinji Watanabe, Bhiksha Raj


Voice Conversion and Adaptation II


Creating New Voices using Normalizing Flows
Piotr Bilinski, Thomas Merritt, Abdelhamid Ezzerg, Kamil Pokora, Sebastian Cygert, Kayoko Yanagisawa, Roberto Barra-Chicote, Daniel Korzekwa

Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)
Ariadna Sanchez, Alessio Falai, Ziyao Zhang, Orazio Angelini, Kayoko Yanagisawa

Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS
Kenta Udagawa, Yuki Saito, Hiroshi Saruwatari

GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion
Magdalena Proszewska, Grzegorz Beringer, Daniel Sáez-Trigueros, Thomas Merritt, Abdelhamid Ezzerg, Roberto Barra-Chicote

One-Shot Speaker Adaptation Based on Initialization by Generative Adversarial Networks for TTS
Jaeuk Lee, Joon-Hyuk Chang

Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models
Alon Levkovitch, Eliya Nachmani, Lior Wolf

Advanced Speaker Embedding with Predictive Variance of Gaussian Distribution for Speaker Adaptation in TTS
Jaeuk Lee, Joon-Hyuk Chang

Karaoker: Alignment-free singing voice synthesis with speech training data
Panagiotis Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

ACNN-VC: Utilizing Adaptive Convolution Neural Network for One-Shot Voice Conversion
Ji Sub Um, Yeunju Choi, Hoi Rin Kim

A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling
Tasnima Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, Mikhail Kudinov, Jiansheng Wei

Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis
Tae-Woo Kim, Min-Su Kang, Gyeong-Hoon Lee

Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer
Shrutina Agarwal, Naoya Takahashi, Sriram Ganapathy

Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation
Ryo Terashima, Ryuichi Yamamoto, Eunwoo Song, Yuma Shirahata, Hyun-Wook Yoon, Jae-Min Kim, Kentaro Tachibana



Speech Production, Perception and Multimodality


Perceptual Evaluation of Penetrating Voices through a Semantic Differential Method
Tatsuya Kitamura, Naoki Kunimoto, Hideki Kawahara, Shigeaki Amano

Non-native Perception of Japanese Singleton/Geminate Contrasts: Comparison of Mandarin and Mongolian Speakers Differing in Japanese Experience
Kimiko Tsukada, Yurong Yurong

Evaluating the effects of modified speech on perceptual speaker identification performance
Benjamin O'Brien, Christine Meunier, Alain Ghio

Mandarin Lombard Grid: a Lombard-grid-like corpus of Standard Chinese
Yuhong Yang, Xufeng Chen, Qingmu Liu, Weiping Tu, Hongyang Chen, Linjun Cai

Syllable sequence of /a/+/ta/ can be heard as /atta/ in Japanese with visual or tactile cues
Takayuki Arai, Miho Yamada, Megumi Okusawa

InQSS: a speech intelligibility and quality assessment model using a multi-task learning network
Yu-Wen Chen, Yu Tsao

Investigating the influence of personality on acoustic-prosodic entrainment
Andreas Weise, Rivka Levitan

Common and differential acoustic representation of interpersonal and tactile iconic perception of Mandarin vowels
Yi Li, Xiaoming Jiang

Effects of Noise on Speech Perception and Spoken Word Comprehension
Jovan Eranovic, Daniel Pape, Magda Stroińska, Elisabet Service, Marijana Matkovski

Acquisition of Two Consecutive Neutral Tones in Mandarin-Speaking Preschoolers: Phonological Representation and Phonetic Realization
Sichen Zhang, Aijun Li

Air tissue boundary segmentation using regional loss in real-time Magnetic Resonance Imaging video for speech production
Anwesha Roy, Varun Belagali, Prasanta Ghosh

Language-specific interactions of vowel discrimination in noise
Mark Gibson, Marcel Schlechtweg, Beatriz Blecua Falgueras, Judit Ayala Alcalde

An Improved Transformer Transducer Architecture for Hindi-English Code Switched Speech Recognition
Ansen Antony, Sumanth Reddy Kota, Akhilesh Lade, Spoorthy V, Shashidhar G. Koolagudi

VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices
Venkatesh Shenoy Kadandale, Juan F. Montesinos, Gloria Haro


Multi-, Cross-lingual and Other Topics in ASR II


Cross-Lingual Transfer Learning Approach to Phoneme Error Detection via Latent Phonetic Representation
Jovan M. Dalhouse, Katunobu Itou

Global RNN Transducer Models For Multi-dialect Speech Recognition
Takashi Fukuda, Samuel Thomas, Masayuki Suzuki, Gakuto Kurata, George Saon, Brian Kingsbury

Acoustic Stress Detection in Isolated English Words for Computer-Assisted Pronunciation Training
Vera Bernhard, Sandra Schwab, Jean-Philippe Goldman

On-the-fly ASR Corrections with Audio Exemplars
Golan Pundak, Tsendsuren Munkhdalai, Khe Chai Sim

FFM: A Frame Filtering Mechanism To Accelerate Inference Speed For Conformer In Speech Recognition
Zongfeng Quan, Nick J.C. Wang, Wei Chu, Tao Wei, Shaojun Wang, Jing Xiao

Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems
Mingyu Cui, Jiajun Deng, Shoukang Hu, Xurong Xie, Tianzi Wang, Shujie Hu, Mengzhe Geng, Boyang Xue, Xunying Liu, Helen Meng

Improving Recognition of Out-of-vocabulary Words in E2E Code-switching ASR by Fusing Speech Generation Methods
Lingxuan Ye, Gaofeng Cheng, Runyan Yang, Zehui Yang, Sanli Tian, Pengyuan Zhang, Yonghong Yan

Mitigating bias against non-native accents
Yuanyuan Zhang, Yixuan Zhang, Bence Halpern, Tanvina Patel, Odette Scharenborg

A Multi-level Acoustic Feature Extraction Framework for Transformer Based End-to-End Speech Recognition
Jin Li, Rongfeng Su, Xurong Xie, Lan Wang, Nan Yan

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR
Jinchuan Tian, Jianwei Yu, Chunlei Zhang, Yuexian Zou, Dong Yu

Significance of single frequency filter for the development of children’s KWS system
Biswaranjan Pattanayak, Gayadhar Pradhan

A Language Agnostic Multilingual Streaming On-Device ASR System
Bo Li, Tara Sainath, Ruoming Pang, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

Minimizing Sequential Confusion Error in Speech Command Recognition
Zhanheng Yang, Hang Lv, Xiong Wang, Ao Zhang, Lei Xie

Homophone Disambiguation Profits from Durational Information
Barbara Schuppler, Emil Berger, Xenia Kogler, Franz Pernkopf

Speaker-Specific Utterance Ensemble based Transfer Attack on Speaker Identification
Chu-Xiao Zuo, Jia-Yi Leng, Wu-Jun Li

Complex Frequency Domain Linear Prediction: A Tool to Compute Modulation Spectrum of Speech
Samik Sadhu, Hynek Hermansky

Spectral Modification Based Data Augmentation For Improving End-to-End ASR For Children’s Speech
Vishwanath Pratap Singh, Hardik Sailor, Supratik Bhattacharya, Abhishek Pandey

End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training
Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification
Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-Yiin Chang, Parisa Haghani


Spoken Language Processing III


An Anchor-Free Detector for Continuous Speech Keyword Spotting
Zhiyuan Zhao, Chuanxin Tang, Chengdong Yao, Chong Luo

Low-complex and Highly-performed Binary Residual Neural Network for Small-footprint Keyword Spotting
Xiao Wang, Song Cheng, Jun Li, Shushan Qiao, Yumei Zhou, Yi Zhan

UniKW-AT: Unified Keyword Spotting and Audio Tagging
Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

ESSumm: Extractive Speech Summarization from Untranscribed Meeting
Jun Wang

XTREME-S: Evaluating Cross-lingual Speech Representations
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

Negative Guided Abstractive Dialogue Summarization
Junpeng Liu, Yanyan Zou, Yuxuan Xi, Shengjie Li, Mian Ma, Zhuoye Ding, Bo Long

Exploring representation learning for small-footprint keyword spotting
Fan Cui, Liyong Guo, Quandong Wang, Peng Gao, Yujun Wang

Large-Scale Streaming End-to-End Speech Translation with Neural Transducers
Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur

Phonetic Embedding for ASR Robustness in Entity Resolution
Xiaozhou Zhou, Ruying Bao, William M. Campbell

Hierarchical Tagger with Multi-task Learning for Cross-domain Slot Filling
Xiao Wei, Yuke Si, Shiquan Wang, Longbiao Wang, Jianwu Dang

Multi-class AUC Optimization for Robust Small-footprint Keyword Spotting with Limited Training Data
MengLong Xu, Shengqiang Li, Chengdong Liang, Xiao-Lei Zhang

Weak supervision for Question Type Detection with large language models
Jiří Martínek, Christophe Cerisara, Pavel Kral, Ladislav Lenc, Josef Baloun


Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications


BIT-MI Deep Learning-based Model to Non-intrusive Speech Quality Assessment Challenge in Online Conferencing Applications
Miao Liu, Jing Wang, Liang Xu, Jianqian Zhang, Shicong Li, Fei Xiang

MOS Prediction Network for Non-intrusive Speech Quality Assessment in Online Conferencing
Wenjing Liu, Chuan Xie

Non-intrusive Speech Quality Assessment with a Multi-Task Learning based Subband Adaptive Attention Temporal Convolutional Neural Network
Xiaofeng Shu, Yanjie Chen, Chuxiang Shang, Yan Zhao, Chengshuai Zhao, Yehang Zhu, Chuanzeng Huang, Yuxuan Wang

Soft-label Learn for No-Intrusive Speech Quality Assessment
Junyong Hao, Shunzhou Ye, Cheng Lu, Fei Dong, Jingang Liu, Dong Pi

ConferencingSpeech 2022 Challenge: Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications
Gaoxiong Yi, Wei Xiao, Yiming Xiao, Babak Naderi, Sebastian Möller, Wafaa Wardah, Gabriel Mittag, Ross Cutler, Zhuohuang Zhang, Donald S. Williamson, Fei Chen, Fuzheng Yang, Shidong Shang

MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment
Karl El Hajal, Milos Cernak, Pablo Mainar

CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment
Yuchen Liu, Li-Chia Yang, Alexander Pawlicki, Marko Stamenovic

Impairment Representation Learning for Speech Quality Assessment
Lianwu Chen, Xinlei Ren, Xu Zhang, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu









Speech representation I


Audio Similarity is Unreliable as a Proxy for Audio Quality
Pranay Manocha, Zeyu Jin, Adam Finkelstein

Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure
Sunmook Choi, Il-Youp Kwak, Seungsang Oh

Formant Estimation and Tracking using Probabilistic Heat-Maps
Yosi Shrem, Felix Kreuk, Joseph Keshet

Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck
Youngsik Eom, Yeonghyeon Lee, Ji Sub Um, Hoi Rin Kim

Robust Pitch Estimation Using Multi-Branch CNN-LSTM and 1-Norm LP Residual
Mudit D. Batra, Jayesh, C.S. Ramalingam

DeepFry: Identifying Vocal Fry Using Deep Neural Networks
Bronya Roni Chernyak, Talia Ben Simon, Yael Segal, Jeremy Steffman, Eleanor Chodroff, Jennifer Cole, Joseph Keshet

Phonetic Analysis of Self-supervised Representations of English Speech
Dan Wells, Hao Tang, Korin Richmond

FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Models
Yeonghyeon Lee, Kangwook Jang, Jahyun Goo, Youngmoon Jung, Hoi Rin Kim

On Combining Global and Localized Self-Supervised Models of Speech
Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore

Self-supervised Representation Fusion for Speech and Wearable Based Emotion Recognition
Vipula Dissanayake, Sachith Seneviratne, Hussel Suriyaarachchi, Elliott Wen, Suranga Nanayakkara

Towards Disentangled Speech Representations
Cal Peyser, W. Ronny Huang, Andrew Rosenberg, Tara Sainath, Michael Picheny, Kyunghyun Cho


Pathological Speech Assessment


Automatic Assessment of Speech Intelligibility using Consonant Similarity for Head and Neck Cancer
Sebastião Quintas, Julie Mauclair, Virginie Woisard, Julien Pinquier

Compensation in Verbal and Nonverbal Communication after Total Laryngectomy
Marise Neijman, Femke Hof, Noelle Oosterom, Roland Pfau, Bertus van Rooy, Rob J.J.H. van Son, Michiel M.W.M. van den Brekel

wav2vec2-based Speech Rating System for Children with Speech Sound Disorder
Yaroslav Getman, Ragheb Al-Ghezi, Katja Voskoboinik, Tamás Grósz, Mikko Kurimo, Giampiero Salvi, Torbjørn Svendsen, Sofia Strömbergsson

Distinguishing between pre- and post-treatment in the speech of patients with chronic obstructive pulmonary disease
Andreas Triantafyllopoulos, Markus Fendler, Anton Batliner, Maurice Gerczuk, Shahin Amiriparian, Thomas Berghaus, Björn W. Schuller

A Study on the Phonetic Inventory Development of Children with Cochlear Implants for 5 Years after Implantation
Seonwoo Lee, Sunhee Kim, Minhwa Chung

Evaluation of different antenna types and positions in a stepped frequency continuous-wave radar-based silent speech interface
Joao Vitor Menezes, Pouriya Amini Digehsara, Christoph Wagner, Marco Mütze, Michael Bärhold, Petr Schaffer, Dirk Plettemeier, Peter Birkholz

Validation of the Neuro-Concept Detector framework for the characterization of speech disorders: A comparative study including Dysarthria and Dysphonia
Sondes Abderrazek, Corinne Fredouille, Alain Ghio, Muriel Lalain, Christine Meunier, Virginie Woisard

Nonwords Pronunciation Classification in Language Development Tests for Preschool Children
Ilja Baumann, Dominik Wagner, Sebastian Bayerl, Tobias Bocklet

PERCEPT-R: An Open-Access American English Child/Clinical Speech Corpus Specialized for the Audio Classification of /ɹ/
Nina Benway, Jonathan L. Preston, Elaine Hitchcock, Asif Salekin, Harshit Sharma, Tara McAllister

Data Augmentation for End-to-end Silent Speech Recognition for Laryngectomees
Beiming Cao, Kristin Teplansky, Nordine Sebkhi, Arpan Bhavsar, Omer Inan, Robin Samlan, Ted Mau, Jun Wang

Statistical and clinical utility of multimodal dialogue-based speech and facial metrics for Parkinson's disease assessment
Hardik Kothare, Michael Neumann, Jackson Liscombe, Oliver Roesler, William Burke, Andrew Exner, Sandy Snyder, Andrew Cornish, Doug Habberstad, David Pautler, David Suendermann-Oeft, Jessica Huber, Vikram Ramanarayanan



Speaker and Language Recognition II


Classification of Accented English Using CNN Model Trained on Amplitude Mel-Spectrograms
Mariia Lesnichaia, Veranika Mikhailava, Natalia Bogach, Iurii Lezhenin, John Blake, Evgeny Pyshkin

MIM-DG: Mutual information minimization-based domain generalization for speaker verification
Woohyun Kang, Md Jahangir Alam, Abderrahim Fathan

Multi-Channel Far-Field Speaker Verification with Large-Scale Ad-hoc Microphone Arrays
Chengdong Liang, Yijiang Chen, Jiadi Yao, Xiao-Lei Zhang

Ant Multilingual Recognition System for OLR 2021 Challenge
Anqi Lyu, Zhiming Wang, Huijia Zhu

Class-Aware Distribution Alignment based Unsupervised Domain Adaptation for Speaker Verification
Hang-Rui Hu, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu

EDITnet: A Lightweight Network for Unsupervised Domain Adaptation in Speaker Verification
Jingyu Li, Wei Liu, Tan Lee

Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?
Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu, Furu Wei

Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter
Jinzheng Zhao, Peipei Wu, Xubo Liu, Shidrokh Goudarzi, Haohe Liu, Yong Xu, Wenwu Wang

The HCCL System for the NIST SRE21
Zhuo Li, Runqiu Xiao, Hangting Chen, Zhenduo Zhao, Zihan Zhang, Wenchao Wang

UNet-DenseNet for Robust Far-Field Speaker Verification
Zhenke Gao, Manwai Mak, Weiwei Lin

Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition
Qijie Shao, Jinghao Yan, Jian Kang, Pengcheng Guo, Xian Shi, Pengfei Hu, Lei Xie

Transducer-based language embedding for spoken language identification
Peng Shen, Xugang Lu, Hisashi Kawai

Oriental Language Recognition (OLR) 2021: Summary and Analysis
Binling Wang, Feng Wang, Wenxuan Hu, Qiulin Wang, Jing Li, Dong Wang, Lin Li, Qingyang Hong



Robust ASR, and Far-field/Multi-talker ASR


Streaming Multi-Talker ASR with Token-Level Serialized Output Training
Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

pMCT: Patched Multi-Condition Training for Robust Speech Recognition
Pablo Peso Parada, Agnieszka Dobrowolska, Karthikeyan Saravanan, Mete Ozay

Improving ASR Robustness in Noisy Condition Through VAD Integration
Sashi Novitasari, Takashi Fukuda, Gakuto Kurata

Empirical Sampling from Latent Utterance-wise Evidence Model for Missing Data ASR based on Neural Encoder-Decoder Model
Ryu Takeda, Yui Sudo, Kazuhiro Nakadai, Kazunori Komatani

Coarse-Grained Attention Fusion With Joint Training Framework for Complex Speech Enhancement and End-to-End Speech Recognition
Xuyi Zhuang, Lu Zhang, Zehua Zhang, Yukun Qian, Mingjiang Wang

DENT-DDSP: Data-efficient noisy speech generator using differentiable digital signal processors for explicit distortion modelling and noise-robust speech recognition
Zixun Guo, Chen Chen, Eng Siong Chng

Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism
Kun Wei, Pengcheng Guo, Ning Jiang

Federated Self-supervised Speech Representations: Are We There Yet?
Yan Gao, Javier Fernandez-Marques, Titouan Parcollet, Abhinav Mehrotra, Nicholas Lane

Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation
Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation
Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe

Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition
Yoshiaki Bando, Takahiro Aizawa, Katsutoshi Itoyama, Kazuhiro Nakadai

A universally-deployable ASR frontend for joint acoustic echo cancellation, speech enhancement, and voice separation
Thomas R. O'Malley, Arun Narayanan, Quan Wang

Speaker conditioned acoustic modeling for multi-speaker conversational ASR
Srikanth Raj Chetupalli, Sriram Ganapathy

Hear No Evil: Towards Adversarial Robustness of Automatic Speech Recognition via Multi-Task Learning
Nilaksh Das, Polo Chau

Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription
Xianrui Zheng, Chao Zhang, Phil Woodland


ASR: Linguistic Components


Investigating the Impact of Crosslingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition
Muhammad Umar Farooq, Thomas Hain

An Improved Deliberation Network with Text Pre-training for Code-Switching Automatic Speech Recognition
Zhijie Shen, Wu Guo

CyclicAugment: Speech Data Random Augmentation with Cosine Annealing Scheduler for Automatic Speech Recognition
Zhihan Wang, Feng Hou, Yuanhang Qiu, Zhizhong Ma, Satwinder Singh, Ruili Wang

Prompt-based Re-ranking Language Model for ASR
Mengxi Nie, Ming Yan, Caixia Gong

Avoid Overfitting User Specific Information in Federated Keyword Spotting
Xin-Chun Li, Jin-Lin Tang, Shaoming Song, Bingshuai Li, Yinchuan Li, Yunfeng Shao, Le Gan, De-Chuan Zhan

ASR Error Correction with Constrained Decoding on Operation Prediction
Jingyuan Yang, Rongjun Li, Wei Peng

Adaptive multilingual speech recognition with pretrained models
Ngoc-Quan Pham, Alexander Waibel, Jan Niehues

Vietnamese Capitalization and Punctuation Recovery Models
Hoang Thi Thu Uyen, Nguyen Anh Tu, Ta Duc Huy

Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM
Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

Reducing multilingual context confusion for end-to-end code-switching automatic speech recognition
Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Jianhua Tao, Yu Ting Yeung, Liqun Deng

Residual Language Model for End-to-end Speech Recognition
Emiru Tsunoo, Yosuke Kashiwagi, Chaitanya Prasad Narisetty, Shinji Watanabe

An Empirical Study of Language Model Integration for Transducer based Speech Recognition
Huahuan Zheng, Keyu An, Zhijian Ou, Chen Huang, Ke Ding, Guanglu Wan

Self-Normalized Importance Sampling for Neural Language Modeling
Zijian Yang, Yingbo Gao, Alexander Gerstenberger, Jintao Jiang, Ralf Schlüter, Hermann Ney

Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model
Jennifer Fox, Natalie Delworth

Effect and Analysis of Large-scale Language Model Rescoring on Competitive ASR Systems
Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Nobuyasu Itoh, George Saon

Language-specific Characteristic Assistance for Code-switching Speech Recognition
Tongtong Song, Qiang Xu, Meng Ge, Longbiao Wang, Hao Shi, Yongjie Lv, Yuqin Lin, Jianwu Dang











Acoustic scene analysis


On Metric Learning for Audio-Text Cross-Modal Retrieval
Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark Plumbley, Wenwu Wang

CT-SAT: Contextual Transformer for Sequential Audio Tagging
Yuanbo Hou, Zhaoyi Liu, Bo Kang, Yun Wang, Dick Botteldooren

ADFF: Attention Based Deep Feature Fusion Approach for Music Emotion Recognition
Zi Huang, Shulei Ji, Zhilan Hu, Chuangjian Cai, Jing Luo, Xinyu Yang

Audio-Visual Scene Classification Based on Multi-modal Graph Fusion
Han Lei, Ning Chen

MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection
Chandan Reddy, Vishak Gopal, Harishchandra Dubey, Ross Cutler, Sergiy Matusevych, Robert Aichner

iCNN-Transformer: An improved CNN-Transformer with Channel-spatial Attention and Keyword Prediction for Automated Audio Captioning
Kun Chen, Jun Wang, Feng Deng, Xiaorui Wang

ATST: Audio Representation Learning with Teacher-Student Transformer
Xian Li, Xiaofei Li

Deep Segment Model for Acoustic Scene Classification
Yajian Wang, Jun Du, Hang Chen, Qing Wang, Chin-Hui Lee

Novel Augmentation Schemes for Device Robust Acoustic Scene Classification
Sukanya Sonowal, Anish Tamse

WideResNet with Joint Representation Learning and Data Augmentation for Cover Song Identification
Shichao Hu, Bin Zhang, Jinhong Lu, Yiliang Jiang, Wucheng Wang, Lingcheng Kong, Weifeng Zhao, Tao Jiang

Impact of Acoustic Event Tagging on Scene Classification in a Multi-Task Learning Framework
Rahil Parikh, Harshavardhan Sundar, Ming Sun, Chao Wang, Spyros Matsoukas

Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval
Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino



Speech Synthesis: Singing, Multimodal, Crosslingual Synthesis


Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis
Yu Wang, Xinsheng Wang, Pengcheng Zhu, Jie Wu, Hanzhao Li, Heyang Xue, Yongmao Zhang, Lei Xie, Mengxiao Bi

Exploring Timbre Disentanglement in Non-Autoregressive Cross-Lingual Text-to-Speech
Haoyue Zhan, Xinyuan Yu, Haitong Zhang, Yang Zhang, Yue Lin

WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses
Zewang Zhang, Yibin Zheng, Xinhui Li, Li Lu

Decoupled Pronunciation and Prosody Modeling in Meta-Learning-based Multilingual Speech Synthesis
Yukun Peng, Zhenhua Ling

KaraTuner: Towards End-to-End Natural Pitch Correction for Singing Voice in Karaoke
Xiaobin Zhuang, Huiran Yu, Weifeng Zhao, Tao Jiang, Peng Hu

Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher
Heyang Xue, Xinsheng Wang, Yongmao Zhang, Lei Xie, Pengcheng Zhu, Mengxiao Bi

SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy
Shuai Guo, Jiatong Shi, Tao Qian, Shinji Watanabe, Qin Jin

Muskits: an End-to-end Music Processing Toolkit for Singing Voice Synthesis
Jiatong Shi, Shuai Guo, Tao Qian, Tomoki Hayashi, Yuning Wu, Fangzheng Xu, Xuankai Chang, Huazhe Li, Peter Wu, Shinji Watanabe, Qin Jin

Pronunciation Dictionary-Free Multilingual Speech Synthesis by Combining Unsupervised and Supervised Phonetic Representations
Chang Liu, Zhen-Hua Ling, Ling-Hui Chen

Towards high-fidelity singing voice conversion with acoustic reference and contrastive predictive coding
Chao Wang, Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Yibiao Yu, Zejun Ma

Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information
Shaohuan Zhou, Shun Lei, Weiya You, Deyi Tuo, Yuren You, Zhiyong Wu, Shiyin Kang, Helen Meng

Normalization of code-switched text for speech synthesis
Sreeram Manghat, Sreeja Manghat, Tanja Schultz

Synthesizing Near Native-accented Speech for a Non-native Speaker by Imitating the Pronunciation and Prosody of a Native Speaker
Raymond Chung, Brian Mak

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion
Xu Li, Shansong Liu, Ying Shan


Applications in Transcription, Education and Learning II


Self-Supervised Learning with Multi-Target Contrastive Coding for Non-Native Acoustic Modeling of Mispronunciation Verification
Longfei Yang, Jinsong Zhang, Takahiro Shinozaki

L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis
Daniel Zhang, Ashwinkumar Ganesan, Sarah Campbell, Daniel Korzekwa

Challenges remain in Building ASR for Spontaneous Preschool Children Speech in Naturalistic Educational Environments
Satwik Dutta, Sarah Anne Tao, Jacob C. Reyna, Rebecca Elizabeth Hacker, Dwight W. Irvin, Jay F. Buzhardt, John H.L. Hansen

End-to-end Mispronunciation Detection with Simulated Error Distance
Zhan Zhang, Yuehai Wang, Jianyi Yang

BiCAPT: Bidirectional Computer-Assisted Pronunciation Training with Normalizing Flows
Zhan Zhang, Yuehai Wang, Jianyi Yang

Using Fluency Representation Learned from Sequential Raw Features for Improving Non-native Fluency Scoring
Kaiqi Fu, Shaojun Gao, Xiaohai Tian, Wei Li, Zejun Ma

An Alignment Method Leveraging Articulatory Features for Mispronunciation Detection and Diagnosis in L2 English
Qi Chen, BingHuai Lin, YanLu Xie

RefTextLAS: Reference Text Biased Listen, Attend, and Spell Model For Accurate Reading Evaluation
Phani Sankar Nidadavolu, Na Xu, Nick Jutila, Ravi Teja Gadde, Aswarth Abhilash Dara, Joseph Savold, Sapan Patel, Aaron Hoff, Veerdhawal Pande, Kevin Crews, Ankur Gandhe, Ariya Rastrow, Roland Maas

CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis
Nianzu Zheng, Liqun Deng, Wenyong Huang, Yu Ting Yeung, Baohua Xu, Yuanyuan Guo, Yasheng Wang, Xiao Chen, Xin Jiang, Qun Liu








Speech Synthesis: Speaking Style, Emotion and Accents I


Expressive, Variable, and Controllable Duration Modelling in TTS
Syed Ammar Abbas, Thomas Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman

Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis
Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yuki Saito, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari

Adversarial and Sequential Training for Cross-lingual Prosody Transfer TTS
Min-Kyung Kim, Joon-Hyuk Chang

FluentTTS: Text-dependent Fine-grained Style Control for Multi-style TTS
Changhwan Kim, Seyun Um, Hyungchan Yoon, Hong-Goo Kang

Few Shot Cross-Lingual TTS Using Transferable Phoneme Embedding
Wei-Ping Huang, Po-Chun Chen, Sung-Feng Huang, Hung-yi Lee

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alex Petelin, Jonathan Shen, Vincent Wan, Yu Zhang, Yonghui Wu, Robert Clark

Spoken-Text-Style Transfer with Conditional Variational Autoencoder and Content Word Storage
Daiki Yoshioka, Yusuke Yasuda, Noriyuki Matsunaga, Yamato Ohtani, Tomoki Toda

Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems
Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet

Cross-lingual Style Transfer with Conditional Prior VAE and Style Loss
Dino Ratcliffe, You Wang, Alex Mansbridge, Penny Karanasou, Alexis Moinet, Marius Cotescu

Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis
Julian Zaïdi, Hugo Seuté, Benjamin van Niekerk, Marc-André Carbonneau

Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems
Hyun-Wook Yoon, Ohsung Kwon, Hoyeon Lee, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim, Min-Jae Hwang

Text aware Emotional Text-to-speech with BERT
Arijit Mukherjee, Shubham Bansal, Sandeepkumar Satpal, Rupesh Mehta




Speech Emotion Recognition II


Coupled Discriminant Subspace Alignment for Cross-database Speech Emotion Recognition
Shaokai Li, Peng Song, Keke Zhao, Wenjing Zhang, Wenming Zheng

Performance Improvement of Speech Emotion Recognition by Neutral Speech Detection Using Autoencoder and Intermediate Representation
Jennifer Santoso, Takeshi Yamada, Kenkichi Ishizuka, Taiichi Hashimoto, Shoji Makino

A Graph Isomorphism Network with Weighted Multiple Aggregators for Speech Emotion Recognition
Ying Hu, Yuwu Tang, Hao Huang, Liang He

Speech Emotion Recognition via Generation using an Attention-based Variational Recurrent Neural Network
Murchana Baruah, Bonny Banerjee

Speech Emotion: Investigating Model Representations, Multi-Task Learning and Knowledge Distillation
Vikramjit Mitra, Hsiang-Yun Sherry Chien, Vasudha Kowtha, Joseph Yitan Cheng, Erdrin Azemi

Multiple Enhancements to LSTM for Learning Emotion-Salient Features in Speech Emotion Recognition
Desheng Hu, Xinhui Hu, Xinkang Xu

Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition
Zihan Zhao, Yanfeng Wang, Yu Wang

CTA-RNN: Channel and Temporal-wise Attention RNN leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition
Chengxin Chen, Pengyuan Zhang

Complex Paralinguistic Analysis of Speech: Predicting Gender, Emotions and Deception in a Hierarchical Framework
Alena Velichko, Maxim Markitantov, Heysem Kaya, Alexey Karpov

Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition
Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi

SpeechEQ: Speech Emotion Recognition based on Multi-scale Unified Datasets and Multitask Learning
Zuheng Kang, Junqing Peng, Jianzong Wang, Jing Xiao

Discriminative Feature Representation Based on Cascaded Attention Network with Adversarial Joint Loss for Speech Emotion Recognition
Yang Liu, Haoqin Sun, Wenbo Guan, Yuqi Xia, Zhen Zhao

Intra-speaker phonetic variation in read speech: comparison with inter-speaker variability in a controlled population
Nicolas Audibert, Cécile Fougeron




Low-Resource ASR Development II


Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device
Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Francoise Beaufays

Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion
Muhammad Umar Farooq, Darshan Adiga Haniya Narayana, Thomas Hain

The THUEE System Description for the IARPA OpenASR21 Challenge
Jing Zhao, Haoyu Wang, Jinpeng Li, Shuzhou Chai, Guanbo Wang, Guoguo Chen, Wei-Qiang Zhang

External Text Based Data Augmentation for Low-Resource Speech Recognition in the Constrained Condition of OpenASR21 Challenge
Guolong Zhong, Hongyu Song, Ruoyu Wang, Lei Sun, Diyuan Liu, Jia Pan, Xin Fang, Jun Du, Jie Zhang, Lirong Dai

Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, Ailbhe Ní Chasaide

Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR
Han Zhu, Li Wang, Gaofeng Cheng, Jindong Wang, Pengyuan Zhang, Yonghong Yan

Comparison of Unsupervised Learning and Supervised Learning with Noisy Labels for Low-Resource Speech Recognition
Yanick Schraner, Christian Scheller, Michel Plüss, Lukas Neukom, Manfred Vogel

Using cross-model learnings for the Gram Vaani ASR Challenge 2022
Tanvina Patel, Odette Scharenborg

ASR2K: Speech Recognition for Around 2000 Languages without Audio
Xinjian Li, Florian Metze, David R. Mortensen, Alan W Black, Shinji Watanabe

Combining Simple but Novel Data Augmentation Methods for Improving Conformer ASR
Ronit Damania, Christopher Homan, Emily Prud'hommeaux

OpenASR21: The Second Open Challenge for Automatic Speech Recognition of Low-Resource Languages
Kay Peterson, Audrey Tong, Yan Yu

DRAFT: A Novel Framework to Reduce Domain Shifting in Self-supervised Learning and Its Application to Children’s ASR
Ruchao Fan, Abeer Alwan

Plugging a neural phoneme recognizer into a simple language model: a workflow for low-resource setting
Séverine Guillaume, Guillaume Wisniewski, Benjamin Galliot, Minh-Châu Nguyên, Maxime Fily, Guillaume Jacques, Alexis Michaud









Spoken Language Processing I


Dynamic Sliding Window Modeling for Abstractive Meeting Summarization
Zhengyuan Liu, Nancy Chen

STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent
Yuki Saito, Yuto Nishimura, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

kidsTALC: A Corpus of 3- to 11-year-old German Children’s Connected Natural Speech
Lars Rumberg, Christopher Gebauer, Hanna Ehlert, Maren Wallbaum, Lena Bornholt, Jörn Ostermann, Ulrike Lüdtke

DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering
Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-wen Yang, Hsuan-Jui Chen, Shuyan Annie Dong, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Lin-shan Lee

Asymmetric Proxy Loss for Multi-View Acoustic Word Embeddings
Myunghun Jung, Hoi Rin Kim

Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech Translation
Chih-Chiang Chang, Hung-yi Lee

Building Vietnamese Conversational Smart Home Dataset and Natural Language Understanding Model
Thi Thu Trang Nguyen, Trung Duc Anh Dang, Quoc Viet Vu, Woomyoung Park

DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances
Sreyan Ghosh, Samden Lepcha, S Sakshi, Rajiv Ratn Shah, Srinivasan Umesh

Voice Activity Projection: Self-supervised Learning of Turn-taking Events
Erik Ekstedt, Gabriel Skantze

Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee

QbyE-MLPMixer: Query-by-Example Open-Vocabulary Keyword Spotting using MLPMixer
Jinmiao Huang, Waseem Gharbieh, Qianhui Wan, Han Suk Shim, Hyun Chul Lee

DyConvMixer: Dynamic Convolution Mixer Architecture for Open-Vocabulary Keyword Spotting
Waseem Gharbieh, Jinmiao Huang, Qianhui Wan, Han Suk Shim, Hyun Chul Lee

Challenges in Metadata Creation for Massive Naturalistic Team-Based Audio Data
Chelzy Belitz, John H.L. Hansen



Phonetics II


Native phonotactic interference in L2 vowel processing: Mouse-tracking reveals cognitive conflicts during identification
Yizhou Wang, Rikke Bundgaard-Nielsen, Brett Baker, Olga Maxwell

Mandarin nasal place assimilation revisited: an acoustic study
Mingqiong Luo

Bending the string: intonation contour length as a correlate of macro-rhythm
Constantijn Kaland

Eliciting and evaluating likelihood ratios for speaker recognition by human listeners under forensically realistic channel-mismatched conditions
Vincent Hughes, Carmen Llamas, Thomas Kettig

Reducing uncertainty at the score-to-LR stage in likelihood ratio-based forensic voice comparison using automatic speaker recognition systems
Bruce Xiao Wang, Vincent Hughes

Durational Patterning at Discourse Boundaries in Relation to Therapist Empathy in Psychotherapy
Jonathan Him Nok Lee, Dehua Tao, Harold Chui, Tan Lee, Sarah Luk, Nicolette Wing Tung Lee, Koonkan Fung

Convolutional Neural Networks for Classification of Voice Qualities from Speech and Neck Surface Accelerometer Signals
Sudarsana Reddy Kadiri, Farhad Javanmardi, Paavo Alku

Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis
Kei Furukawa, Takeshi Kishiyama, Satoshi Nakamura

Effects of Language Contact on Vowel Nasalization in Wenzhou and Rugao Dialects
Yan Li, Ying Chen, Xinya Zhang, Yanyang Chen, Jiazheng Wang

A blueprint for using deepfakes in sociolinguistic matched-guise experiments
Nathan Joel Young, David Britain, Adrian Leemann

Mandarin Tone Sandhi Realization: Evidence from Large Speech Corpora
Zuoyu Tian, Xiao Dong, Feier Gao, Haining Wang, Charles Lin

A Laryngographic Study on the Voice Quality of Northern Vietnamese Tones under the Lombard Effect
Giang Le, Chilin Shih, Yan Tang

The Prosody of Cheering in Sport Events
Marzena Zygis, Sarah Wesolek, Nina Hosseini-Kivanani, Manfred Krifka

Contribution of the glottal flow residual in affect-related voice transformation
Zihan Wang, Christer Gobl

High level feature fusion in forensic voice comparison
Michael Carne, Yuko Kinoshita, Shunichi Ishihara

Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data
Gasper Begus, Alan Zhou

Paraguayan Guarani: Tritonal pitch accent and Accentual Phrase
Sun-Ah Jun, Maria Luisa Zubizarreta

Low-resource Accent Classification in Geographically-proximate Settings: A Forensic and Sociophonetics Perspective
Qingcheng Zeng, Dading Chong, Peilin Zhou, Jie Yang


Source Separation III


Tiny-Sepformer: A Tiny Time-Domain Transformer Network For Speech Separation
Jian Luo, Jianzong Wang, Ning Cheng, Edward Xiao, Xulong Zhang, Jing Xiao

Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction
Zifeng Zhao, Rongzhi Gu, Dongchao Yang, Jinchuan Tian, Yuexian Zou

SepIt: Approaching a Single Channel Speech Separation Bound
Shahar Lutati, Eliya Nachmani, Lior Wolf

On the Use of Deep Mask Estimation Module for Neural Source Separation Systems
Kai Li, Xiaolin Hu, Yi Luo

Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches
Zifeng Zhao, Dongchao Yang, Rongzhi Gu, Haoran Zhang, Yuexian Zou

Embedding Recurrent Layers with Dual-Path Strategy in a Variant of Convolutional Network for Speaker-Independent Speech Separation
Xue Yang, Changchun Bao

Disentangling the Impacts of Language and Channel Variability on Speech Separation Networks
Fan-Lin Wang, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang

Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk in the Stereophonic Case
Amir Ivry, Israel Cohen, Baruch Berdugo

QDPN - Quasi-dual-path Network for single-channel Speech Separation
Joel Rixen, Matthias Renz

Conformer Space Neural Architecture Search for Multi-Task Audio Separation
Shun Lu, Yang Wang, Peng Yao, Chenxing Li, Jianchao Tan, Feng Deng, Xiaorui Wang, Chengru Song

ResectNet: An Efficient Architecture for Voice Activity Detection on Mobile Devices
Okan Köpüklü, Maja Taseska

Gated Convolutional Fusion for Time-Domain Target Speaker Extraction Network
Wenjing Liu, Chuan Xie

WA-Transformer: Window Attention-based Transformer with Two-stage Strategy for Multi-task Audio Source Separation
Yang Wang, Chenxing Li, Feng Deng, Shun Lu, Peng Yao, Jianchao Tan, Chengru Song, Xiaorui Wang

Multichannel Speech Separation with Narrow-band Conformer
Changsheng Quan, Xiaofei Li

Separating Long-Form Speech with Group-wise Permutation Invariant Training
Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei

Directed speech separation for automatic speech recognition of long form conversational speech
Rohit Paturi, Sundararajan Srinivasan, Katrin Kirchhoff, Daniel Garcia-Romero

Speech Separation for an Unknown Number of Speakers Using Transformers With Encoder-Decoder Attractors
Srikanth Raj Chetupalli, Emanuël Habets

Cooperative Speech Separation With a Microphone Array and Asynchronous Wearable Devices
Ryan Corey, Manan Mittal, Kanad Sarkar, Andrew C. Singer

Text-Driven Separation of Arbitrary Sounds
Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi

An Empirical Analysis on the Vulnerabilities of End-to-End Speech Segregation Models
Rahil Parikh, Gaspar Rochette, Carol Espy-Wilson, Shihab Shamma


Speech Enhancement and Intelligibility


TaylorBeamformer: Learning All-Neural Beamformer for Multi-Channel Speech Enhancement from Taylor’s Approximation Theory
Andong Li, Guochen Yu, Chengshi Zheng, Xiaodong Li

How bad are artifacts?: Analyzing the impact of speech enhancement errors on ASR
Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

Multi-source wideband DOA estimation method by frequency focusing and error weighting
Jing Zhou, Changchun Bao

Convolutional Recurrent Smart Speech Enhancement Architecture for Hearing Aids
Soha Nossier, Julie Wall, Mansour Moniri, Cornelius Glackin, Nigel Cannings

Fully Automatic Balance between Directivity Factor and White Noise Gain for Large-scale Microphone Arrays in Diffuse Noise Fields
Weixin Meng, Chengshi Zheng, Xiaodong Li

A Transfer and Multi-Task Learning based Approach for MOS Prediction
Xiaohai Tian, Kaiqi Fu, Shaojun Gao, Yiwei Gu, Kai Wang, Wei Li, Zejun Ma

Fusion of Self-supervised Learned Models for MOS Prediction
Zhengdong Yang, Wangjin Zhou, Chenhui Chu, Sheng Li, Raj Dabre, Raphael Rubino, Yi Zhao

Perceptual Contrast Stretching on Target Feature for Speech Enhancement
Rong Chao, Cheng Yu, Szu-wei Fu, Xugang Lu, Yu Tsao

A speech enhancement method for long-range speech acquisition task
Yanzhang Geng, Heng Wang, Tao Zhang, Xin Zhao

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding
Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

MTI-Net: A Multi-Target Speech Intelligibility Prediction Model
Ryandhimas Edo Zezario, Szu-wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

Steering vector correction in MVDR beamformer for speech enhancement
Suliang Bu, Yunxin Zhao, Tuo Zhao

Speech Modification for Intelligibility in Cochlear Implant Listeners: Individual Effects of Vowel- and Consonant-Boosting
Juliana N. Saba, John H.L. Hansen

DCTCN: Deep Complex Temporal Convolutional Network for Long Time Speech Enhancement
Ren Jigang, Mao Qirong

Improve Speech Enhancement using Perception-High-Related Time-Frequency Loss
Ding Zhao, Zhan Zhang, Bin Yu, Yuehai Wang


Speech Synthesis: Speaking Style, Emotion and Accents II


Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis
Raul Fernandez, David Haws, Guy Lorberbom, Slava Shechtman, Alexander Sorin

Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning
Rui Liu, Berrak Sisman, Björn Schuller, Guanglai Gao, Haizhou Li

Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis
Tao Li, Xinsheng Wang, Qicong Xie, Zhichao Wang, Mingqi Jiang, Lei Xie

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis
Yihan Wu, Xi Wang, Shaofei Zhang, Lei He, Ruihua Song, Jian-Yun Nie

Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis
Zhaoci Liu, Ningqian Wu, Yajie Zhang, Zhenhua Ling

Automatic Prosody Annotation with Pre-Trained Text-Speech Model
Ziqian Dai, Jianwei Yu, Yan Wang, Nuo Chen, Yanyao Bian, GuangZhi Li, Deng Cai, Dong Yu

Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis
Yixuan Zhou, Changhe Song, Jingbei Li, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng

Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis
Shun Lei, Yixuan Zhou, Liyang Chen, Jiankun Hu, Zhiyong Wu, Shiyin Kang, Helen Meng

Towards Cross-speaker Reading Style Transfer on Audiobook Dataset
Xiang Li, Changhe Song, Xianhao Wei, Zhiyong Wu, Jia Jia, Helen Meng

CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
Yi Meng, Xiang Li, Zhiyong Wu, Tingtian Li, Zixun Sun, Xinyu Xiao, Chi Sun, Hui Zhan, Helen Meng

Improve emotional speech synthesis quality by learning explicit and implicit representations with semi-supervised training
Jiaxu He, Cheng Gong, Longbiao Wang, Di Jin, Xiaobao Wang, Junhai Xu, Jianwu Dang



