Online Archive

Online Seminars

ISCA INTERNATIONAL VIRTUAL SEMINARS

A seminar programme is an important part of the life of a research lab, especially for its research students, but it's difficult for scientists to travel to give talks at the moment. However, presentations may be given online and, paradoxically, it may thus be possible for labs to engage international speakers whom they wouldn't normally be able to afford.

ISCA has set up a pool of speakers prepared to give on-line talks. In this way we can enhance the experience of students working in our field, often in difficult conditions.

Speakers may pre-record their talks if they wish, but they don't have to. It is up to the host lab to contact speakers and make the arrangements. Talks can be state-of-the-art, or tutorials.

If you make use of this scheme and arrange a seminar, please send brief details (lab, speaker, date) to (email address hidden).

The scheme complements the ISCA Distinguished Lecturers programme.

If you wish to join the scheme as a speaker, we need a title, a short abstract, a one-paragraph biography and contact details. Please send them to (email address hidden).

The speakers and their titles are listed below. Further details follow.

 

Speaker – Title(s)

Jean-Luc Schwartz – The perceptuo-motor nature of speech communication units, in light of phonetic knowledge, Bayesian computational models and neurocognitive data
Roger Moore – Talk #1: Talking with Robots: Are We Nearly There Yet? / Talk #2: A needs-driven cognitive architecture for future ‘intelligent’ communicative agents
Martin Cooke – The perception of distorted speech
Sakriani Sakti – Semi-supervised Learning for Low-resource Multilingual and Multimodal Speech Processing with Machine Speech Chain
John Hansen – Robust Diarization in Naturalistic Audio Streams: Recovering the Apollo Mission Control Audio
Thomas Hueber – Articulatory-acoustic modeling for assistive speech technologies: a focus on silent speech interfaces and biofeedback systems
Karen Livescu – Recognition of Fingerspelled Words in American Sign Language in the Wild
Odette Scharenborg – Talk 1: Reaching over the gap: Cross- and interdisciplinary research on human and automatic speech processing / Talk 2: Speech representations and processing in deep neural networks
Shrikanth (Shri) Narayanan – Talk 1: Sounds of the human vocal instrument / Talk 2: Computational Media Intelligence: Human-centered Machine Analysis of Media / Talk 3: Multimodal Behavioral Machine Intelligence for health applications
Ann Bradlow – Second-language Speech Recognition by Humans and Machines
Shinji Watanabe – Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition
Giuseppe Riccardi – Empathy in Human Spoken Conversations
Bettina Braun – The use of active learning systems for stimulus selection and data modelling in complex behavioral study designs
Amalia Arvaniti – Forty years of the autosegmental-metrical theory of intonational phonology: an update and critical review in light of recent findings
Eric Fosler-Lussier – Low resourced but long tailed spoken dialogue system building
Heiga Zen – Model-based text-to-speech synthesis
Ralf Schlueter – Automatic Speech Recognition in a State-of-Flux

 

Jean-Luc Schwartz

Contact: (email address hidden)

Jean-Luc Schwartz, a Research Director at CNRS, studies speech perception, perceptuo-motor interactions in speech communication, the phonetic bases of phonological systems and the emergence of language, with publications in generalist journals (Science Advances, Proceedings of the National Academy of Sciences USA) and specialized journals in cognitive psychology (e.g. Cognition, Perception & Psychophysics, Psychological Review, Behavioral & Brain Sciences, Hearing Research), neurosciences (e.g. Neuroimage, Human Brain Mapping, PLOS Comp. Biol., Brain & Language), signal processing and computational modelling (e.g. IEEE Trans. Speech and Audio Processing, JASA, Computer Speech and Language, Language and Cognitive Processes, Neural Computation), and phonetics in relation to phonology (e.g. Journal of Phonetics, Phonetica, Laboratory Phonology). He has been involved in many national and European projects, and was the PI of an ERC Advanced Grant called “Speech Unit(e)s – The multisensory-motor unity of speech”.

 

The perceptuo-motor nature of speech communication units, in light of phonetic knowledge, Bayesian computational models and neurocognitive data

Jean-Luc Schwartz, GIPSA-lab, Grenoble, France, CNRS – Univ. Grenoble Alpes

The quest for phonetic invariance has run through the speech communication literature for more than 50 years, motivating theories, generating experiments and resulting in the analysis of many laboratory and real-life phonetic data. In the last 15 years we have been developing a perceptuo-motor framework that jointly addresses the nature of speech perception and production processes and claims that speech communication units, emerging from the co-structuration of perception and action in the course of speech development, are neither a sound nor a gesture, but a perceptually-shaped gesture, that is, a perceptuo-motor unit characterized by both its articulatory coherence – provided by its gestural nature – and its perceptual value – necessary for being functional (Schwartz et al., 2012).

In this talk, starting from phonetic arguments in favor of this “perceptuo-motor theory of speech communication”, I will present a Bayesian computational modeling framework called COSMO (“Communication Objects by Sensori-Motor Operations”), which has enabled us to address a number of questions related to the nature, development and perceptual processing of speech units, in relation to neurocognitive data on speech perception in the human brain.


 

Roger Moore

Contact: (email address hidden)
Web: http://staffwww.dcs.shef.ac.uk/people/R.K.Moore/

Prof. Moore has over 40 years’ experience in Speech Technology R&D and, although an engineer by training, much of his research has been based on insights from human speech perception and production. As Head of the UK Government's Speech Research Unit from 1985 to 1999, he was responsible for the development of the Aurix range of speech technology products and the subsequent formation of 20/20 Speech Ltd. Since 2004 he has been Professor of Spoken Language Processing at the University of Sheffield, and he also holds Visiting Chairs at Bristol Robotics Laboratory and University College London Psychology & Language Sciences. He was President of the European/International Speech Communication Association from 1997 to 2001, General Chair for INTERSPEECH-2009 and ISCA Distinguished Lecturer during 2014-15. In 2017 he organised the first international workshop on ‘Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR)’. Prof. Moore is the current Editor-in-Chief of Computer Speech & Language. In 2016 he was awarded the LREC Antonio Zampolli Prize for "Outstanding Contributions to the Advancement of Language Resources & Language Technology Evaluation within Human Language Technologies”, and in 2020 he was given the International Speech Communication Association Special Service Medal for "service in the establishment, leadership and international growth of ISCA".

 

Talk #1: Talking with Robots: Are We Nearly There Yet?

Abstract: Recent years have seen considerable progress in the deployment of 'intelligent' communicative agents such as Apple's Siri and Amazon’s Alexa. However, effective speech-based human-robot dialogue is less well developed; not only do the fields of robotics and spoken language technology present their own special problems, but their combination raises an additional set of issues. In particular, there appears to be a large gap between the formulaic behaviour that typifies contemporary spoken language dialogue systems and the rich and flexible nature of human-human conversation. As a consequence, we still seem to be some distance away from creating Autonomous Social Agents such as robots that are truly capable of conversing effectively with their human counterparts in real world situations. This talk will address these issues and will argue that we need to go far beyond our current capabilities and understanding if we are to move from developing robots that simply talk and listen to evolving intelligent communicative machines that are capable of entering into effective cooperative relationships with human beings.

Talk #2: A needs-driven cognitive architecture for future ‘intelligent’ communicative agents


Abstract: Recent years have seen considerable progress in the deployment of ‘intelligent’ communicative agents such as Apple’s Siri, Google Now, Microsoft’s Cortana and Amazon’s Alexa. Such speech-enabled assistants are distinguished from the previous generation of voice-based systems in that they claim to offer access to services and information via conversational interaction. In reality, interaction has limited depth and, after initial enthusiasm, users revert to more traditional interface technologies. This talk argues that the standard architecture for a contemporary communicative agent fails to capture the fundamental properties of human spoken language. So an alternative needs-driven cognitive architecture is proposed which models speech-based interaction as an emergent property of coupled hierarchical feedback control processes. The implications for future spoken language systems are discussed.


 

Martin Cooke

Contact: (email address hidden)

Martin Cooke is Ikerbasque Research Professor in the Language and Speech Lab at the University of the Basque Country, Spain. After starting his career at the UK National Physical Laboratory, he worked at the University of Sheffield for 26 years before taking up his current position. His research has focused on computational auditory scene analysis, human speech perception and algorithms for robust automatic speech recognition. His interest in these domains also extends to the effects of noise on speech production, as well as second-language listening and acquisition models. He currently coordinates the EU Marie Curie Network ENRICH, which focuses on listening effort.

The perception of distorted speech

Listeners are capable of accommodating a staggering variety of speech forms, including those that bear little superficial resemblance to canonical speech. Speech can be understood on the basis of a mere pair of synthetic formants sent to different ears, or from three time-varying sinewaves, or from four bands of modulated noise. Surprisingly high levels of accuracy can be obtained from the summed output of two exceedingly narrow filters at the extremes of the frequency range of speech, or after interchanging the fine structure with a non-speech signal such as music. Any coherent theory of human speech perception must be able to account not just for the processing of canonical speech but also to explain how an individual listener, perhaps aided by a period of perceptual learning, is capable of handling all of these disparate distorted forms of speech. In the first part of the talk I'll review a century's worth of distorted speech types, suggest some mechanisms listeners might be using to accomplish this feat, and provide anecdotal evidence for a human-machine performance gap for these speech types. In the second part I'll present results from two recent studies in my lab, one concerning a new form of distortion I call sculpted speech, the second looking at the fine time course of perceptual adaptation to eight types of distortion. I'll conclude with some pointers to what the next generation of machine listening might learn from human abilities to process these extremely variable forms.


 

Sakriani Sakti

Contact: (email address hidden)


Sakriani Sakti is currently a research associate professor at the Nara Institute of Science and Technology (NAIST) and a research scientist at the RIKEN Center for Advanced Intelligence Project (RIKEN AIP), Japan. She received the DAAD-Siemens Program Asia 21st Century Award in 2000 to study Communication Technology at the University of Ulm, Germany, and received her M.Sc. degree in 2002. During her thesis work, she worked with the Speech Understanding Department at the DaimlerChrysler Research Center, Ulm, Germany. She then worked as a researcher at the ATR Spoken Language Communication (SLC) Laboratories, Japan, in 2003-2009 and the NICT SLC Groups, Japan, in 2006-2011, where she helped establish multilingual speech recognition for speech-to-speech translation. While working with ATR and NICT, she continued her studies (2005-2008) with the Dialog Systems Group at the University of Ulm, and received her Ph.D. degree in 2008. She was actively involved in international collaboration activities such as the Asian Pacific Telecommunity Project (2003-2007) and various speech-to-speech translation research projects, including A-STAR and U-STAR (2006-2011). In 2011-2017, she was an assistant professor at the Augmented Human Communication Laboratory, NAIST, Japan. She also served as a visiting scientific researcher at INRIA Paris-Rocquencourt, France, in 2015-2016, under the JSPS Strategic Young Researcher Overseas Visits Program for Accelerating Brain Circulation. Since January 2018, she has served as a research associate professor at NAIST and a research scientist at RIKEN AIP, Japan. She is a member of JNS, SFN, ASJ, ISCA, IEICE, and IEEE, and is currently a committee member of the IEEE SLTC (2021-2023) and an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (2020-2023). She was a board member of Spoken Language Technologies for Under-resourced Languages (SLTU) and the general chair of SLTU 2016.
She was also the general chair of the "Digital Revolution for Under-resourced Languages (DigRevURL)" Workshop, an Interspeech Special Session, in 2017 and of DigRevURL Asia in 2019, and served on the organizing committees of the Zero Resource Speech Challenge 2019 and 2020. She was also involved in creating the joint ELRA and ISCA Special Interest Group on Under-resourced Languages (SIGUL) and has served on the SIGUL Board since 2018. In collaboration with UNESCO and ELRA, she was also on the organizing committee of the International Conference on "Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide". Her research interests lie in deep learning & graphical model frameworks, statistical pattern recognition, zero-resourced speech technology, multilingual speech recognition and synthesis, spoken language translation, social-affective dialog systems, and cognitive communication.

Semi-supervised Learning for Low-resource Multilingual and Multimodal Speech Processing with Machine Speech Chain

The development of advanced spoken language technologies based on automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has enabled computers to learn how to either listen or speak. Many applications and services are now available, but they still support fewer than 100 languages. Nearly 7000 living languages, spoken by 350 million people, remain uncovered. This is because such systems are commonly constructed with machine learning trained in a supervised fashion, which requires a large amount of paired speech and corresponding transcriptions.
In this talk, we will introduce a semi-supervised learning mechanism based on a machine speech chain framework. First, we describe the primary machine speech chain architecture, which learns not only to listen or speak but also to listen while speaking. The framework enables ASR and TTS to teach each other given unpaired data. After that, we describe the use of the machine speech chain for code-switching and cross-lingual ASR and TTS in several languages, including low-resourced ethnic languages. Finally, we describe the recent multimodal machine chain that mimics overall human communication by listening while speaking and visualizing. With the support of image captioning and production models, the framework enables ASR and TTS to improve their performance using an image-only dataset.
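The closed-loop idea in the abstract can be sketched in a few lines: given unpaired text, TTS "speaks", ASR "listens" to the result, and the disagreement between input and reconstruction is the signal that lets the two models teach each other. This is a schematic toy under stated assumptions, not the actual speech-chain implementation; the stand-in models and names (`toy_tts`, `toy_asr`, `reconstruction_loss`) are purely illustrative.

```python
# Schematic sketch of the machine speech chain's closed loop (text side).
# All models here are trivial stand-ins, not trained networks.

def toy_tts(text):
    """Stand-in TTS: map each character to a fake 'acoustic frame' (its code point)."""
    return [ord(c) for c in text]

def toy_asr(frames):
    """Stand-in ASR: map fake acoustic frames back to characters."""
    return "".join(chr(f) for f in frames)

def reconstruction_loss(original, reconstructed):
    """Character-level disagreement: the signal that would drive semi-supervised updates."""
    mismatches = sum(a != b for a, b in zip(original, reconstructed))
    return mismatches + abs(len(original) - len(reconstructed))

def speech_chain_step(unpaired_text):
    """One text-side pass of the chain: TTS speaks, ASR listens, compare."""
    speech = toy_tts(unpaired_text)      # synthesize from unpaired text
    transcript = toy_asr(speech)         # transcribe the synthesized speech
    return reconstruction_loss(unpaired_text, transcript)

print(speech_chain_step("hello"))  # → 0 (perfect round trip for these toy models)
```

The symmetric speech-side pass (ASR transcribes unpaired speech, TTS re-synthesizes, compare waveform features) completes the loop; in the real framework both losses back-propagate into the respective neural models.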

 


 

John Hansen

Contact: (email address hidden)

John H.L. Hansen received his Ph.D. and M.S. degrees from the Georgia Institute of Technology, and his B.S.E.E. degree from Rutgers Univ. He joined the Univ. of Texas at Dallas (UTDallas) in 2005, where he is Associate Dean for Research, Professor of Electrical & Computer Engineering, Distinguished Univ. Chair in Telecommunications Engineering, and holds a joint appointment in the School of Behavioral & Brain Sciences (Speech & Hearing). At UTDallas, he established the Center for Robust Speech Systems (CRSS). He is an ISCA Fellow, IEEE Fellow, past Member and TC-Chair of the IEEE Signal Proc. Society Speech & Language Proc. Tech. Comm. (SLTC), and Technical Advisor to the U.S. Delegate for NATO (IST/TG-01). He currently serves as ISCA President. He has supervised 92 PhD/MS thesis candidates, was the recipient of the 2020 UT-Dallas Provost’s Award for Graduate Research Mentoring and the 2005 Univ. Colorado Teacher Recognition Award, and is author/co-author of more than 750 journal/conference papers in the field of speech/language/hearing processing & technology. He served as General Chair for Interspeech-2002, Co-Organizer and Technical Chair for IEEE ICASSP-2010, and Co-General Chair and Organizer for the IEEE Workshop on Spoken Language Technology (SLT-2014) (Lake Tahoe, NV). He is serving as Co-Chair for ISCA INTERSPEECH-2022 and Technical Chair for IEEE ICASSP-2024.

Robust Diarization in Naturalistic Audio Streams: Recovering the Apollo Mission Control Audio

Speech technology has advanced significantly beyond general speech recognition for voice command and telephony applications. Today, the emergence of big data, machine learning and voice-enabled speech systems has created the need for effective voice capture and automatic speech/speaker recognition. The ability to employ speech and language technology to assess human-to-human interactions is opening up new research paradigms which can have a profound impact on assessing human interaction. In this talk, we will focus on big-data audio processing relating to the Apollo lunar missions. ML-based technology advancements include automatic audio diarization and speaker recognition for audio streams which include multiple tracks, speakers and environments. CRSS-UTDallas built a recovery solution for lost 30-track audio tapes from NASA Apollo-11, resulting in massive multi-track audio processing of 19,000 hours of data. Recent additional support from NSF will allow for the recovery and organization of an additional 150,000 hours of mission data to be shared with the communities of (i) speech/language technology, (ii) STEM/science and team-based researchers, and (iii) education/historical/archiving specialists.


 

Thomas Hueber

Contact: (email address hidden)

Dr. Thomas Hueber has been a tenured CNRS researcher at GIPSA-lab (Grenoble, France) since 2011, where he heads the "Cognitive Robotics, Interactive Systems and Speech Processing" (CRISSP) team. He holds an engineering degree and an M.Sc. in Signal Processing from the University of Lyon (2006), a Ph.D. in Computer Science from Pierre and Marie Curie University, Paris (2009), and an HDR (accreditation to supervise research) from Grenoble-Alpes University (2019). His research activities deal with multimodal speech processing, with a special interest in assistive technologies that exploit speech articulatory gestures and physiological activities. He has coauthored 17 articles in peer-reviewed international journals, more than 35 articles in peer-reviewed international conferences, 3 book chapters and one patent. In 2011 he received the 6th Christian Benoit award (ISCA/AFCP). In 2017, he co-edited the special issue on biosignal-based speech processing in IEEE/ACM Trans. Audio, Speech and Language Processing.

Articulatory-acoustic modeling for assistive speech technologies: a focus on silent speech interfaces and biofeedback systems

Speech production is a complex motor process involving several physiological phenomena, such as the neural, nervous and muscular activities that drive our respiratory, laryngeal and articulatory systems. Over the last 15 years, an increasing number of studies have proposed to rely on these activities to build devices that could restore oral communication when a part of the speech production chain is damaged, or that could help rehabilitate speech sound disorders. In this talk, I will focus on two lines of research: 1) silent speech interfaces, which convert speech articulatory movements into text or synthetic speech, and 2) biofeedback systems, which provide visual information about the tongue for speech therapy and language learning. I will give an overview of the literature in these fields, which face common challenges and share methodological frameworks. I will present some of our recent contributions, with a focus on experimental techniques to capture multimodal speech-related signals, machine learning algorithms to model articulatory-acoustic relationships, and clinical evaluation of real-time prototypes.


 

Karen Livescu

Contact: (email address hidden)

Karen Livescu is an Associate Professor at TTI-Chicago.  She completed her PhD in electrical engineering and computer science at MIT.   Her main research interests are in speech and language processing, as well as related problems in machine learning.  Her recent work includes unsupervised and multi-view representation learning, acoustic word embeddings, visually grounded speech modeling, and automatic sign language recognition.  She is a 2021 IEEE SPS Distinguished Lecturer.  Other recent professional activities include serving as a program chair of ICLR 2019, a technical chair of ASRU 2015/2017/2019, and Associate Editor for IEEE T-PAMI and IEEE OJ-SP.

Recognition of Fingerspelled Words in American Sign Language in the Wild

Sign languages, consisting of sequences of handshapes and gestures, are a chief means of communication among deaf people around the world. Automatic recognition and translation of sign languages would help facilitate communication between deaf and hearing individuals. It could also enable services such as search and retrieval in deaf social and news video media. Our recent work has focused on detecting and recognizing fingerspelling in American Sign Language (ASL). Fingerspelling is a component of ASL in which words are signed by a series of handshapes or trajectories corresponding to single letters in the English alphabet. Fingerspelling accounts for up to 35% of ASL, and often appears in technical language and language involving names and current events. Detecting and transcribing the fingerspelled portions of sign language video could add a great deal of value, since these portions are often dense in content words. Our work addresses the problem of fingerspelling recognition “in the wild”, using video data sets we have collected from online media. These data sets (the Chicago Fingerspelling in the Wild data sets) are the largest available so far for fingerspelling recognition, and the first using naturally occurring video data. In this challenging natural setting, with both visual challenges and extensive coarticulation, typical computer vision techniques for pose estimation and hand detection often fail. We have developed approaches for fingerspelling detection and recognition based on encoder-decoder and connectionist temporal classification (CTC) models, enabled by new techniques for "soft" tracking that do not assume the availability of high-performance vision models for pose estimation or hand detection. This talk will describe our recent work as well as the broader context of general sign language processing.


 

Odette Scharenborg

Contact: (email address hidden)

Multimedia Computing Group
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology
The Netherlands
Twitter: Oscharenborg
Website: https://odettescharenborg.wordpress.com

Odette Scharenborg is an Associate Professor and Delft Technology Fellow at Delft University of Technology, the Netherlands. Her research focuses on narrowing the gap between automatic and human spoken-word recognition. In particular, she is interested in where the difference between human and machine recognition performance originates, and in whether it is possible to narrow this performance gap. In her research she combines different methodologies, ranging from human listening experiments to computational modelling and deep learning. Odette co-organized the Interspeech 2008 Consonant Challenge, which aimed at promoting comparisons of human and machine speech recognition in noise. In 2017, she was elected onto the ISCA Board, and in 2018 onto the IEEE Speech and Language Processing Technical Committee. She is an associate editor of IEEE Signal Processing Letters and a member of the European Laboratory for Learning and Intelligent Systems (ELLIS) unit Delft. She has served as an area chair of Interspeech since 2015 and is currently on the Technical Programme Committee of Interspeech 2021 Brno.



Reaching over the gap: Cross- and interdisciplinary research on human and automatic speech processing.

The fields of human speech recognition (HSR) and automatic speech recognition (ASR) both investigate parts of the speech recognition process and have word recognition as their central issue. Although the research fields appear closely related, their aims and research methods are quite different. Despite these differences, the past two decades have seen a growing interest in possible cross-fertilisation, with researchers from both ASR and HSR realising the potential benefit of looking at the research field on the other side of the ‘gap’. In this survey talk, I will provide an overview of past and present efforts to link human and automatic speech recognition research, and present an overview of the literature describing the performance difference between machines and human listeners. The focus of the talk is on the mutual benefits to be derived from establishing closer collaborations and knowledge interchange between ASR and HSR.


Speech representations and processing in deep neural networks.

Speech recognition is the mapping of a continuous, highly variable speech signal onto discrete, abstract representations. The question of how speech is represented and processed in the human brain and in automatic speech recognition (ASR) systems, although crucial to both the field of human speech processing and the field of automatic speech processing, has historically been investigated in the two fields separately. I will argue that comparisons between humans and deep neural network (DNN)-based ASR systems, and cross-fertilization of the two research fields, can provide valuable insights into the way humans process speech and improve ASR technology. Specifically, I will present results of several experiments carried out on both human listeners and DNN-based ASR systems on the representation and processing of speech. In order to investigate the speech representations and adaptation processes in the DNN-based ASR systems, we visualized the activations in the hidden layers of the DNN. These visualizations reveal that DNNs use speech representations that are similar to those used by human listeners, without being explicitly taught to do so.
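The probing idea behind such visualizations is simply to run the network while retaining every hidden layer's activation vector, then plot or compare those vectors across inputs. A minimal sketch, assuming a tiny hand-made feed-forward network; the layer sizes and weights are illustrative, not taken from any trained ASR model.

```python
# Run a tiny feed-forward net and keep each hidden layer's activations
# so they can later be visualized or compared across inputs.

def relu(x):
    return [max(0.0, v) for v in x]

def dense(x, weights, bias):
    """One fully connected layer; weights is out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def forward_with_activations(x, layers):
    """Return final output plus the activation vector of every layer."""
    activations = []
    for weights, bias in layers:
        x = relu(dense(x, weights, bias))
        activations.append(x)
    return x, activations

# Two tiny layers: 3 inputs -> 2 hidden units -> 2 outputs
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.25, 0.75]], [0.0, 0.0]),
]
output, acts = forward_with_activations([1.0, 0.5, -1.0], layers)
print(len(acts), [len(a) for a in acts])  # one activation vector per layer
```

In practice the same hook-style extraction is done on real acoustic models, and the per-layer vectors are reduced (e.g. with PCA or t-SNE) before plotting to see how phone categories cluster layer by layer.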


 

Shrikanth (Shri) Narayanan

Contact: (email address hidden)

University of Southern California, Los Angeles, CA

Signal Analysis and Interpretation Laboratory

https://sail.usc.edu/people/shri.html

Shrikanth (Shri) Narayanan is University Professor and Niki & C. L. Max Nikias Chair in Engineering at the University of Southern California, where he is Professor of Electrical & Computer Engineering, Computer Science, Linguistics, Psychology, Neuroscience, Pediatrics, and Otolaryngology—Head & Neck Surgery, Director of the Ming Hsieh Institute and Research Director of the Information Sciences Institute. Prior to USC he was with AT&T Bell Labs and AT&T Research. His research focuses on human-centered information processing and communication technologies. He is a Fellow of the National Academy of Inventors, the Acoustical Society of America, IEEE, ISCA, the American Association for the Advancement of Science (AAAS), the Association for Psychological Science, and the American Institute for Medical and Biological Engineering (AIMBE). He is a recipient of several honors, including the 2015 Engineers Council’s Distinguished Educator Award, a Mellon award for mentoring excellence, the 2005 and 2009 Best Journal Paper awards from the IEEE Signal Processing Society and service as its Distinguished Lecturer for 2010-11, a 2018 ISCA CSL Best Journal Paper award and service as an ISCA Distinguished Lecturer for 2015-16, the Willard R. Zemlin Memorial Lectureship for ASHA in 2017, and the Ten Year Technical Impact Award in 2014 and the Sustained Accomplishment Award in 2020 from ACM ICMI. He has published over 900 papers and has been granted seventeen U.S. patents. His research and inventions have led to technology commercialization, including through startups he co-founded: Behavioral Signals Technologies, focused on telecommunication services and the AI-based conversational assistance industry, and Lyssn, focused on mental health care delivery, treatment and quality assurance.

Sounds of the human vocal instrument

The vocal tract is the universal human instrument, played with great dexterity to produce the elegant acoustic structuring of speech, song and other sounds that communicate intent and emotions. The sounds produced by the vocal instrument also carry crucial information about individual identity and the state of health and wellbeing. A longstanding research challenge has been to improve our understanding of how vocal tract structure and function interact, and notably to illuminate the variant and invariant aspects of speech (and beyond) within and across individuals. The first part of the talk will highlight engineering advances that allow us to investigate the human vocal tract in action, from capturing the dynamics of vocal production using novel real-time magnetic resonance imaging to machine learning based articulatory-audio modeling, and thereby offer insights into how we produce sounds with the vocal instrument. The second part of the talk will highlight some scientific, technological and clinical applications using such multimodal data-driven approaches in the study of the human vocal instrument.

Computational Media Intelligence: Human-centered Machine Analysis of Media

Media is created by humans, for humans, to tell stories. There is a natural and pressing need for human-centered media analytics that illuminate the stories being told and their human impact. Objective, rich media content analysis has numerous applications for different stakeholders, from creators and decision/policy makers to consumers. Advances in multimodal signal processing and machine learning enable detailed and nuanced characterization of media content: of what, who, how, and why; they also help understand and predict impact, from individual (emotional) experiences to broader societal consequences.

Emerging advances have enabled us to measure the various multimodal facets of media and answer these questions on a global scale. Today, deep learning algorithms can analyze entertainment media (movies, TV) to quantify gender, age and race representation, measuring how often women and underrepresented minorities appear in scenes and how often they speak, creating awareness in objective ways not possible before. Text mining and natural language processing (NLP) algorithms can analyze language use in movie scripts and dialog interactions to track patterns of who is interacting with whom and how, and to study trends in their adoption by different communities. Moreover, advances in human sensing allow for directly measuring the influence and impact of media on an individual's physiology (and brain), while progress in social media measurement enables tracking the spread and social impact of media content across social communities.

This talk will focus on the opportunities and advances in human-centered media intelligence, drawing examples from media for entertainment (e.g., movies) and commerce (e.g., advertisements). It will highlight multimodal processing of audio, video and text streams, and of other metadata associated with content creation, to provide insights into media stories, including human-centered trends and patterns such as unconscious biases along dimensions of gender, race and age, as well as associated social aspects (e.g., violence) and commercial aspects (e.g., box office returns).

Multimodal Behavioral Machine Intelligence for health applications

The convergence of sensing, communication and computing technologies — most dramatically witnessed in the global proliferation of smartphones, and IoT deployments — offers tremendous opportunities for continuous acquisition, analysis and sharing of diverse, information-rich yet unobtrusive time series data that provide a multimodal, spatiotemporal characterization of an individual’s behavior and state, and of the environment within which they operate. This has in turn enabled hitherto unimagined possibilities for understanding and supporting various aspects of human functioning in realms ranging from health and well-being to job performance.

These include data that afford the analysis and interpretation of multimodal cues of verbal and non-verbal human behavior, facilitating human behavioral research and its translational applications in healthcare. Such data carry crucial information not only about a person's intent, identity and traits, but also about underlying attitudes, emotions and other mental-state constructs. Automatically capturing these cues, although vastly challenging, offers the promise not just of efficient data processing, but of tools for discovery that enable hitherto unimagined scientific insights, and of means for supporting diagnostics and interventions.

Recent computational approaches that make judicious use of both data and knowledge have yielded significant advances in this regard, for example in deriving rich, context-aware information from multimodal signal sources including human speech, language, and videos of behavior. These can even be complemented and integrated with data about human brain and body physiology. This talk will focus on some of the advances and challenges in gathering such data and creating algorithms for machine processing of such cues. It will highlight some of our ongoing efforts in Behavioral Signal Processing (BSP)—technology and algorithms for quantitatively and objectively understanding typical, atypical and distressed human behavior—with a specific focus on communicative, affective and social behavior. The talk will illustrate Behavioral Informatics applications of these techniques that contribute to quantifying higher-level, often subjectively described, human behavior in a domain-sensitive fashion. Examples will be drawn from mental health and well-being realms such as Autism Spectrum Disorder, couple therapy, depression, suicidality, and addiction counseling.


 

Ann Bradlow

This email address is being protected from spambots. You need JavaScript enabled to view it. 

 

Ann Bradlow received her PhD in Linguistics from Cornell University in 1993.  She completed postdoctoral fellowships in Psychology at Indiana University (1993-1996) and Hearing Science at Northwestern University (1996-1998).  Since 1998, Bradlow has been a faculty member in the Linguistics Department at Northwestern University (USA) where she directs the Speech Communication Research Group (SCRG).  The SCRG pursues an interdisciplinary research program in acoustic phonetics and speech perception with a focus on speech intelligibility under conditions of talker-, listener-, and situation-related variability.  A central line of current work investigates causes and consequences of divergent patterns of first-language (L1) and second-language (L2) speech production and perception.

Second-language Speech Recognition by Humans and Machines

This presentation will consider the causes, characteristics, and consequences of second-language (L2) speech production through the lens of a talker-listener alignment model. Rather than treating L2 speech as deviant from the L1 target, this model views speech communication as a cooperative activity in which interlocutors adjust their speech production and perception in a bi-directional, dynamic manner. Three lines of support will be presented. First, principled accounts of salient acoustic-phonetic markers of L2 speech will be developed with reference to language-general challenges of L2 speech production and to language-specific L1-L2 structural interactions. Next, I will examine recognition of L2 speech by listeners from various language backgrounds, noting in particular that for L2 listeners, L2 speech can be as intelligible as (or sometimes more intelligible than) L1 speech. Finally, I will examine perceptual adaptation to L2 speech by L1 listeners, highlighting studies that focus on interactive, dialogue-based test settings where we can observe the dynamics of talker adaptation to the listener and vice versa. Throughout this survey, I will refer to current methodological and technical developments in corpus-based phonetics and interactive testing paradigms that open new windows on the dynamics of speech communication across a language barrier.

 


 

Shinji Watanabe

Shinji Watanabe <This email address is being protected from spambots. You need JavaScript enabled to view it.>

Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, a Senior Principal Research Scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017, and an associate research professor at Johns Hopkins University, Baltimore, MD, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 200 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from IEEE ASRU in 2019. He served as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He was/has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and the Machine Learning for Signal Processing Technical Committee (MLSP).
 

Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition


Recently, speech recognition and understanding research has shifted its focus from single-speaker automatic speech recognition (ASR) in controlled scenarios to more challenging and realistic multispeaker conversation analysis based on ASR and speaker diarization. The CHiME speech separation and recognition challenge is one attempt to tackle these new paradigms. This talk first introduces the latest CHiME-6 challenge and its results, focusing on recognizing multispeaker conversations in a dinner-party scenario. The second part of the talk tackles this problem with emerging end-to-end neural architectures. We first introduce an end-to-end single-microphone multispeaker ASR technique based on recurrent neural networks and transformers to show the effectiveness of the proposed method. Second, we extend this approach to leverage multi-microphone input and realize simultaneous speech separation and recognition within a single neural network trained only with the ASR objective. Finally, we introduce our recent attempts at speaker diarization based on end-to-end neural architectures, including basic concepts, online extensions, and handling unknown numbers of speakers.
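Training a single network for multispeaker ASR with only the ASR objective raises the question of how the network's multiple output streams are matched to reference transcripts during training. One common ingredient in such systems (shown here as a general illustration, not necessarily the exact method of the talk) is a permutation-invariant objective: the loss is the minimum over all assignments of outputs to references. A minimal sketch, assuming per-pair losses have already been computed:

```python
# Permutation-invariant training (PIT) objective, sketched for clarity.
# pairwise_loss[i][j] holds the loss of model output stream i scored
# against reference transcript j; the training loss is the minimum
# total loss over all output-to-reference assignments.
from itertools import permutations

def pit_loss(pairwise_loss):
    """Return the minimum total loss over all speaker permutations."""
    n = len(pairwise_loss)
    return min(
        sum(pairwise_loss[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )

# Toy 2-speaker example: output 0 best matches reference 1 and vice versa,
# so the swapped assignment (1.0 + 2.0) is chosen over the identity (5.0 + 6.0).
losses = [[5.0, 1.0],
          [2.0, 6.0]]
print(pit_loss(losses))  # -> 3.0
```

The exhaustive search over permutations is factorial in the number of speakers, which is acceptable for the two-to-four speaker settings typical of conversational scenarios.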


Giuseppe Riccardi

This email address is being protected from spambots. You need JavaScript enabled to view it.

Biography
Prof. Giuseppe Riccardi is founder and director of the Signals and Interactive Systems Lab at University of Trento, Italy. Prof. Riccardi has co-authored more than 200 scientific papers. He holds more than 90 patents in the field of automatic speech recognition, understanding, machine translation, natural language processing and machine learning. His current research interests are natural language modeling and understanding, spoken/multimodal dialogue, affective computing, machine learning and social computing.
Prof. Riccardi has been on the scientific and organizing committee of EUROSPEECH, INTERSPEECH, ICASSP, ASRU, SLT, NAACL, EMNLP, ACL and EACL. He has been elected member of the IEEE SPS Speech Technical Committee (2005-2008). He is a member of ACL, ACM and elected Fellow of IEEE (2010) and of ISCA (2017).

Empathy in Human Spoken Conversations

Empathy will be a critical ability of next-generation conversational agents. Empathy, as defined in behavioral sciences such as psychology, is the ability of human beings to recognize, understand and react to the sensations, emotions, attitudes and beliefs of others. However, most computational speech and language research is limited to emotion recognition alone. We aim to review the behavioral constructs of empathy, its acoustic and linguistic manifestations, and its interaction with basic emotions. Psychology offers no operational definition of empathy, which makes it vague and difficult to measure. In this talk, we review and evaluate a recently proposed categorical annotation protocol for empathy. This protocol has been applied to a large corpus of real-life, dyadic natural spoken conversations. We will review the behavioral signal analysis of patterns of emotions and empathy.


Bettina Braun

Contact:  This email address is being protected from spambots. You need JavaScript enabled to view it.

Bettina Braun studied phonetics, phonology and computational linguistics and graduated with a PhD from Saarland University (Germany) in 2004. After postdoc positions at the Phonetics Laboratory of the University of Oxford and the Max Planck Institute for Psycholinguistics in Nijmegen (NL), she took up a professorship in General Linguistics at the University of Konstanz (Germany). Her research lies in the fields of prosody (the form and function of prosodic information) and its interaction with other areas (particles, syntax), both in native language processing and in language acquisition. She has led a number of externally funded research projects on the online (real-time) processing of intonation, on the signaling of bias in polar questions, and on the production and perception of rhetorical questions. She is head of the PhonLab and the BabySpeechLab at the University of Konstanz.

The use of active learning systems for stimulus selection and data modelling in complex behavioral study designs


In psycholinguistic and phonetic research, researchers often study the relative weighting of different cues in the interpretation of a linguistic phenomenon (from phonemes to speech acts). It may also be of interest how well the results generalize across items, which necessitates the use of a number of different items and in turn limits the number of conditions and lexicalizations that can be tested in the same experiment. Active Learning (AL) techniques may surmount these difficulties, allowing more conditions to be tested within fewer trials: stimulus selection is informed by the system's learning mechanism, and the models' predicted probabilities constitute the results. In the present study, we test the feasibility and validity of AL and discuss its potential usefulness for psycholinguistic cue-weighting research. To this end, we replicated three patterns of results for a 2x2x2 design (3 prosodic variables with 2 levels each) using an average probability weighting and a regression-based weighting, and tested the reliability of the AL system and the speed with which the patterns of results were approached. Our findings show that AL with regression weighting reliably predicts all result patterns with only a quarter of the trials of the original psycholinguistic experiment.
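As a rough illustration of the general idea (not the authors' implementation), the sketch below selects stimuli from a 2x2x2 condition grid by uncertainty sampling: each cell gets a simple Beta-Bernoulli response model, and the next stimulus comes from the cell whose predicted response probability is closest to 0.5. The "true" probabilities, cue weights, and trial count are invented purely for the demo:

```python
# Active-learning stimulus selection for a 2x2x2 design (illustrative sketch).
import itertools
import random

random.seed(0)

CONDITIONS = list(itertools.product([0, 1], repeat=3))  # 2x2x2 = 8 cells

# Hypothetical "true" response probabilities per cell, used only to
# simulate participant responses; the weights 0.5/0.3/0.2 are invented.
TRUE_P = {c: 0.1 + 0.8 * (0.5 * c[0] + 0.3 * c[1] + 0.2 * c[2])
          for c in CONDITIONS}

# Beta(1, 1) prior per cell, stored as (yes, no) response counts.
counts = {c: [1, 1] for c in CONDITIONS}

def predicted_p(cell):
    """Posterior mean response probability for a condition cell."""
    yes, no = counts[cell]
    return yes / (yes + no)

def next_stimulus():
    """Pick the cell whose prediction is most uncertain (closest to 0.5)."""
    return min(CONDITIONS, key=lambda c: abs(predicted_p(c) - 0.5))

for trial in range(40):  # far fewer trials than a full factorial experiment
    cell = next_stimulus()
    response = random.random() < TRUE_P[cell]  # simulated participant
    counts[cell][0 if response else 1] += 1

# The cells' predicted probabilities are the experiment's results.
weights = {c: round(predicted_p(c), 2) for c in CONDITIONS}
print(weights)
```

Cells whose behavior is already clear drift away from 0.5 and stop attracting trials, so the trial budget concentrates on the most informative conditions; a regression-based weighting, as in the study, would pool information across cells instead of modeling each cell independently.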



Amalia Arvaniti

Contact details: This email address is being protected from spambots. You need JavaScript enabled to view it.

Biopic: Amalia Arvaniti holds the Chair of English Language and Linguistics at Radboud University, Netherlands. She previously held research and teaching appointments at the University of Kent (2012-2020), UC San Diego (2001-2012), and the University of Cyprus (1995-2001), as well as at Cambridge, Oxford, and Edinburgh. Her research, which focuses on the cross-linguistic study of prosody, has been widely published and cited, and has led to paradigm shifts in our understanding of speech rhythm and intonation. Her current research on prosody focuses on intonation and is supported by an ERC-funded grant (ERC-ADG-835263; 2019-2024) titled Speech Prosody in Interaction: The form and function of intonation in human communication (SPRINT). The aim of SPRINT is to better understand the nature of intonational representations and the role of pragmatics and phonetic variability in shaping them, in order to develop a phonological model of intonation that takes into consideration phonetic realization on the one hand and intonation pragmatics on the other.

 

Forty years of the autosegmental-metrical theory of intonational phonology: an update and critical review in light of recent findings


It has been 40 years since Pierrehumbert’s seminal dissertation on “The phonology and phonetics of English intonation” which marked the beginning of the autosegmental-metrical theory of intonational phonology (henceforth AM). The success of AM has led to an explosion of research on intonation, but also brought a number of problems, such as the frequent conflation of phonetics and phonology and a return to long-questioned views on intonation. In this talk, I will first review the fundamental tenets of AM and address some common misconceptions that often lead to faulty comparisons with other models and questionable practices in intonation research more generally. I will also critically appraise the success of AM and review results emerging in the past decade, including results from my own recent research on English and Greek. These results suggest that some assumptions and research practices in AM and intonation research in general need to be reconsidered if we are to gain insight into the structure and functions of intonation crosslinguistically.


Eric Fosler-Lussier

This email address is being protected from spambots. You need JavaScript enabled to view it.

Bio: Eric Fosler-Lussier is a Professor of Computer Science and Engineering, with courtesy appointments in Linguistics and Biomedical Informatics, at The Ohio State University. He is also co-Program Director for the Foundations of Artificial Intelligence Community of Practice at OSU's Translational Data Analytics Institute. After receiving a B.A.S. (Computer and Cognitive Science) and B.A. (Linguistics) from the University of Pennsylvania in 1993, he received his Ph.D. in 1999 from the University of California, Berkeley. He has also been a Member of Technical Staff at Bell Labs, Lucent Technologies, and has held visiting positions at Columbia University and the University of Pennsylvania. He currently serves as Chair of the IEEE Speech and Language Technical Committee and was co-General Chair of ASRU 2019 in Singapore. Eric's research has ranged over topics in speech recognition, dialog systems, and clinical natural language processing, and has been recognized with best paper awards from the IEEE Signal Processing Society and the International Medical Informatics Association.

Low-resourced but long-tailed spoken dialogue system building

In this talk, I discuss lessons learned from our partnership with the Ohio State School of Medicine in developing a Virtual Patient dialog system to train medical students in taking patient histories. The OSU Virtual Patient's unusual development history as a question-answering system provides some interesting insights into co-development strategies for dialog systems. I also highlight our work in “speechifying” the patient chatbot and handling semantically subtle questions when speech data is non-existent and language exemplars for questions are few.


Heiga Zen

This email address is being protected from spambots. You need JavaScript enabled to view it.

Bio: Heiga Zen received his PhD from the Nagoya Institute of Technology, Nagoya, Japan, in 2006. He was an intern/co-op researcher at the IBM T.J. Watson Research Center, Yorktown Heights, NY (2004--2005), and a Research Engineer at the Toshiba Research Europe Ltd. Cambridge Research Laboratory, Cambridge, UK (2008--2011). At Google, he was on the Speech team from July 2011 to July 2018, then joined the Brain team in August 2018. His research interests include speech technology and machine learning.

Title: Model-based text-to-speech synthesis


Ralf Schlueter

This email address is being protected from spambots. You need JavaScript enabled to view it.

Ralf Schlüter serves as Academic Director and Lecturer (Privatdozent) in the Department of Computer Science of the Faculty of Computer Science, Mathematics and Natural Sciences at RWTH Aachen University. He leads the Automatic Speech Recognition Group at the Lehrstuhl Informatik 6: Human Language Technology and Pattern Recognition. He studied physics at RWTH Aachen University and Edinburgh University and received his Diploma in Physics (1995), doctorate in Computer Science (2000) and Habilitation in Computer Science (2019), each at RWTH Aachen University. Dr. Schlüter works on all aspects of automatic speech recognition and has led the scientific work of the Lehrstuhl Informatik 6 in this area in many large national and international research projects, e.g. EU-Bridge and TC-STAR (EU), Babel (US-IARPA) and Quaero (French OSEO).

Automatic Speech Recognition in a State of Flux

Initiated by the successful utilization of deep neural network modeling for large-vocabulary automatic speech recognition (ASR), the last decade brought a considerable diversification of ASR architectures. Following the classical state-of-the-art hidden Markov model (HMM) based architecture, connectionist temporal classification (CTC), attention-based encoder-decoder, recurrent neural network transducer (RNN-T) and monotonic variants, as well as segmental approaches including direct HMM architectures, were introduced. All these architectures show competitive performance, and the question arises: which of them will finally prevail and define the new state of the art in large-vocabulary ASR? In this presentation, a comparative review of current architectures in the context of the Bayes decision rule is provided. Relations and equivalences between architectures are derived, the utilization of data is considered, and the role of language modeling within integrated end-to-end architectures is discussed.
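The Bayes decision rule that frames this comparison can be stated compactly: for an acoustic observation sequence X and candidate word sequences W, the recognizer outputs

```latex
\hat{W} \;=\; \operatorname*{argmax}_{W} \; p(W \mid X)
        \;=\; \operatorname*{argmax}_{W} \; p(X \mid W)\, p(W)
```

Classical HMM-based systems use the second form, modeling the acoustic likelihood p(X|W) and the language model p(W) separately, whereas end-to-end architectures such as CTC, attention-based encoder-decoder models, and RNN-T parameterize the posterior p(W|X) directly; a comparative review can situate each architecture by how it decomposes this rule.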