ISCA - International Speech
Communication Association

  • Home
  • Internship @ Telecom Paris, Paris, France

Internship @ Telecom Paris, Paris, France

2024-01-04 15:16 | Anonymous


Automatic speech analysis of public talks. 

Description. Today, humanity has reached a stage at which extremely important aspects (such as information exchange) are tied not only to actual so-called hard skills, but also to soft skills. One such important skill is public speaking. Like many forms of interaction between people, the assessment of public speaking depends on many factors (often subjectively perceived). The goal of our project is to create an automatic system which can take into account these different factors and evaluate the quality of the performance. This requires understanding which elements can be assessed objectively and which vary depending on the listener [Hemamou, Wortwein, Chollet21]. For such an analysis, it is necessary to analyze public speaking at various levels: high-level (audio, video, text), intermediate (voice monotony, auto-gestures, speech structure, and etc.) and low-level (fundamental frequency, action units, POS / tags, and etc.) [Barkar]. This internship offers an opportunity to analyze the audio component of a public speech. The student is asked to solve two main problems. The engineering task is to create an automatic speech transcription system that detects speech disfluency. To do this, it is proposed to collect a bibliography on this topic and come up with an engineering solution. The second, research task, is to use audio cues to automatically analyze the success of a performance of a talk. This internship will give you an opportunity to solve an engineering problem as well as learn more about research approaches. By the end you will have expertise in audio processing as well and machine learning methods for multimodal analysis. If the internship is successfully completed, an article may be published. PhD position fundings on Social Computing will be accessible in the team at the end of the internship (at INRIA).

Registration & Organisation. Name of organization: Institut Polytechnique de Paris, Telecom-Paris Website of organization: Department: IDS/LTCI/ Address: Palaiseau, France

Supervision. Supervision will include weekly meetings with the main supervisor and regular meetings (every 2-3 weeks) with co-supervisors. Telecom-Paris, 2023-2024 ANR Project «REVITALISE» Name of supervisor: Alisa BARKAR Name of co-supervisor: Chloe Clavel, Mathieu Chollet, Béatrice BIANCARDI Contact details:

Duration & Planning. The internship is planned as a 5-6 month full-time internship for the spring semester 2024. 6 months considers 24 weeks within which it will be covered following list of activities:

● ACTIVITY 1(A1): Problem description and integration to the working environment

● ACTIVITY 2(A2): Bibliography overview

● ACTIVITY 3(A3): Implementation of the automatic transcription with detected discrepancies

● ACTIVITY 4(A4): Evaluation of the automatic transcription

● ACTIVITY 5(A5): Application of the developed methods to the existing data

● ACTIVITY 6(A6): Analysis of the importance of para-verbal features for the performance perception

● ACTIVITY 7(A7): Writing the report

Selected references of the team.

1. [Hemamou] L. Hemamou, G. Felhi, V. Vandenbussche, J.-C. Martin, C. Clavel, HireNet: a Hierarchical Attention Model for the Automatic Analysis of Asynchronous Video Job Interviews. in AAAI 2019, to appear

2. [Ben-Youssef] Atef Ben-Youssef, Chloé Clavel, Slim Essid, Miriam Bilac, Marine Chamoux, and Angelica Lim. Ue-hri: a new dataset for the study of user engagement in spontaneous human-robot interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 464–472. ACM, 2017.

3. [Wortwein] Torsten Wörtwein, Mathieu Chollet, Boris Schauerte, Louis-Philippe Morency, Rainer Stiefelhagen, and Stefan Scherer. 2015. Multimodal Public Speaking Performance Assessment. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI '15). Association for Computing Machinery, New York, NY, USA, 43–50.

4. [Chollet21] Chollet, M., Marsella, S., & Scherer, S. (2021). Training public speaking with virtual social interactions: effectiveness of real-time feedback and delayed feedback. Journal on Multimodal User Interfaces, 1-13.

5. [Barkar] Alisa Barkar, Mathieu Chollet, Beatrice Biancardi, and Chloe Clavel. 2023. Insights Into the Importance of Linguistic Textual Features on the Persuasiveness of Public Speaking. In Companion Publication of the 25th International Conference on Multimodal Interaction (ICMI '23 Companion). Association for Computing Machinery, New York, NY, USA, 51–55. Telecom-Paris, 2023-2024 ANR Project «REVITALISE»

Other references. 

1. Dinkar, T., Vasilescu, I., Pelachaud, C. and Clavel, C., 2020, May. How confident are you? Exploring the role of fillers in the automatic prediction of a speaker’s confidence. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8104-8108). IEEE.

2. Whisper: Robust Speech Recognition via Large-Scale Weak Supervision, Radford A. et al., 2022, url:

3. Romana, Amrit and Kazuhito Koishida. “Toward A Multimodal Approach for Disfluency Detection and Categorization.” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023): 1-5.

4. Radhakrishnan, Srijith et al. “Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition.” ArXiv abs/2310.06434 (2023): n. pag.

5. Wu, Xiao-lan et al. “Explanations for Automatic Speech Recognition.” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023): 1-5.

6. Min, Zeping and Jinbo Wang. “Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study.” ArXiv abs/2307.06530 (2023): n. pag.

7. Ouhnini, Ahmed et al. “Towards an Automatic Speech-to-Text Transcription System: Amazigh Language.” International Journal of Advanced Computer Science and Applications (2023): n. pag.

8. Bigi, Brigitte. “SPPAS: a tool for the phonetic segmentations of Speech.” (2023).

9. Rekesh, Dima et al. “Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition.” ArXiv abs/2305.05084 (2023): n. pag.

10. Arisoy, Ebru et al. “Bidirectional recurrent neural network language models for automatic speech recognition.” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015): 5421-5425.

11. Padmanabhan, Jayashree and Melvin Johnson. “Machine Learning in Automatic Speech Recognition: A Survey.” IETE Technical Review 32 (2015): 240 - 251.

12. Berard, Alexandre et al. “End-to-End Automatic Speech Translation of Audiobooks.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018): 6224-6228.

13. Kheir, Yassine El et al. “Automatic Pronunciation Assessment - A Review.” ArXiv abs/2310.13974 (2023): n. pag. Telecom-Paris, 2023-2024

 Organisation  Events   Membership   Help 
 > Board  > Interspeech  > Join - renew  > Sitemap
 > Legal documents  > Workshops  > Membership directory  > Contact
 > Logos      > FAQ
       > Privacy policy

© Copyright 2024 - ISCA International Speech Communication Association - All right reserved.

Powered by Wild Apricot Membership Software