ISCA Archive SPSC 2021

‘How to Collect Speech Data with Human Rights in Mind’ - The legal view

Catherine Jasserand-Breeman

Recently, Facebook released VoxPopuli, a large-scale dataset of speech data for research purposes. The data were not scraped from the Internet but extracted from public event recordings made available on the European Parliament's website. Even if the materials published on the website are not subject to copyright restrictions (either because they are copyright-free or because the copyright holders have waived their rights), they can still be subject to other rights and conditions, such as personality rights or data protection rules. Yet Facebook does not seem to have acknowledged this, or even to have assessed the legal basis under which the personal data contained in the recordings could be processed. Copyright and data protection issues are often mixed up. The increased use of Creative Commons licences (which allow the re-use of copyrightable works) to release large-scale datasets containing personal data illustrates this confusion; see also Catherine Jasserand, ‘Free to re-use? The case of facial images scrapped from the Internet and compiled in mega research datasets’.

Data that have been made publicly available cannot be re-used merely because they are available. If they are personal data, they still need to be processed under one of the legal grounds identified in the GDPR (Article 6 and Article 9 GDPR) or in other applicable data protection legislation. When data are collected directly from the data subjects, one could rely on consent, and on explicit consent for sensitive data. But what could be the legal basis when the data are obtained from third parties? Could it be the legitimate interest of the data controllers (Art. 6(1)(f) GDPR)? The performance of a task carried out in the public interest (Art. 6(1)(e) GDPR)? Or the research exception allowing the processing of sensitive data (Art. 9(2)(j) GDPR)? A rigorous analysis of all the legal grounds provided by the GDPR is needed. Besides identifying a possible legal basis, complying with data subjects' rights and data protection principles when the data originate from third parties constitutes another challenge. Last but not least, regulators (e.g. the European Data Protection Supervisor) and human rights organizations (e.g. the Council of Europe) seem to support the use of synthetic data to develop and train AI models at large scale; see the Council of Europe’s Guidelines on Facial Recognition (2021).

Cite as: Jasserand-Breeman, C. (2021) ‘How to Collect Speech Data with Human Rights in Mind’ - The legal view. Proc. 2021 ISCA Symposium on Security and Privacy in Speech Communication,

  author={Catherine Jasserand-Breeman},
  title={{‘How to Collect Speech Data with Human Rights in Mind’  - The legal view}},
  booktitle={Proc. 2021 ISCA Symposium on Security and Privacy in Speech Communication},