ISCA Archive SPSC 2021

‘How to Collect Speech Data with Human Rights in Mind’ - Language Resources, ethics and IPR

Khalid Choukri

Speech is one of the most conspicuous biometric characteristics of humans. Data-driven approaches are the basis of all recent Machine Learning/Deep Learning techniques. In addition to being a biometric dimension, the speech signal carries various information about the speaker, such as gender, affective state and emotions, while the recorded audio content may contain private or confidential information. The signal may even reflect the environment in which it was recorded, making it identifiable. All these aspects have to be addressed in data collection and production. In some contexts, speakers’ informed consent would be sufficient, while in many others ethical and legal issues have to be carefully considered. One may imagine a data collection that has to reflect real emotions of the speakers, e.g. fear, sadness, joy, etc. The debate on how to provoke such affective reactions, from multiple viewpoints (ethical, psychological, cultural, etc.), is essential. Should the collection simulate fears or provoke real ones? Should the speakers be informed in advance or not, and what if the speakers are children? In all circumstances, data production implies high costs and heavy processes, which raise questions of ownership and intellectual property rights. Fair behavior, but also current regulations (e.g. GDPR in Europe), require that speakers can withdraw their consent at any time, including years after the packaging of the data. The EU requires that some resources be shared only with countries that have adopted similar regulations. How can one comply with such commitments while sharing the resource with the community at large under very permissive licences that do not allow monitoring of all uses made of the data? Last but not least, what happens if one obtains, in very good faith, a language data set from sources that do not comply with these requirements and discovers, once it is in use, that the process infringes some of these principles?
This workshop aims at opening the debate on all these aspects to better share current practices, learn from other disciplines, and contemplate good/best practices in the field.


Cite as: Choukri, K. (2021) ‘How to Collect Speech Data with Human Rights in Mind’ - Language Resources, ethics and IPR. Proc. 2021 ISCA Symposium on Security and Privacy in Speech Communication,

@inproceedings{choukri21_spsc,
  author={Khalid Choukri},
  title={{‘How to Collect Speech Data with Human Rights in Mind’  - Language Resources, ethics and IPR}},
  year=2021,
  booktitle={Proc. 2021 ISCA Symposium on Security and Privacy in Speech Communication},
  pages={}
}