ISCA Archive IberSPEECH 2022

CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech

Vinícius G. Santos, Caroline Adriane Alves, Bruno Baldissera Carlotto, Bruno Angelo Papa Dias, Lucas Rafael Stefanel Gris, Renan de Lima Izaias, Maria Luiza Azevedo de Morais, Paula Marin de Oliveira, Rafael Sicoli, Flaviane Romani Fernandes Svartman, Marli Quadros Leite, Sandra Maria Aluísio

With the advent of technology, the availability of linguistic data in digital format has been increasingly encouraged to facilitate its use not only in different areas of Linguistics but also in related areas, such as natural language processing. Inspired by a protocol for digitizing the NURC (‘Cultured Linguistic Urban Norm’) project collection, one of the most influential in Brazilian Linguistics, this paper aims to present the text-to-speech alignment process of the NURC-São Paulo Minimal Corpus. This subcorpus comprises 21 audio files and audio-aligned multilevel transcripts according to linguistically motivated intonation units (≈18 hours, ≈155 k words), covering three text genres. The dataset, currently used to evaluate methods for processing the entire NURC-SP corpus, is publicly available on the Portulan Clarin repository [CC BY-NC-ND 4.0] (https://hdl.handle.net/21.11129/0000-000F-73CA-C).


doi: 10.21437/IberSPEECH.2022-33

Cite as: Santos, V.G., Alves, C.A., Carlotto, B.B., Papa Dias, B.A., Stefanel Gris, L.R., Lima Izaias, R.d., Azevedo de Morais, M.L., Marin de Oliveira, P., Sicoli, R., Svartman, F.R.F., Leite, M.Q., Aluísio, S.M. (2022) CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech. Proc. IberSPEECH 2022, 161-165, doi: 10.21437/IberSPEECH.2022-33

@inproceedings{santos22_iberspeech,
  author={Vinícius G. Santos and Caroline Adriane Alves and Bruno Baldissera Carlotto and Bruno Angelo {Papa Dias} and Lucas Rafael {Stefanel Gris} and Renan de {Lima Izaias} and Maria Luiza {Azevedo de Morais} and Paula {Marin de Oliveira} and Rafael Sicoli and Flaviane Romani Fernandes Svartman and Marli Quadros Leite and Sandra Maria Aluísio},
  title={{CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech}},
  year=2022,
  booktitle={Proc. IberSPEECH 2022},
  pages={161--165},
  doi={10.21437/IberSPEECH.2022-33}
}