A Small Griko-Italian Speech Translation Corpus

Marcely Zanon Boito, Antonios Anastasopoulos, Aline Villavicencio, Laurent Besacier, Marika Lekakou

This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 20 minutes of speech) which have been transcribed and translated into Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morphosyntactic tags and word-level glosses. Pseudo-phones were also generated by applying an automatic unit discovery method. We detail how the corpus was collected, cleaned and processed, and we illustrate its use on zero-resource tasks by presenting baseline results for speech-to-translation alignment and unsupervised word discovery. The dataset will be available online, aiming to encourage replicability and diversity in computational language documentation experiments.

DOI: 10.21437/SLTU.2018-8

Cite as: Zanon Boito, M., Anastasopoulos, A., Villavicencio, A., Besacier, L., Lekakou, M. (2018) A Small Griko-Italian Speech Translation Corpus. Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, 36-41, DOI: 10.21437/SLTU.2018-8.

@inproceedings{ZanonBoito2018,
  author={Marcely {Zanon Boito} and Antonios Anastasopoulos and Aline Villavicencio and Laurent Besacier and Marika Lekakou},
  title={{A Small Griko-Italian Speech Translation Corpus}},
  booktitle={Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages},