Bilingual Prosodic Dataset Compilation for Spoken Language Translation

Alp Öktem, Mireia Farrús, Antonio Bonafonte


This paper builds on a previous methodology that exploits dubbed media material to build prosodically annotated bilingual corpora. The almost fully-automatized process serves for building data for training spoken language models without the need for designing and recording bilingual data. The methodology is put into use by compiling an English-Spanish parallel corpus using a recent TV series. The collected corpus contains 7225 parallel utterances totaling to about 10 hours of data annotated with speaker information, word-alignments and word-level acoustic features. Both the extraction scripts and the dataset are distributed open-source for research purposes.


 DOI: 10.21437/IberSPEECH.2018-5

Cite as: Öktem, A., Farrús, M., Bonafonte, A. (2018) Bilingual Prosodic Dataset Compilation for Spoken Language Translation. Proc. IberSPEECH 2018, 20-24, DOI: 10.21437/IberSPEECH.2018-5.


@inproceedings{Öktem2018,
  author={Alp Öktem and Mireia Farrús and Antonio Bonafonte},
  title={{Bilingual Prosodic Dataset Compilation for Spoken Language Translation}},
  year=2018,
  booktitle={Proc. IberSPEECH 2018},
  pages={20--24},
  doi={10.21437/IberSPEECH.2018-5},
  url={http://dx.doi.org/10.21437/IberSPEECH.2018-5}
}