Third Workshop on Spoken Language Technologies for Under-resourced Languages

Cape Town, South Africa
May 7-9, 2012

Almannarómur: An Open Icelandic Speech Corpus

Jón Guðnason (1), Oddur Kjartansson (1), Jökull Jóhannsson (1), Elín Carstensdóttir (1), Hannes Högni Vilhjálmsson (1), Hrafn Loftsson (1), Sigrún Helgadóttir (2), Kristín M. Jóhannsdóttir (3), Eiríkur Rögnvaldsson (3)

(1) Reykjavik University, Iceland
(2) The Árni Magnússon Institute for Icelandic Studies, Reykjavik, Iceland
(3) University of Iceland, Reykjavik, Iceland

The purpose of the Almannarómur project is to collect data for an Icelandic speech corpus (database). Its main aim is to create an open-source speech project that enables research and development in Icelandic language technology. The database is particularly suitable for acoustic modelling for speech recognition, but it could also be used for other purposes, such as developing a speaker recognition system or analyzing prosody. The project is run by Reykjavik University and the Icelandic Centre for Language Technology in cooperation with Google, which provided technical support. A total of 563 participants took part, each providing around 219 read sentences on average. This paper gives a short introduction to Icelandic language technology, describes how the text corpus for the database was constructed, and presents how the recording effort was organized, along with its main results.

Index Terms: Icelandic, Speech Recording, Corpus Creation, Automatic Speech Recognition

Bibliographic reference. Guðnason, Jón / Kjartansson, Oddur / Jóhannsson, Jökull / Carstensdóttir, Elín / Vilhjálmsson, Hannes Högni / Loftsson, Hrafn / Helgadóttir, Sigrún / Jóhannsdóttir, Kristín M. / Rögnvaldsson, Eiríkur (2012): "Almannarómur: An Open Icelandic Speech Corpus", in SLTU-2012, 80-83.