1st Joint SIG-IL/Microsoft Workshop on Speech and Language Technologies for Iberian Languages

Porto Salvo, Portugal
September 3-4, 2009

Automatic Recovery of Punctuation Marks and Capitalization Information for Iberian Languages

Fernando Batista (1,2), Isabel Trancoso (1,3), Nuno Mamede (1,3)

(1) L2F - Spoken Language Systems Laboratory - INESC ID Lisboa
(2) DCTI ISCTE - Institute of Science, Technology and Management, Portugal
(3) IST Technical University of Lisbon, Portugal

This paper presents experimental results on automatically enriching speech recognition output with punctuation marks and capitalization information. The two tasks are treated as two classification problems, both addressed with a maximum entropy modeling approach. The approach is language independent, as supported by experiments on Portuguese and Spanish Broadcast News corpora. For each language, the discriminative models are trained on spoken and written corpora from that language. This paper provides the first results on Spanish Broadcast News data and the first comparative study between Portuguese and Spanish on this subject.
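
To make the classification framing concrete, the following is a minimal sketch of punctuation recovery as a per-word maximum entropy (multinomial logistic) classification problem. It is not the authors' system: the label set, the feature templates (current word and following word), the toy English training sentences, and the plain gradient-ascent training loop are all assumptions chosen for illustration. Capitalization could be framed the same way by swapping the labels for capitalization classes.

```python
import math

# Hypothetical label set: the punctuation event that follows each word.
LABELS = ["none", "comma", "stop"]

def features(words, i):
    """Assumed feature templates: a bias term, the current word, the next word."""
    nxt = words[i + 1] if i + 1 < len(words) else "</s>"
    return ["bias", "w=" + words[i], "next=" + nxt]

def predict_proba(weights, feats):
    """Maximum entropy (softmax) distribution over LABELS for the active features."""
    scores = [sum(weights.get((f, y), 0.0) for f in feats) for y in LABELS]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(examples, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the conditional log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, gold in examples:
            probs = predict_proba(weights, feats)
            for y, p in zip(LABELS, probs):
                # Gradient per active feature: observed count minus expected count.
                grad = (1.0 if y == gold else 0.0) - p
                for f in feats:
                    weights[(f, y)] = weights.get((f, y), 0.0) + lr * grad
    return weights

def classify(weights, feats):
    """Return the most probable label."""
    probs = predict_proba(weights, feats)
    return LABELS[probs.index(max(probs))]

# Toy, made-up training data: unpunctuated words with the event after each word.
TRAIN = [
    ("the news today is good", ["none", "none", "none", "none", "stop"]),
    ("however the weather is bad", ["comma", "none", "none", "none", "stop"]),
    ("however prices are good", ["comma", "none", "none", "stop"]),
]

examples = []
for sentence, tags in TRAIN:
    words = sentence.split()
    assert len(words) == len(tags)
    for i, tag in enumerate(tags):
        examples.append((features(words, i), tag))

weights = train(examples)

# Label an unseen word sequence, one decision per word.
test_words = "however the news is bad".split()
predicted = [classify(weights, features(test_words, i)) for i in range(len(test_words))]
print(list(zip(test_words, predicted)))
```

Each word is classified independently here, which keeps the sketch short; a realistic system would use richer features and far larger spoken and written corpora for each language.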

Index Terms: Rich Transcription, Capitalization, Punctuation marks, Speech processing

Bibliographic reference. Batista, Fernando / Trancoso, Isabel / Mamede, Nuno (2009): "Automatic recovery of punctuation marks and capitalization information for Iberian languages", In SLTECH-2009, 99-102.