Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

A Brazilian Portuguese Language Corpus Development

Mauricio C. Schramm, Luis Felipe R. Freitas, Adriano Zanuz, Dante Barone

Universidade Federal do Rio Grande do Sul, Instituto de Informática, Brazil

This article presents the techniques being used to create a speech database for Brazilian Portuguese. The database is a collection of voice recordings from different speakers and different regions of Brazil. The recordings contain varied phonetic and phonological information. The database has diverse applications, including speech synthesis and recognition systems and data for linguistic studies.

The corpus is composed of read sentences in Brazilian Portuguese, similar to the sentences found in the TIMIT corpus, as well as answers to questions about the speaker’s name, address, telephone number, ZIP code, and other information. The data were recorded at 44 kHz with a direct connection from the microphone to the sound card. The corpus currently contains recordings from about 200 speakers, and future development efforts will expand it to 1000 speakers. The paper covers in some detail the protocol used to design the corpus and the methods of data collection.

An HMM/ANN hybrid continuous-digit recognizer developed using a small subset of this corpus achieves 96.18% word-level accuracy and 78.95% sentence-level accuracy. This recognizer was trained on 48 files, developed using 11 files, and tested on 19 files, with an average of 5 digits per file. A total of 103 context-dependent categories were used in training. A general-purpose recognizer capable of recognizing arbitrary words is currently under development.
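
The two accuracy figures above measure different things: word-level accuracy counts individual digits, while sentence-level accuracy counts a file as correct only if every digit in it is recognized. The following Python sketch illustrates the distinction using the common edit-distance definition of word accuracy. The abstract does not say which scoring tool or formula the authors used, so the function names (`edit_distance`, `score`) and the toy data are assumptions for illustration only. As a side note, 15 fully correct files out of 19 would give 78.95%, which matches the reported sentence-level figure, although the abstract does not state that count.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def score(pairs):
    """Return (word accuracy %, sentence accuracy %) for (ref, hyp) pairs.

    Word accuracy is assumed to be (N - errors) / N, where errors are the
    substitutions, deletions, and insertions found by edit distance.
    """
    total_words = sum(len(ref) for ref, _ in pairs)
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    correct_sentences = sum(1 for ref, hyp in pairs if ref == hyp)
    word_acc = 100.0 * (total_words - errors) / total_words
    sent_acc = 100.0 * correct_sentences / len(pairs)
    return word_acc, sent_acc


# Hypothetical toy data: two 5-digit files, one with a single substitution.
toy_pairs = [
    ("1 2 3 4 5".split(), "1 2 3 4 5".split()),
    ("6 7 8 9 0".split(), "6 7 8 9 9".split()),
]

word_acc, sent_acc = score(toy_pairs)
print(f"word accuracy: {word_acc:.2f}%, sentence accuracy: {sent_acc:.2f}%")
```

In the toy data a single wrong digit costs 10 points of word accuracy but 50 points of sentence accuracy, which is why the sentence-level figure for a digit-string recognizer is typically much lower than the word-level one.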

This work is part of the Spoltech Project, a computational linguistics research project that aims to create, develop, and improve speech synthesis and recognition technologies. This interdisciplinary project brings together researchers, teachers, and students of the Instituto de Informática and Instituto de Letras (Language and Literature College) of the Universidade Federal do Rio Grande do Sul, the Departamento de Informática of the Universidade de Caxias do Sul, CSLR/CU (University of Colorado, Boulder), and CSLU/OGI (Oregon Graduate Institute).

Bibliographic reference. Schramm, Mauricio C. / Freitas, Luis Felipe R. / Zanuz, Adriano / Barone, Dante (2000): "A Brazilian Portuguese language corpus development", In ICSLP-2000, vol.2, 579-582.