Sixth International Conference on Spoken Language Processing
The corpus is composed of read sentences in Brazilian Portuguese, similar to sentences found in the TIMIT corpus, as well as answers to questions such as the speaker’s name, address, telephone number, ZIP code, and other information. The data were recorded at 44 kHz with a direct connection from the microphone to the sound card. The corpus contains information from about 200 speakers, although future development efforts will expand the corpus size to 1000 speakers. The paper covers in some detail the protocol used to design this corpus and the methods of data collection.
An HMM/ANN-hybrid continuous digits recognizer developed using a small subset of this corpus has 96.18% word-level accuracy and 78.95% sentence-level accuracy. This recognizer was trained on 48 files, used 11 files for development, and was tested on 19 files, with an average of 5 digits per file. A total of 103 context-dependent categories were used in training. A general-purpose recognizer capable of recognizing arbitrary words is currently under development.
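The abstract does not say how the two accuracy figures were computed. As a minimal sketch, assuming the common definitions (word-level accuracy as one minus the token edit distance over the number of reference words, sentence-level accuracy as the share of files recognized with no error), the calculation could look like the following. The function names and the sample digit strings are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Token-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def accuracies(pairs):
    """Return (word_accuracy, sentence_accuracy) in percent.

    `pairs` is a list of (reference_tokens, hypothesis_tokens) tuples,
    one tuple per test file.
    """
    total_words = sum(len(ref) for ref, _ in pairs)
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    exact = sum(ref == hyp for ref, hyp in pairs)
    word_acc = 100.0 * (total_words - errors) / total_words
    sent_acc = 100.0 * exact / len(pairs)
    return word_acc, sent_acc


# Hypothetical example: two test files, one containing a single substitution.
sample = [
    (["um", "dois", "três"], ["um", "dois", "três"]),
    (["quatro", "cinco"], ["quatro", "seis"]),
]
word_acc, sent_acc = accuracies(sample)
```

Under this definition, the reported 78.95% sentence-level accuracy is consistent with 15 of the 19 test files being recognized without error (15/19 ≈ 0.7895), although the abstract does not state that count. It also shows why the sentence-level figure is much lower than the word-level one: a single wrong digit makes the whole file count as an error.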
This work is part of the Spoltech Project, a computational linguistics research project that aims to create, develop, and improve speech synthesis and recognition technologies. The interdisciplinary project brings together researchers, teachers, and students from the Instituto de Informática and the Instituto de Letras (Language and Literature College) of the Universidade Federal do Rio Grande do Sul, the Departamento de Informática of the Universidade de Caxias do Sul, CSLR/CU (University of Colorado, Boulder), and CSLU/OGI (Oregon Graduate Institute).
Bibliographic reference. Schramm, Mauricio C. / Freitas, Luis Felipe R. / Zanuz, Adriano / Barone, Dante (2000): "A brazilian portuguese language corpus development", In ICSLP-2000, vol.2, 579-582.