Second ESCA/IEEE Workshop on Speech Synthesis

September 12-15, 1994
Mohonk Mountain House, New Paltz, NY, USA

Coding fundamental frequency patterns for multi-lingual synthesis with INTSINT in the MULTEXT Project

Daniel Hirst, Nancy Ide, Jean Véronis

Laboratoire Parole et Langage, CNRS & Université de Provence, Aix-en-Provence, France

MULTEXT (Multilingual Text Tools and Corpora) is the largest project funded under the European Commission's LRE (Linguistic Research and Engineering) Program. The project aims to contribute to the development of generally usable software tools for manipulating and analysing multi-lingual text and speech, and for annotating multi-lingual text and speech corpora with structural and linguistic markup. It will also attempt to establish conventions for the encoding of such corpora, building on and contributing to the preliminary recommendations of the relevant international and European standardization initiatives. MULTEXT will further work towards establishing a set of guidelines for linguistic software development, which will be widely published to enable future development by others. The project consortium, consisting of eight academic and research institutions and six major European industrial partners, is committed to making its results (corpus, tools, specifications and accompanying documentation) freely and publicly available.

At the outset of the project, the consortium will, in cooperation with the European Advisory Group on Language Engineering Standards (EAGLES), undertake to analyse, test and extend the SGML-based recommendations of the Text Encoding Initiative (TEI) on real-size data, and gradually develop encoding conventions specifically suited to multi-lingual corpora and to the needs of corpus-based natural language (NL) and speech research. Using the emerging software tools, the consortium plans to produce a substantial annotated multilingual corpus, including parallel texts and spoken data, in six EC languages (English, French, Spanish, German, Italian and Dutch). The entire corpus will be marked for gross logical and structural features; subsets of the corpus will be marked and hand-validated for sentence and sub-sentence features, part of speech, alignment of parallel texts, and prosody. All markup will have to comply with the TEI-based corpus encoding conventions established within the project. The corpus will also serve as a testbed for the project tools and as a resource for future tool development and evaluation.

Bibliographic reference. Hirst, Daniel / Ide, Nancy / Véronis, Jean (1994): "Coding fundamental frequency patterns for multi-lingual synthesis with INTSINT in the MULTEXT project", in SSW2-1994, 77-80.