September 22-25, 1997
This paper describes a system for the automatic extraction of diphone units from given speech utterances. The method is based on automatic phonetic segmentation followed by rule-driven diphone boundary detection. The phonetic segmenter, developed at IRST, was trained and tested in both speaker-independent and speaker-dependent modes. A rule formalism, involving acoustic parameters and arithmetic and logical operators, was defined to express the acoustic/phonetic knowledge acquired during previous experience with manual diphone segmentation. A specialized rule-parsing tool was designed that processes a given sequence of automatically derived phone boundaries using a corresponding sequence of predefined acoustic parameters. Several sets of rules were developed that include both general principles and specific details concerning the content of the diphone database of "Eloquens", the CSELT text-to-speech synthesis system for the Italian language. Accuracy was evaluated by comparing the manual and automatic segmentations of the speech utterances of a female speaker: nearly 95% of the boundaries were correctly positioned, given a tolerance of 20 ms.
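The evaluation criterion stated above (the share of automatic boundaries falling within a tolerance of the manual ones) can be illustrated with a minimal sketch. The function name, the one-to-one pairing of boundaries, the inclusive comparison, and the sample values are all assumptions made for illustration; they are not taken from the paper.

```python
def boundary_accuracy(manual_ms, automatic_ms, tolerance_ms=20.0):
    """Return the fraction of automatic boundaries lying within
    tolerance_ms of the corresponding manual boundary.

    Assumes both sequences are aligned one-to-one (same diphones,
    same order); the paper does not specify the pairing procedure.
    """
    if len(manual_ms) != len(automatic_ms):
        raise ValueError("boundary sequences must have the same length")
    if not manual_ms:
        raise ValueError("no boundaries to compare")
    # Count boundaries whose absolute deviation is within the tolerance
    # (inclusive comparison is an assumption).
    hits = sum(
        1 for m, a in zip(manual_ms, automatic_ms)
        if abs(m - a) <= tolerance_ms
    )
    return hits / len(manual_ms)


# Hypothetical boundary positions in milliseconds, not data from the paper.
manual = [120.0, 245.0, 390.0, 512.0]
automatic = [125.0, 240.0, 415.0, 530.0]
accuracy = boundary_accuracy(manual, automatic)
print(f"{accuracy:.0%} of boundaries within 20 ms")
```

With these made-up values, three of the four boundaries deviate by 20 ms or less, so the sketch reports 75%.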
Bibliographic reference. Angelini, Bianca / Barolo, Claudia / Falavigna, Daniele / Omologo, Maurizio / Sandri, Stefano (1997): "Automatic diphone extraction for an Italian text-to-speech synthesis system", In EUROSPEECH-1997, 581-584.