Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Using F0 within a Phonologically Motivated Method of Unit Selection

Andrew Breen (1), James Salter (2)

(1) School of Information Systems, UEA, UK, (2) HTK, UK

The current generation of concatenative speech synthesis systems relies on the selection of appropriate pre-recorded speech units from a repository of sounds. This process, commonly referred to as unit selection, is a critical step in the production of natural-sounding speech. However, the process of unit selection is only as good as the labelling strategy used and the quality and style of the recordings. Simply stated, the unit selection process cannot select that which isn't labelled or recorded. Once selected, these units must be seamlessly concatenated and prosodically modified to reflect the desired rhythm and intonation. Traditionally this has been viewed as a signal-processing step. The most popular algorithms are based on Pitch Synchronous Overlap Add (PSOLA) or Harmonic plus Noise (HNM) models; each has its strengths and weaknesses. Some researchers are of the opinion that signal processing should be kept to a minimum; as a result, they have concentrated on building systems with extremely large databases of recorded speech. Unit selection within such systems has a vast amount of data from which to select an appropriate unit. As such, selected sounds tend to be close to the desired phonemic and prosodic contexts, so little post-selection signal processing is required. However, such approaches have a number of practical and commercial disadvantages. Practically, large databases are difficult to record, annotate and manage, while a large program footprint is commercially impractical for a number of applications.

The Laureate Text to Speech system, originally developed at BT Adastral Park, differs from the approaches discussed above in that it does not use any acoustic properties of the speech signal in the unit selection process. Instead, it relies solely on a rich phonological representation in the labels associated with the speech data. However, such an approach does not take into account the limitations of the signal processing applied after unit selection. The method described in this paper extends the basic Laureate philosophy to include sensitivity to the method of signal processing used within the system. In this technique, a systematic approach to the addition of speech data is adopted, which enables the developer to trade off quality against computational load and storage. The paper describes how multiple copies of the speech data repository, recorded at different fundamental frequencies, may be used within the unit selection process, and how these multiple recordings are used to automatically control the degree of signal processing applied to the selected units.
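
The abstract does not give the selection rule itself, so the following is only a minimal sketch of the general idea, under stated assumptions: each copy of the repository is tagged with a nominal recording F0, candidates are matched on an exact phonological label, and the copy whose F0 is closest to the target (on a log scale) is preferred so that the pitch scaling factor stays near 1. The names `Unit`, `select_unit` and `tolerance` are hypothetical and are not taken from Laureate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A recorded unit (hypothetical structure, for illustration only)."""
    label: str          # phonological label; exact string match is an assumption
    recorded_f0: float  # nominal F0 of the recording, in Hz
    repository: str     # which copy of the repository the unit came from


def select_unit(candidates, label, target_f0, tolerance=0.05):
    """Pick the matching unit that needs the least pitch modification.

    Returns (unit, scale, needs_modification), where scale is the factor
    target_f0 / recorded_f0 and needs_modification is True when the scale
    deviates from 1.0 by more than the assumed tolerance.
    """
    matching = [u for u in candidates if u.label == label]
    if not matching:
        raise LookupError(f"no recorded unit labelled {label!r}")

    # Distance is measured as |log(target / recorded)| so that raising and
    # lowering pitch by the same ratio are penalised equally.
    best = min(matching, key=lambda u: abs(math.log(target_f0 / u.recorded_f0)))
    scale = target_f0 / best.recorded_f0
    needs_modification = abs(scale - 1.0) > tolerance
    return best, scale, needs_modification


# Three copies of the same unit, recorded at different F0 values (made-up data).
candidates = [
    Unit("a_stressed", 100.0, "low"),
    Unit("a_stressed", 150.0, "mid"),
    Unit("a_stressed", 200.0, "high"),
]
```

With the made-up data above, `select_unit(candidates, "a_stressed", 160.0)` selects the unit from the "mid" copy. Adding more copies at intermediate F0 values reduces the scaling that the signal-processing stage has to apply, at the cost of storage, which is the quality versus footprint trade-off the abstract refers to.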

Bibliographic reference.  Breen, Andrew / Salter, James (2000): "Using F0 within a phonologically motivated method of unit selection", In ICSLP-2000, vol.1, 705-708.