Third ESCA/COCOSDA Workshop on Speech Synthesis
November 26-29, 1998
As part of the development of a text-to-speech system for Irish, a new algorithm for speech synthesis has been developed. This algorithm, which achieves synthesis by concatenation of diphones, is based principally on two classical signal processing techniques: linear prediction and overlap-and-add (OLA). Unlike the well-known TD-PSOLA method, no pitch marking is required; instead, the recorded segments are modified to produce constant-pitch signals. As a result, the OLA procedures can be applied over broad windows, especially during concatenation, which smooths the spectral transition between the diphones.
In an initialisation stage, the recorded segments undergo a pitch modification, an energy equalisation and, if necessary, a lengthening of the shorter sounds. The actual synthesis then consists of two modules: concatenation and prosody matching, the latter including pitch and duration modification.
The pitch modification (both in the initialisation stage and in the prosody matching) is realised through a linear prediction analysis of the signal, producing estimates of the vocal tract filter and the glottal signal. In order to modify the pitch without changing the formant frequencies, an interpolation (or decimation) is applied to each period of the glottal signal according to the required pitch modification rate.
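The pitch modification described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a constant, known pitch period in samples, uses the autocorrelation method for the linear prediction analysis, and replaces a proper interpolation or decimation filter with plain linear interpolation. The function names (`lpc`, `residual`, `synth`, `resample_periods`, `change_pitch`) and the default order are hypothetical. Resampling each period also changes the total duration, which this sketch leaves uncorrected.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC: x[n] is predicted as sum_k a[k] * x[n-1-k]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R = R + 1e-9 * r[0] * np.eye(order)  # small ridge for numerical safety
    return np.linalg.solve(R, r[1 : order + 1])

def residual(x, a):
    """Inverse filtering: estimate of the glottal (excitation) signal."""
    inverse = np.concatenate([[1.0], -a])
    return np.convolve(x, inverse)[: len(x)]

def synth(e, a):
    """All-pole (vocal tract) filter driven by the excitation e."""
    y = np.zeros(len(e))
    p = len(a)
    for n in range(len(e)):
        past = y[max(0, n - p) : n][::-1]  # y[n-1], y[n-2], ...
        y[n] = e[n] + np.dot(a[: len(past)], past)
    return y

def resample_periods(e, period, new_period):
    """Stretch or shrink every excitation period to new_period samples."""
    n_periods = len(e) // period
    old_grid = np.arange(period)
    new_grid = np.linspace(0, period - 1, new_period)
    chunks = [
        np.interp(new_grid, old_grid, e[i * period : (i + 1) * period])
        for i in range(n_periods)
    ]
    return np.concatenate(chunks)

def change_pitch(x, period, new_period, order=12):
    """Change the pitch period while reusing the same vocal tract filter."""
    a = lpc(x, order)
    e = residual(x, a)
    return synth(resample_periods(e, period, new_period), a)
```

Because the same filter coefficients are reused for resynthesis, the spectral envelope (and hence the formant positions) is kept while the excitation period changes. A real system would presumably analyse the signal frame by frame rather than with a single filter.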
The duration modification is based on the time-scale modification algorithm proposed by Roucos and Wilgus, called the Synchronous Overlap and Add (SOLA) algorithm. The method and the computation of its parameters have been optimised, producing a very high quality time-scale modification.
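A basic form of synchronised overlap-and-add can be sketched as follows. This is a textbook-style illustration, not the optimised variant mentioned above: the function name `sola`, the frame length, the hop size, the search range and the linear cross-fade are all assumptions chosen for readability.

```python
import numpy as np

def sola(x, rate, frame=400, hop_a=100, search=50):
    """Time-scale x by `rate` (rate > 1 lengthens) without changing pitch."""
    hop_s = int(round(hop_a * rate))  # synthesis hop
    if frame <= hop_s + 2 * search:
        raise ValueError("frame too short for this rate and search range")
    n_frames = (len(x) - frame) // hop_a + 1
    out = np.zeros(n_frames * hop_s + frame + search)
    out[:frame] = x[:frame]
    end = frame  # number of valid samples written so far
    for m in range(1, n_frames):
        seg = x[m * hop_a : m * hop_a + frame]
        target = m * hop_s
        best_k, best_score = 0, -np.inf
        # Look for the shift that best aligns the new frame with the output.
        for k in range(-search, search + 1):
            start = target + k
            if start < 0:
                continue
            ov = min(end - start, frame)
            a, b = out[start : start + ov], seg[:ov]
            score = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
            if score > best_score:
                best_k, best_score = k, score
        start = target + best_k
        ov = min(end - start, frame)
        fade = np.linspace(0.0, 1.0, ov, endpoint=False)
        # Cross-fade over the overlap, then copy the rest of the frame.
        out[start : start + ov] = (1 - fade) * out[start : start + ov] + fade * seg[:ov]
        out[start + ov : start + frame] = seg[ov:]
        end = max(end, start + frame)
    return out[:end]
```

For example, `sola(x, 1.5)` returns a signal roughly 1.5 times as long as `x`. The cross-correlation search is what keeps successive frames in phase, so that the overlap does not introduce cancellation.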
Finally, the concatenation module overlaps the common phoneme of the two diphones being concatenated. Computing their cross-correlation allows the two signals to be synchronised, avoiding phase mismatch. The constant pitch allows a large overlap of the signals. Before the signals are added, two half Hamming windows (the first decreasing, the second increasing) are applied to them to generate a smooth spectral transition.
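The concatenation step can be sketched in a few lines. This is only an illustration under stated assumptions: the names `best_lag` and `concatenate`, the normalised correlation measure, and the `overlap` and `max_lag` parameters are hypothetical, and the two inputs are assumed to share the same constant pitch.

```python
import numpy as np

def best_lag(tail, head, max_lag):
    """Shift k in [0, max_lag] at which head[k:] best matches tail."""
    n = len(tail)
    best_k, best_score = 0, -np.inf
    for k in range(max_lag + 1):
        seg = head[k : k + n]
        if len(seg) < n:
            break
        score = np.dot(tail, seg) / (np.linalg.norm(tail) * np.linalg.norm(seg) + 1e-12)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

def concatenate(left, right, overlap, max_lag):
    """Join two diphone signals by cross-fading their common phoneme."""
    tail = left[-overlap:]
    k = best_lag(tail, right, max_lag)  # synchronise to avoid phase mismatch
    aligned = right[k:]
    window = np.hamming(2 * overlap)
    fade_in, fade_out = window[:overlap], window[overlap:]  # two half Hamming windows
    mixed = tail * fade_out + aligned[:overlap] * fade_in
    return np.concatenate([left[:-overlap], mixed, aligned[overlap:]])
```

Setting `max_lag` to about one pitch period is enough to find an in-phase alignment when both signals have the same constant pitch. Note that two half Hamming windows do not sum exactly to one, so the level inside the overlap varies slightly.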
The algorithm has been tested on Irish sentences. The diphones were extracted from a corpus recorded by an Irish speaker, who tried to keep a constant pitch during pronunciation in order to facilitate the initial pitch modification. The prosody of each sentence was defined from a reference pronunciation of the same sentence. The synthesised sentences are fairly clear, with some degree of naturalness.
Bibliographic reference. Charonnat, L. / Ó-Néill, G. / Mercier, Guy (1998): "An Irish Speech Synthesiser", In SSW3-1998, 243-248.