Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

A Flexible, Scalable Finite-State Transducer Architecture for Corpus-Based Concatenative Speech Synthesis

Jon R. W. Yi, James R. Glass, I. Lee Hetherington

Spoken Language Systems Group, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA

In this paper we describe the conversion of our phonologically-based synthesizer into a finite-state transducer (FST) representation that can be used for real-time, natural-sounding synthesis. We have designed a transducer structure that efficiently performs the common task of unit selection in concatenative speech synthesis. By encapsulating domain-independent concatenative synthesis costs in a constraint kernel, we have obtained a topology that scales linearly with the size of the synthesis corpus. The FST representation provides a flexible, unified framework in which we can leverage our previous work in speech recognition, in areas such as pronunciation modelling and search. The FST synthesizer has been incorporated into two servers that operate within our conversational system architecture to convert meaning representations into waveforms. We have had preliminary success with the new FST-based synthesis in several constrained spoken dialogue applications.
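
As a rough illustration of the unit selection task mentioned in the abstract, the sketch below chooses one corpus instance per target phone so that the summed concatenation cost is minimal, using a plain dynamic-programming (Viterbi-style) search. It is not the FST topology or the constraint kernel described in the paper; the names `select_units` and `toy_concat_cost`, the toy corpus, and the cost values are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def select_units(
    targets: Sequence[str],
    corpus: Sequence[str],
    concat_cost: Callable[[int, int], float],
) -> Tuple[float, List[int]]:
    """Return (total cost, corpus positions) of the cheapest unit sequence.

    Hypothetical helper: `corpus` is a flat list of phone labels, and a
    "unit" is simply an index into that list.
    """
    # Candidate corpus positions for each target phone.
    candidates = [[i for i, p in enumerate(corpus) if p == t] for t in targets]
    if not candidates or any(not c for c in candidates):
        raise ValueError("every target phone needs at least one corpus instance")

    # Each layer maps a corpus position to (best cost so far, backpointer).
    first: Dict[int, Tuple[float, Optional[int]]] = {
        i: (0.0, None) for i in candidates[0]
    }
    layers = [first]
    for cands in candidates[1:]:
        prev = layers[-1]
        layer: Dict[int, Tuple[float, Optional[int]]] = {}
        for j in cands:
            # Cheapest way to reach unit j from any unit in the previous layer.
            layer[j] = min((prev[i][0] + concat_cost(i, j), i) for i in prev)
        layers.append(layer)

    # Trace the backpointers from the cheapest final unit.
    end = min(layers[-1], key=lambda j: layers[-1][j][0])
    total = layers[-1][end][0]
    path = [end]
    for layer in reversed(layers[1:]):
        path.append(layer[path[-1]][1])
    path.reverse()
    return total, path


def toy_concat_cost(i: int, j: int) -> float:
    """Assumed cost: free if the units are adjacent in the corpus, else 1."""
    return 0.0 if j == i + 1 else 1.0


# Toy corpus of phone labels; "k a t" occurs contiguously at positions 3..5.
corpus = ["b", "a", "t", "k", "a", "t", "s"]
total, path = select_units(["k", "a", "t"], corpus, toy_concat_cost)
```

With this toy cost, contiguous corpus stretches are preferred because they need no concatenation. Note that this naive search compares every pair of candidates at each step, whereas the abstract reports a topology that scales linearly with corpus size; the sketch does not attempt to reproduce that property.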


Bibliographic reference.  Yi, Jon R. W. / Glass, James R. / Hetherington, I. Lee (2000): "A flexible, scalable finite-state transducer architecture for corpus-based concatenative speech synthesis", In ICSLP-2000, vol.3, 322-325.