Statistical parametric speech synthesis continues to improve as different machine learning techniques are investigated for modeling spectrum, F0 and duration from corpora of natural speech. Traditional techniques rely on single decision trees. This paper shows the advantages of modeling with random forests of decision trees instead. The improvements are equivalent to more than doubling the amount of training data, offering end users significantly better synthesis from the same amount of data. The gains are proportionally larger on smaller datasets, particularly for voices built from only 30 minutes of speech. The techniques have been tested over a wide range of voices and languages of various sizes and quality, producing significant improvements in all cases. They are documented and robustly implemented for others to use in the December 2014 release of the Festvox voice building toolkit, so these benefits are directly available in standard voices built for the Festival Speech Synthesis System and CMU Flite.
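The contrast between a single decision tree and a forest of trees can be illustrated with a minimal sketch. This is not the Festvox implementation. The function names (`build_tree`, `build_forest`, `predict_tree`, `predict_forest`), the bootstrap resampling, the random feature subsets and the toy data are all assumptions made for illustration; the paper's actual procedure may differ. Each tree here is a small regression tree, and the forest prediction is the average of the individual tree predictions.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared errors of the values around their mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, targets, features, depth=0, max_depth=3, min_leaf=2):
    """Grow a small regression tree; a leaf is the mean target of its samples."""
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return mean(targets)
    best = None  # (error, feature index, threshold)
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            error = sse(left) + sse(right)
            if best is None or error < best[0]:
                best = (error, f, threshold)
    if best is None:
        return mean(targets)  # no admissible split: make a leaf
    _, f, threshold = best
    children = []
    for keep_left in (True, False):
        part = [(r, t) for r, t in zip(rows, targets)
                if (r[f] <= threshold) == keep_left]
        children.append(build_tree([r for r, _ in part], [t for _, t in part],
                                   features, depth + 1, max_depth, min_leaf))
    return (f, threshold, children[0], children[1])


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def build_forest(rows, targets, n_trees=20, feature_fraction=0.5, seed=0):
    """Train each tree on a bootstrap sample using a random subset of features.

    Both sources of randomness are assumptions of this sketch.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features * feature_fraction))
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(len(rows)), k=len(rows))
        features = rng.sample(range(n_features), k)
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees in the forest."""
    return mean([predict_tree(tree, row) for tree in forest])


if __name__ == "__main__":
    # Made-up toy data: two numeric context features and a fake duration target.
    rows = [[0, 1], [0, 2], [1, 1], [1, 3], [2, 2], [2, 3], [3, 1], [3, 3]]
    targets = [50.0, 55.0, 60.0, 72.0, 80.0, 85.0, 90.0, 101.0]
    single = build_tree(rows, targets, features=[0, 1])
    forest = build_forest(rows, targets, n_trees=20)
    print("single tree:", predict_tree(single, [2, 2]))
    print("forest     :", predict_forest(forest, [2, 2]))
```

Averaging many trees, each trained on a slightly different view of the data, tends to reduce the variance of a single tree's prediction, which is one plausible reading of why the gains reported above are larger for small datasets.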
Bibliographic reference. Black, Alan W. / Muthukumar, Prasanna Kumar (2015): "Random forests for statistical speech synthesis", In INTERSPEECH-2015, 1211-1215.