In iOS 10, the new Siri voices are built on a hybrid speech synthesizer that leverages deep learning. The goodness of a concatenation between two units is modeled by a Gaussian distribution over the acoustic vectors (MFCCs, F0, and their deltas), whose means and variances are functions of the linguistic features. The goodness of a target is modeled in the same way, with duration added to the acoustic vector. The means and variances of these Gaussians are predicted by a Mixture Density Network. The new Siri voices are more natural and smoother, and they allow Siri’s personality to shine through.
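As a rough illustration of the Gaussian scoring described above, the sketch below turns a predicted mean and variance into a cost by taking the negative log-likelihood of a candidate unit's feature vector. It is an assumption-laden sketch and not the system's actual implementation: the diagonal covariance, the function names `gaussian_nll` and `target_cost`, and the toy numbers are all hypothetical. In the setting the abstract describes, the means and variances would come from the Mixture Density Network given the linguistic features; here they are simply passed in.

```python
import math


def gaussian_nll(x, mean, var):
    """Negative log-likelihood of x under a diagonal Gaussian.

    Assumption: dimensions are treated as independent (diagonal
    covariance). A lower value means the vector fits the prediction
    better, so it can serve as a unit-selection cost.
    """
    if not (len(x) == len(mean) == len(var)):
        raise ValueError("x, mean and var must have the same length")
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        if vi <= 0.0:
            raise ValueError("variances must be positive")
        total += 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


def target_cost(acoustics, duration, mean, var):
    """Hypothetical target cost: the acoustic vector with duration appended,
    scored against a predicted mean and variance of matching length."""
    return gaussian_nll(list(acoustics) + [duration], mean, var)


# Toy example: two acoustic dimensions plus duration (made-up values).
predicted_mean = [1.0, 120.0, 0.08]
predicted_var = [0.25, 100.0, 0.0004]

good_unit = target_cost([1.1, 118.0], 0.09, predicted_mean, predicted_var)
poor_unit = target_cost([2.5, 160.0], 0.20, predicted_mean, predicted_var)
print(good_unit < poor_unit)
```

Under these assumptions a concatenation cost could be scored the same way, using only the acoustic vector at the join and omitting the duration term.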
Cite as: Acero, A. (2016) Siri’s voice gets deep learning. Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9).
@inproceedings{acero16_ssw,
  author={Alex Acero},
  title={{Siri’s voice gets deep learning}},
  year=2016,
  booktitle={Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9)},
  pages={}
}