ISCA Archive SSW 2016

Siri’s voice gets deep learning

Alex Acero

In iOS 10, the new Siri voices are built on a hybrid speech synthesizer leveraging deep learning. The goodness of a concatenation between two units is modeled by a Gaussian distribution on the acoustic vectors (MFCC, F0, and their deltas), with the means and variances being functions of the linguistic features. The goodness of a target is modeled similarly, with the addition of duration to the acoustic vector. The means and variances of these Gaussians are obtained through a Mixture Density Network. The new Siri voices are more natural and smoother, and they allow Siri’s personality to shine through.
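
As a rough illustration of the scoring idea in the abstract, the sketch below treats the "goodness" of a unit as the negative log-likelihood of its feature vector under a diagonal Gaussian, so that lower cost means a better fit. It is a minimal sketch under stated assumptions, not the implementation used in Siri: the function names (gaussian_nll, target_cost, concat_cost), the diagonal covariance, the choice of which vector is scored at a join, and all numeric values are hypothetical. The Mixture Density Network itself is not modeled here; its predicted means and variances are simply passed in as lists.

```python
import math


def gaussian_nll(x, mean, var):
    """Negative log-likelihood of x under a diagonal Gaussian (lower is better)."""
    if not (len(x) == len(mean) == len(var)):
        raise ValueError("x, mean, and var must have the same length")
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        if vi <= 0:
            raise ValueError("variances must be positive")
        total += 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


def target_cost(acoustic, duration, mean, var):
    """Score a candidate unit: duration is appended to the acoustic vector."""
    return gaussian_nll(list(acoustic) + [duration], mean, var)


def concat_cost(join_acoustic, mean, var):
    """Score a join between two units from the acoustic vector at the boundary.

    Assumption: which vector is scored at the join is not specified in the
    abstract; here it is simply whatever the caller passes in.
    """
    return gaussian_nll(join_acoustic, mean, var)


# Hypothetical network output for one target: [one MFCC value, F0 in Hz, duration in s].
mean = [1.0, 120.0, 0.08]
var = [0.5, 25.0, 0.0004]

close_unit = target_cost([1.1, 118.0], 0.09, mean, var)
far_unit = target_cost([3.0, 150.0], 0.20, mean, var)
print(close_unit < far_unit)  # the unit nearer the predicted mean costs less
```

Because the variances come from the network along with the means, a feature the network is uncertain about (large variance) is penalized less for the same deviation than a feature it predicts with high confidence.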


Cite as: Acero, A. (2016) Siri’s voice gets deep learning. Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9),

@inproceedings{acero16_ssw,
  author={Alex Acero},
  title={{Siri’s voice gets deep learning}},
  year=2016,
  booktitle={Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9)},
  pages={}
}