INTERSPEECH 2004 - ICSLP
This work proposes a method of predicting pitch and voicing from mel-frequency cepstral coefficient (MFCC) vectors. Two maximum a posteriori (MAP) methods are considered. The first models the joint distribution of the MFCC vector and pitch using a Gaussian mixture model (GMM), while the second also models the temporal correlation of the pitch contour using a combined hidden Markov model (HMM)-GMM framework. Monophone-based HMMs are connected in an unconstrained monophone grammar, which enables pitch to be predicted from unconstrained speech input. Evaluation on 130,000 MFCC vectors reveals a voicing classification accuracy of over 92% and an RMS pitch error of 10 Hz. The predicted pitch contour is also applied to MFCC-based speech reconstruction, with the resultant speech almost indistinguishable from that reconstructed using a reference pitch.
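The GMM-based prediction described above can be illustrated with a minimal sketch. The abstract does not give the estimator in detail, so the following is one plausible reading rather than the authors' implementation: given a joint GMM over the MFCC vector x and pitch f, choose the component with the highest posterior probability given x, then return that component's conditional mean of f given x. The function names, the dictionary layout, and the toy two-dimensional parameters are all hypothetical. Voicing classification and the HMM-based temporal modelling are omitted.

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l

def solve_spd(l, b):
    """Solve (L L^T) y = b by forward then back substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(l[i][k] * z[k] for k in range(i))) / l[i][i]
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (z[i] - sum(l[k][i] * y[k] for k in range(i + 1, n))) / l[i][i]
    return y

def map_pitch(x, components):
    """Return the conditional mean of pitch from the most probable component given x."""
    best_score, best_pitch = -math.inf, None
    for c in components:
        l = cholesky(c["cov_xx"])
        d = [xi - mi for xi, mi in zip(x, c["mean_x"])]
        y = solve_spd(l, d)  # Sigma_xx^{-1} (x - mu_x)
        quad = sum(di * yi for di, yi in zip(d, y))
        log_det = 2.0 * sum(math.log(l[i][i]) for i in range(len(x)))
        # log(weight * N(x; mu_x, Sigma_xx)), dropping the constant shared by all components
        score = math.log(c["weight"]) - 0.5 * (log_det + quad)
        if score > best_score:
            # Conditional mean: mu_f + Sigma_fx Sigma_xx^{-1} (x - mu_x)
            pitch = c["mean_f"] + sum(s * yi for s, yi in zip(c["cov_fx"], y))
            best_score, best_pitch = score, pitch
    return best_pitch

# Hypothetical toy model: 2-D "MFCC" vectors and two components (not trained on data).
# The pitch variance is not needed to compute the conditional mean, so it is left out.
toy_components = [
    {"weight": 0.5, "mean_x": [0.0, 0.0], "cov_xx": [[1.0, 0.0], [0.0, 1.0]],
     "mean_f": 100.0, "cov_fx": [10.0, 0.0]},
    {"weight": 0.5, "mean_x": [5.0, 5.0], "cov_xx": [[1.0, 0.0], [0.0, 1.0]],
     "mean_f": 200.0, "cov_fx": [0.0, 20.0]},
]

print(map_pitch([1.0, 0.0], toy_components))
print(map_pitch([5.0, 6.0], toy_components))
```

In practice the component parameters would be estimated from paired MFCC and reference-pitch data, and the sketch uses only the standard library so that the linear algebra stays visible. A posterior-weighted sum of the conditional means over all components is a common alternative to picking the single best component.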
Bibliographic reference. Shao, Xu / Milner, Ben P. (2004): "MAP prediction of pitch from MFCC vectors for speech reconstruction", In INTERSPEECH-2004, 2425-2428.