Interspeech'2005 - Eurospeech
While there are numerous methods for estimating the fundamental frequency (F0) of speech, existing methods often suffer from pitch-doubling or pitch-halving errors. Heuristics can be added to constrain the range of allowable F0 values, but it is still difficult to set the algorithm parameters appropriately if one does not know the speaker's age or gender in advance. The proposed method differs from most other F0-estimation algorithms in that it does not use autocorrelation, cepstral, or pattern-recognition techniques. Instead, information from 32 band-pass filters is combined at every frame, a Viterbi search provides an initial F0-contour estimate, and this estimate is then refined based on intensity discrimination of the speech signal. Despite the use of a large number of filters (which provide complementary information and hence robustness), the implementation runs faster than real time on a 2.4 GHz processor without optimization for processing speed. Results are presented for two corpora, one of an adult male speaker and one of children of different ages. For the first corpus, the average absolute error is 4.10 Hz (percent error of 4.15%); for the second corpus, the average absolute error is 7.74 Hz (percent error of 3.38%).
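As an illustration of the Viterbi step only, the following is a minimal sketch of how a dynamic-programming search can pick one F0 value per frame while discouraging octave jumps. It is not the paper's implementation: the abstract does not state how the 32 filter outputs are turned into candidates or what costs are used, so the per-frame candidate lists, their local scores, the log-frequency transition cost, and the `jump_penalty` weight are all assumptions made for this example.

```python
import math


def viterbi_f0(candidates, scores, jump_penalty=2.0):
    """Choose one F0 candidate per frame with a Viterbi search.

    candidates[t]: candidate F0 values in Hz for frame t (assumed given).
    scores[t]:     local score of each candidate (higher is better; assumed).
    jump_penalty:  hypothetical weight on the frame-to-frame jump in octaves.
    """
    n_frames = len(candidates)
    if n_frames == 0:
        return []

    # best[t][j]: best cumulative score of any path ending at candidate j.
    # back[t][j]: index of the previous-frame candidate on that path.
    best = [list(scores[0])]
    back = [[-1] * len(candidates[0])]

    for t in range(1, n_frames):
        row_best, row_back = [], []
        for j, f_cur in enumerate(candidates[t]):
            best_prev, best_val = 0, -math.inf
            for i, f_prev in enumerate(candidates[t - 1]):
                # Assumed transition cost: distance in octaves, so a
                # doubling or halving costs about one full jump_penalty.
                cost = jump_penalty * abs(math.log2(f_cur / f_prev))
                val = best[t - 1][i] - cost
                if val > best_val:
                    best_prev, best_val = i, val
            row_best.append(best_val + scores[t][j])
            row_back.append(best_prev)
        best.append(row_best)
        back.append(row_back)

    # Backtrace from the best-scoring candidate in the last frame.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append(candidates[t][j])
        j = back[t][j]
    return path[::-1]


# Toy input: in the third frame the doubled candidate scores best locally.
candidates = [[100.0, 200.0], [102.0, 204.0], [101.0, 202.0], [103.0, 206.0]]
scores = [[1.0, 0.8], [1.0, 0.9], [0.9, 1.0], [1.0, 0.7]]
contour = viterbi_f0(candidates, scores)
```

In this toy input, choosing the best local score in every frame would select the doubled value in the third frame, whereas the transition cost keeps the search on the lower, smoother contour. The refinement stage described in the abstract is not modelled here.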
Bibliographic reference. Hosom, John-Paul (2005): "F0 estimation for adult and children's speech", In INTERSPEECH-2005, 317-320.