11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Age and Gender Recognition Based on Multiple Systems - Early vs. Late Fusion

Tobias Bocklet (1), Georg Stemmer (2), Viktor Zeissler (3), Elmar Nöth (1)

(1) FAU Erlangen-Nürnberg, Germany
(2) SVOX Deutschland GmbH, Germany
(3) Elektrobit Germany, Germany

This paper focuses on the automatic recognition of a person's age and gender based only on his or her voice. Up to five different systems are compared and combined in different configurations. Three systems model the speaker's characteristics in different feature spaces (MFCC, PLP, and TRAPS) by Gaussian mixture models; the features of these systems are the concatenated mean vectors. The fourth system uses a physical two-mass vocal model and estimates 9 glottal features from voiced speech sections in a data-driven optimization procedure. For each utterance, the minimum, maximum, and mean vectors form a 27-dimensional feature vector. The fifth system calculates a 219-dimensional prosodic feature set for each utterance based on voiced and unvoiced speech segments. We compare two ways of fusing the systems. In the first (early fusion), we concatenate the systems' feature vectors at the feature level. In the second (late fusion), the systems are combined at the score level by multi-class logistic regression. Although there are only minor differences between the two approaches, late fusion is slightly superior. On the development set of the Interspeech Agender challenge we achieved an unweighted recall of 46.1% with early fusion and 47.8% with late fusion.
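
The difference between the two fusion strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the random placeholder data, the three-class setup, and the fusion weights and biases are all assumptions made for illustration. In particular, the weights are fixed by hand here, whereas in the paper they would be learned by multi-class logistic regression, and the score-level form shown (one scalar weight per system plus one bias per class, followed by a softmax) is only one common parameterization.

```python
import math
import random


def utterance_stats(frames):
    """Stack the per-dimension minimum, maximum and mean over all frames."""
    columns = list(zip(*frames))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    means = [sum(c) / len(c) for c in columns]
    return mins + maxs + means


def early_fusion(feature_vectors):
    """Feature-level fusion: concatenate the per-system feature vectors."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def softmax(values):
    """Numerically stable softmax over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(system_scores, weights, biases):
    """Score-level fusion: weighted sum of per-system class scores plus a
    per-class bias, mapped to posteriors with a softmax.

    Assumption: weights and biases are given; training them is omitted.
    """
    logits = [
        biases[k] + sum(w * scores[k] for w, scores in zip(weights, system_scores))
        for k in range(len(biases))
    ]
    return softmax(logits)


random.seed(0)

# Hypothetical data: 50 voiced frames with 9 glottal parameters each.
glottal_frames = [[random.random() for _ in range(9)] for _ in range(50)]
glottal_vec = utterance_stats(glottal_frames)  # 3 * 9 = 27 values

# Placeholder for a 219-dimensional prosodic feature vector.
prosodic_vec = [random.random() for _ in range(219)]

# Early fusion: one long vector (27 + 219 = 246 values) for a single classifier.
early = early_fusion([glottal_vec, prosodic_vec])

# Late fusion: each system has its own classifier; only their scores are combined.
# Hypothetical scores of two systems for three classes, with hand-picked weights.
scores = [[0.2, 1.0, -0.5], [0.1, 0.4, 0.3]]
posteriors = late_fusion(scores, weights=[0.7, 0.3], biases=[0.0, 0.0, 0.0])

print(len(glottal_vec), len(early), [round(p, 3) for p in posteriors])
```

In the early-fusion path a single classifier sees all features at once, so its input dimensionality grows with every added system. In the late-fusion path each system keeps its own model, and only a small number of fusion parameters has to be estimated.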

Bibliographic reference.  Bocklet, Tobias / Stemmer, Georg / Zeissler, Viktor / Nöth, Elmar (2010): "Age and gender recognition based on multiple systems - early vs. late fusion", In INTERSPEECH-2010, 2830-2833.