Combining several heterogeneous systems is known to provide remarkable performance improvements in verification and detection tasks. In Spoken Term Detection (STD), two important issues arise: (1) how to define a common set of detected candidates, and (2) how to combine system scores to produce a single score per candidate. In this paper, a discriminative calibration/fusion approach commonly applied in speaker and language recognition is adopted for STD. Under this approach, we first propose several heuristics to hypothesize scores for systems that do not detect a given candidate. In this way, the original problem of fusing several unaligned sets of detection candidates is converted into a verification task. As in other verification tasks, system weights and offsets are then estimated through linear logistic regression. As a result, the combined scores are well calibrated, and the detection threshold is determined automatically by the application parameters (priors and costs). The proposed method not only offers an elegant solution to the problem of fusion and calibration of multiple detectors, but also provides consistent improvements over a baseline approach based on majority voting, according to experiments on the MediaEval 2012 Spoken Web Search (SWS) task involving 8 heterogeneous systems developed at two different laboratories.
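The pipeline summarized above (hypothesize missing scores, fuse with linear logistic regression, threshold from priors and costs) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which heuristics were used, so filling a missing score with that system's minimum observed score is an assumption made here for illustration, as are the function names `fill_missing`, `train_fusion`, and `bayes_threshold`, and the use of plain gradient descent as the optimizer.

```python
import numpy as np

def fill_missing(scores):
    """Hypothesize scores for systems that did not detect a candidate.

    scores: (n_candidates, n_systems) array, NaN where a system has no detection.
    Assumed heuristic: use each system's minimum observed score.
    """
    filled = scores.copy()
    col_min = np.nanmin(scores, axis=0)
    rows, cols = np.where(np.isnan(scores))
    filled[rows, cols] = col_min[cols]
    return filled

def train_fusion(X, y, prior=0.5, lr=0.1, steps=5000):
    """Estimate fusion weights w and offset b by linear logistic regression.

    X: (n, d) aligned scores, y: (n,) labels with 1 = target, 0 = non-target.
    The objective is prior-weighted so that X @ w + b behaves like a
    log-likelihood ratio rather than a posterior log-odds.
    """
    n_tar = y.sum()
    n_non = len(y) - n_tar
    trial_weight = np.where(y == 1, prior / n_tar, (1.0 - prior) / n_non)
    prior_logit = np.log(prior / (1.0 - prior))
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        z = X @ w + b + prior_logit
        p = 1.0 / (1.0 + np.exp(-z))
        grad = trial_weight * (p - y)   # gradient of weighted cross-entropy w.r.t. z
        w -= lr * (X.T @ grad)
        b -= lr * grad.sum()
    return w, b

def bayes_threshold(p_target, c_miss, c_fa):
    """Minimum-expected-cost threshold on a calibrated log-likelihood ratio."""
    return np.log((c_fa * (1.0 - p_target)) / (c_miss * p_target))
```

Under these assumptions, a candidate with aligned score vector `x` would be accepted when `x @ w + b` exceeds `bayes_threshold(p_target, c_miss, c_fa)`, so no threshold has to be tuned separately once the application priors and costs are fixed.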
Bibliographic reference. Abad, Alberto / Rodríguez-Fuentes, Luis Javier / Penagarikano, Mikel / Varona, Amparo / Bordel, Germán (2013): "On the calibration and fusion of heterogeneous spoken term detection systems", in INTERSPEECH-2013, 20-24.