Diversity is crucial to reducing the word error rate (WER) when fusing multiple automatic speech recognition (ASR) systems. We present an empirical analysis linking diversity and fusion performance. We transcribed speech from the first 2012 US Presidential debate using multiple ASR systems trained with the Kaldi toolkit. We used the N-best ROVER algorithm to perform hypothesis fusion and measured N-best diversity by the average pairwise WER. We make three key observations. First, the WER of the fused hypothesis decreases significantly with increasing diversity of the N-best list. This decrease is greater than the decrease in WER of the oracle hypothesis in the list. N-best lists from systems trained on different data sets are the most diverse and give the lowest WER upon fusion. Second, the benefit of diversity depends on the choice of fusion scheme: confidence-weighted ROVER exploits diversity better than unweighted ROVER and gives lower WERs. Finally, we explain these observations by a simple linear relation linking diversity to the ROVER WER. This relation depends on the fusion scheme and also reveals the tradeoff between diversity and the average WER of hypotheses in the N-best list.
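As an illustration of the diversity measure named above, the following is a minimal sketch of an average pairwise WER over an N-best list. It is not the authors' implementation. The function names (`edit_distance`, `wer`, `average_pairwise_wer`) are hypothetical, whitespace tokenization is assumed, and because WER is normalized by the length of the reference, each pair is scored in both directions and averaged. That symmetrization is an assumption, as the abstract does not specify how a pair is scored.

```python
from itertools import combinations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """WER of hyp against ref: edit distance divided by reference length."""
    ref_words = ref.split()  # assumption: whitespace tokenization
    hyp_words = hyp.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def average_pairwise_wer(nbest):
    """Mean WER over all unordered pairs of hypotheses in the list.

    Assumption: each pair is scored in both directions and averaged,
    since WER depends on which hypothesis is treated as the reference.
    """
    pairs = list(combinations(nbest, 2))
    if not pairs:
        return 0.0
    total = sum(0.5 * (wer(a, b) + wer(b, a)) for a, b in pairs)
    return total / len(pairs)
```

For example, `average_pairwise_wer(["the cat sat", "the cat sat", "the dog sat"])` scores three pairs, one identical and two differing by a single substitution in three words, so a list of near-duplicates yields a low diversity value.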
Bibliographic reference. Audhkhasi, Kartik / Zavou, Andreas M. / Georgiou, Panayiotis G. / Narayanan, Shrikanth (2013): "Empirical link between hypothesis diversity and fusion performance in an ensemble of automatic speech recognition systems", In INTERSPEECH-2013, 3082-3086.