Comparison of Multiple System Combination Techniques for Keyword Spotting

William Hartmann, Le Zhang, Kerri Barnes, Roger Hsiao, Stavros Tsakalidis, Richard Schwartz


System combination is a common approach to improving results for both speech transcription and keyword spotting, especially in the context of low-resourced languages where building multiple complementary models requires less computational effort. Using state-of-the-art CNN and DNN acoustic models, we analyze the performance, cost, and trade-offs of four system combination approaches: feature combination, joint decoding, hitlist combination, and a novel lattice combination method. Previous work has focused solely on accuracy comparisons. We show that joint decoding, lattice combination, and hitlist combination perform comparably, and that all three perform significantly better than feature combination. However, for practical systems, earlier combination reduces computational cost and storage requirements. Results are reported on four languages from the IARPA Babel dataset.


DOI: 10.21437/Interspeech.2016-1381

Cite as

Hartmann, W., Zhang, L., Barnes, K., Hsiao, R., Tsakalidis, S., Schwartz, R. (2016) Comparison of Multiple System Combination Techniques for Keyword Spotting. Proc. Interspeech 2016, 1913-1917.

Bibtex
@inproceedings{Hartmann+2016,
author={William Hartmann and Le Zhang and Kerri Barnes and Roger Hsiao and Stavros Tsakalidis and Richard Schwartz},
title={Comparison of Multiple System Combination Techniques for Keyword Spotting},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1381},
url={http://dx.doi.org/10.21437/Interspeech.2016-1381},
pages={1913--1917}
}