This paper presents a confidence measure for speech indexing that aims to predict the indexing quality of a speech document for a Spoken Document Retrieval (SDR) task. We first describe how the indexing quality of a speech document is evaluated. We then present our method for predicting it, which is based on a confidence measure provided by an automatic speech recognition system and on the detection of semantic outliers, implemented with the Latent Dirichlet Allocation (LDA) model. Experiments are conducted on the French broadcast news campaign ESTER2, in a classical SDR scenario where users submit text queries to a search engine. Results demonstrate an overall improvement when the detection is done with the LDA model. The detection rate is always above 70%.
Index Terms: speech indexing, confidence measure, spoken document retrieval, latent Dirichlet allocation
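The abstract names two signals, an ASR confidence measure and LDA-based semantic outlier detection, without giving the scoring details. The following is only a minimal sketch of how such signals could be combined, not the authors' method. It assumes an already-trained topic model supplied as plain dictionaries, and every name, the OR-combination rule, the probability floor, and both thresholds are hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def word_given_doc(
    word: str,
    doc_topics: Sequence[float],
    topic_words: Sequence[Dict[str, float]],
    floor: float = 1e-6,
) -> float:
    """Return p(w | d) = sum_k p(w | k) * p(k | d) for a trained topic model.

    `floor` is an assumed back-off probability for words a topic has never seen.
    """
    return sum(
        p_topic * topic_words[k].get(word, floor)
        for k, p_topic in enumerate(doc_topics)
    )


def flag_suspect_words(
    hypothesis: Sequence[Tuple[str, float]],
    doc_topics: Sequence[float],
    topic_words: Sequence[Dict[str, float]],
    asr_threshold: float = 0.5,        # assumed value, not from the paper
    semantic_threshold: float = 0.05,  # assumed value, not from the paper
) -> List[str]:
    """Flag words with low ASR confidence or a poor fit to the document topics."""
    suspects = []
    for word, asr_confidence in hypothesis:
        semantic_fit = word_given_doc(word, doc_topics, topic_words)
        # Hypothetical combination rule: either signal alone raises a flag.
        if asr_confidence < asr_threshold or semantic_fit < semantic_threshold:
            suspects.append(word)
    return suspects


# Toy two-topic model (politics, sports) and a document that is mostly politics.
toy_topic_words = [
    {"election": 0.2, "vote": 0.2, "minister": 0.1},
    {"goal": 0.2, "match": 0.2, "team": 0.1},
]
toy_doc_topics = [0.95, 0.05]
# Each hypothesis entry is (recognized word, ASR confidence in [0, 1]).
toy_hypothesis = [("election", 0.9), ("vote", 0.4), ("goal", 0.95), ("minister", 0.8)]
toy_suspects = flag_suspect_words(toy_hypothesis, toy_doc_topics, toy_topic_words)
```

In this toy setting, "vote" is flagged because of its low ASR confidence, while "goal" is flagged as a semantic outlier despite high ASR confidence, which is the kind of error an acoustic confidence score alone would miss.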
Bibliographic reference. Senay, Grégory / Linarès, Georges (2012): "Confidence measure for speech indexing based on latent dirichlet allocation", In INTERSPEECH-2012, 2302-2305.