Local audio-visual descriptors are often compactly stored
using representations such as the soft quantization histogram.
Typically, classification performance with
histogram representations is improved through the use of
large codeword sets. Unfortunately, this approach runs
into overfitting and scalability challenges when applied
to richly diverse real-world collections.
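As a rough illustration of the histogram representation described above, the sketch below pools Gaussian-kernel soft assignments of local descriptors over a codebook. The function name, the kernel choice, and the `bandwidth` parameter are assumptions made for illustration; the abstract does not specify how the soft quantization is computed.

```python
import numpy as np

def soft_quantization_histogram(descriptors, codewords, bandwidth=1.0):
    """Return a normalized soft-assignment histogram over codewords.

    descriptors: (n, d) local descriptors; codewords: (k, d) codebook.
    bandwidth is a hypothetical kernel width, not a value from the paper.
    """
    # Squared Euclidean distance between every descriptor and every codeword.
    diffs = descriptors[:, None, :] - codewords[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=2)
    # Soft assignment: Gaussian kernel weights, normalized per descriptor.
    logits = -sq_dist / (2.0 * bandwidth ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Pool over descriptors and normalize so the histogram sums to 1.
    hist = weights.sum(axis=0)
    return hist / hist.sum()
```

The output has one dimension per codeword, so enlarging the codeword set enlarges the representation directly, which is the scalability concern raised above.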
A novel i-vector approach was recently proposed for the speaker-verification task. In this work, we study the relative effectiveness of the i-vector as a compact representation of local audio descriptors (e.g., MFCCs) within a multimedia event detection system. Specifically, we model the local audio descriptors using a Gaussian Mixture Model (GMM). Following the i-vector approach, we constrain the GMM parameters to a low-dimensional subspace while preserving most of the variability (i.e., information) in the descriptors. The GMM parameters in the subspace constitute a compact representation that exhibits robustness in the face of sparse data.
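The subspace constraint can be sketched with the standard total-variability formulation from the speaker-verification literature, in which the clip-specific GMM mean supervector is M = m + Tw and the i-vector is the posterior mean of w. The code below is a minimal sketch under that assumption; the function name, the diagonal-covariance GMM, and the pre-trained matrix `T` are illustrative choices and are not confirmed by the abstract.

```python
import numpy as np

def extract_ivector(frames, weights, means, variances, T):
    """Posterior-mean i-vector for one clip under a diagonal-covariance GMM.

    frames: (n, d) descriptors; weights: (c,); means, variances: (c, d);
    T: (c * d, r) total-variability matrix (assumed already trained).
    """
    c, d = means.shape
    # Per-frame log-likelihood under each diagonal Gaussian component.
    diff = frames[:, None, :] - means[None, :, :]
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Zeroth-order and centered first-order Baum-Welch statistics.
    N = post.sum(axis=0)                        # (c,)
    F = post.T @ frames - N[:, None] * means    # (c, d)
    # Posterior mean: w = (I + T' S^-1 N T)^-1 T' S^-1 F
    inv_var = (1.0 / variances).reshape(-1)     # (c * d,)
    N_rep = np.repeat(N, d)                     # (c * d,)
    T_scaled = T * inv_var[:, None]
    precision = np.eye(T.shape[1]) + T.T @ (T_scaled * N_rep[:, None])
    return np.linalg.solve(precision, T_scaled.T @ F.reshape(-1))
```

The returned vector has length r regardless of clip duration or the number of GMM components, which is what makes the representation compact. The identity term in `precision` comes from the standard normal prior on w and keeps the estimate well-conditioned when a clip contributes few frames.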
We evaluate the method by performing the multimedia event detection (MED) task using only audio information within consumer (e.g., YouTube) videos. Experiments with the 2011 TRECVID MED data show that the i-vector provides superior performance and lower dimensionality than the bag-of-words soft quantization histograms used in the state-of-the-art BBN VISER system in the 2011 TRECVID MED Evaluation.
Index Terms: multimedia event detection, factor analysis
Bibliographic reference. Zhuang, Xiaodan / Tsakalidis, Stavros / Wu, Shuang / Natarajan, Pradeep / Prasad, Rohit / Natarajan, Prem (2012): "Compact audio representation for event detection in consumer media", In INTERSPEECH-2012, 2089-2092.