Enabling users to retrieve photos from a huge collection using high-level personal queries (e.g. "Uncle Bill's house") is very attractive but technically very challenging. This paper proposes a set of approaches toward this goal, assuming only 30% of the photos are annotated with sparse, spontaneously spoken descriptions recorded when the photos are taken. We fuse visual features (visual words plus global visual concepts from the Columbia 374 detectors) with the sparse speech features and train latent topics for the photos in the collection using non-negative matrix factorization. The retrieved results are then enhanced by a two-layer mutually reinforced random walk over the different types of features. In this way, photos without speech annotation, or with annotations covering only some categories of information (e.g. where and who), can still be retrieved, thanks to the fused visual/speech features and the jointly trained latent topics. Very encouraging results were obtained in initial experiments.
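The core idea of fusing visual and sparse speech features into one matrix and factorizing it into latent topics can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, random data, the 30% annotation rate, and the cosine-similarity ranking in topic space are all assumptions for demonstration, using scikit-learn's NMF.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_photos, n_visual, n_speech, n_topics = 50, 40, 30, 5

# Hypothetical visual features for every photo (e.g. visual-word histograms).
V = rng.random((n_photos, n_visual))

# Speech features exist for only ~30% of the photos (sparse annotations);
# the rest stay as all-zero rows.
S = np.zeros((n_photos, n_speech))
annotated = rng.choice(n_photos, size=n_photos * 3 // 10, replace=False)
S[annotated] = rng.random((len(annotated), n_speech))

# Fuse the two feature types by concatenation and jointly factorize:
# X ≈ W @ H, where W holds photo-topic weights and H topic-feature weights.
X = np.hstack([V, S])
model = NMF(n_components=n_topics, init="nndsvda", max_iter=500)
W = model.fit_transform(X)   # shape: (n_photos, n_topics)
H = model.components_        # shape: (n_topics, n_visual + n_speech)

# A query (here a random stand-in) is projected into the same topic space,
# and photos are ranked by cosine similarity of their topic vectors --
# so even photos with no speech annotation receive meaningful scores.
q = rng.random(n_visual + n_speech)
q_topics = model.transform(q.reshape(1, -1))
sims = (W @ q_topics.T).ravel() / (
    np.linalg.norm(W, axis=1) * np.linalg.norm(q_topics) + 1e-9
)
ranking = np.argsort(-sims)  # photo indices, best match first
```

Because the topics are trained on the fused matrix, the visual columns and the speech columns share the same latent space, which is what lets unannotated photos be matched against spoken-description queries.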
Bibliographic reference. Liou, Yuan-ming / Fu, Yi-sheng / Lee, Hung-yi / Lee, Lin-shan (2014): "Semantic retrieval of personal photos using matrix factorization and two-layer random walk fusing sparse speech annotations with visual features", in Proceedings of INTERSPEECH 2014, pp. 1762-1766.