The performance of image retrieval depends critically on the semantic representation and on the distance function used to estimate the similarity of two images. A good representation should integrate multiple visual and textual (e.g., tag) features and move a step closer to the true semantics of interest (e.g., concepts). Because the distance function operates on the representation, the two are interdependent and should be addressed together. We propose a probabilistic solution that learns both the representation and the distance function from data. We use a versatile model known as the Mixed-Variate Restricted Boltzmann Machine (MV.RBM), combined with the idea of sparse groups, to learn a sparse high-level representation from multiple low-level feature types and modalities. The learning is regularised so that the learned representation and information-theoretic metric will (i) preserve the regularities of the visio-textual space, (ii) enhance locally structural sparsity, (iii) encourage small intra-concept distances, and (iv) keep inter-concept images separated. We demonstrate the capacity of our method on the NUS-WIDE dataset. For the well-studied 13-animal subset, our method outperforms state-of-the-art methods. On the subset of single-concept images, we obtain significant gains in MAP score with 79.2 with 45.5% from learning the high-level representation and distance metric.