When CLIP meets cross-modal hashing retrieval: A new strong baseline

Xinyu Xia¹,², Guohua Dong¹, Fengling Li³, Lei Zhu², Xiaomin Ying¹
¹Center for Computational Biology, Beijing Institute of Basic Medical Sciences
²School of Information Science and Engineering, Shandong Normal University
³Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney

Abstract

Recent days have witnessed significant progress on various multi-modal tasks made by Contrastive Language-Image Pre-training (CLIP), a large-scale multi-modal model that learns visual representations from natural language supervision. However, the potential effects of CLIP on cross-modal hashing retrieval have not been investigated yet. In this paper, we explore, for the first time, the effects of CLIP on cross-modal hashing retrieval performance and propose a simple but strong baseline, the Unsupervised Contrastive Multi-modal Fusion Hashing network (UCMFH). We first extract off-the-shelf visual and linguistic features from the CLIP model as the input sources for the cross-modal hashing functions. To further mitigate the semantic gap between the image and text features, we design an effective contrastive multi-modal learning module that leverages a multi-modal fusion transformer encoder, supervised by a contrastive loss, to enhance modality interaction while improving the semantic representation of each modality. Furthermore, we design a contrastive hash learning module to produce high-quality modality-correlated hash codes. Experiments show that our simple new unsupervised baseline UCMFH achieves significant performance improvements over state-of-the-art supervised and unsupervised cross-modal hashing methods. Our experiments also demonstrate the remarkable performance of CLIP features on the cross-modal hashing retrieval task compared to the deep visual and linguistic features used in existing state-of-the-art methods. The source code for our approach is publicly available at: https://github.com/XinyuXia97/UCMFH.
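
As a concrete illustration of the first step (extracting off-the-shelf CLIP features), the following is a minimal sketch using the openai/CLIP package; the backbone name, image path, and caption below are placeholders, and the paper's exact backbone choice and preprocessing may differ:

    import torch
    import clip                      # pip install git+https://github.com/openai/CLIP.git
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # "ViT-B/32" is an assumed backbone; the released code may use a different CLIP variant.
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image path
    text = clip.tokenize(["a caption describing the image"]).to(device)    # placeholder caption

    with torch.no_grad():
        image_features = model.encode_image(image)   # off-the-shelf visual features
        text_features = model.encode_text(text)      # off-the-shelf linguistic features

These frozen features then serve as the inputs to the hashing network described below.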

Method Overview

Unsupervised Contrastive Multi-modal Fusion Hashing network

First, we extract image and text features from the pre-trained CLIP backbone. Then, fine-grained semantic interaction is performed through a multi-modal fusion transformer encoder, which simultaneously extracts multi-modal fusion semantics and constructs a common cross-modal subspace while enhancing the semantic interaction between heterogeneous modalities. We further present a contrastive hash learning module to generate semantics-preserving hash codes.
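
To make the pipeline above more tangible, here is a rough PyTorch sketch of such an architecture. The module names, dimensions, temperature, and the exact form of the contrastive loss are assumptions for illustration rather than the released implementation; please consult the GitHub repository for the actual code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FusionHashNet(nn.Module):
        # Hypothetical UCMFH-style pipeline: CLIP features -> multi-modal fusion
        # transformer encoder -> per-modality hash heads with tanh relaxation.
        def __init__(self, feat_dim=512, hash_bits=64, n_layers=2, n_heads=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
            self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.img_hash = nn.Sequential(nn.Linear(feat_dim, hash_bits), nn.Tanh())
            self.txt_hash = nn.Sequential(nn.Linear(feat_dim, hash_bits), nn.Tanh())

        def forward(self, img_feat, txt_feat):
            # Treat each modality's CLIP feature as one token and let them interact.
            tokens = torch.stack([img_feat, txt_feat], dim=1)   # (B, 2, D)
            fused = self.fusion(tokens)                         # (B, 2, D)
            h_img = self.img_hash(fused[:, 0])                  # relaxed image codes in (-1, 1)
            h_txt = self.txt_hash(fused[:, 1])                  # relaxed text codes in (-1, 1)
            return h_img, h_txt

    def contrastive_hash_loss(h_img, h_txt, temperature=0.07):
        # Symmetric InfoNCE over paired image/text codes: matched pairs are pulled
        # together, mismatched pairs within the batch are pushed apart.
        z_i = F.normalize(h_img, dim=-1)
        z_t = F.normalize(h_txt, dim=-1)
        logits = z_i @ z_t.t() / temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    # At retrieval time, binary codes are obtained from the relaxed outputs,
    # e.g. codes = torch.sign(h_img).

In this sketch the continuous tanh outputs are used during training and binarized with sign() only for retrieval, a common relaxation in deep hashing; the paper's actual quantization strategy and loss weighting may differ.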

BibTeX

@article{xia2023clip,
  title={When CLIP meets cross-modal hashing retrieval: A new strong baseline},
  author={Xia, Xinyu and Dong, Guohua and Li, Fengling and Zhu, Lei and Ying, Xiaomin},
  journal={Information Fusion},
  volume={100},
  pages={101968},
  year={2023},
  publisher={Elsevier}
}