Method Overview:
First, we extract image and text features from a pre-trained CLIP backbone. Fine-grained semantic interaction is then performed by a multi-modal fusion transformer encoder, which simultaneously extracts multi-modal fusion semantics and constructs a common cross-modal subspace while enhancing the semantic interaction between heterogeneous modalities. Finally, we present a contrastive hash learning module to generate semantics-preserving hash codes.
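The following is a minimal PyTorch sketch of this pipeline under stated assumptions: the module names, feature dimension (512), code length, and loss weighting are illustrative choices, not the paper's exact implementation, and pre-extracted CLIP features are represented by placeholder tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHashModel(nn.Module):
    """Illustrative sketch: fuse CLIP image/text features with a transformer
    encoder, then map the fused tokens to relaxed (continuous) hash codes."""
    def __init__(self, feat_dim=512, hash_bits=64, num_layers=2, num_heads=8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.fusion_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Hash heads project each modality's fused token into the code space.
        self.img_hash_head = nn.Linear(feat_dim, hash_bits)
        self.txt_hash_head = nn.Linear(feat_dim, hash_bits)

    def forward(self, img_feat, txt_feat):
        # Stack the two modality features as a 2-token sequence so the
        # transformer can perform fine-grained cross-modal interaction.
        tokens = torch.stack([img_feat, txt_feat], dim=1)    # (B, 2, D)
        fused = self.fusion_encoder(tokens)                   # (B, 2, D)
        # tanh keeps relaxed codes in (-1, 1); sign() at inference gives binary codes.
        h_img = torch.tanh(self.img_hash_head(fused[:, 0]))
        h_txt = torch.tanh(self.txt_hash_head(fused[:, 1]))
        return h_img, h_txt

def contrastive_hash_loss(h_img, h_txt, temperature=0.07):
    """InfoNCE-style contrastive loss over relaxed hash codes: matched
    image-text pairs are pulled together, mismatched pairs pushed apart."""
    h_img = F.normalize(h_img, dim=-1)
    h_txt = F.normalize(h_txt, dim=-1)
    logits = h_img @ h_txt.t() / temperature
    targets = torch.arange(h_img.size(0), device=h_img.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    # Placeholder tensors standing in for pre-extracted CLIP features.
    img_feat = torch.randn(8, 512)
    txt_feat = torch.randn(8, 512)
    model = FusionHashModel()
    h_img, h_txt = model(img_feat, txt_feat)
    loss = contrastive_hash_loss(h_img, h_txt)
    binary_codes = torch.sign(h_img)   # binary hash codes used at retrieval time
    print(loss.item(), binary_codes.shape)
```

At retrieval time, the relaxed codes would be binarized with `sign()` and compared via Hamming distance; the contrastive objective shown here is one common way to preserve cross-modal semantic similarity in the code space.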