CLIP Multi-modal Hashing: A new baseline CLIPMH
- URL: http://arxiv.org/abs/2308.11797v1
- Date: Tue, 22 Aug 2023 21:29:55 GMT
- Title: CLIP Multi-modal Hashing: A new baseline CLIPMH
- Authors: Jian Zhu, Mingkai Sheng, Mingda Ke, Zhangmin Huang, Jingfei Chang
- Abstract summary: We propose a new baseline CLIP Multi-modal Hashing (CLIPMH) method.
It uses the CLIP model to extract text and image features, which are then fused to generate hash codes.
In comparison to state-of-the-art unsupervised and supervised multi-modal hashing methods, experiments reveal that the proposed CLIPMH can significantly enhance performance.
- Score: 4.057431980018267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal hashing methods are widely used in multimedia retrieval,
fusing multi-source data into binary hash codes. However, current multi-modal
methods suffer from low retrieval accuracy because their individual backbone
networks have limited feature expression capability and are not jointly
pre-trained on large-scale unsupervised multi-modal data. To solve this
problem, we propose a new baseline, the CLIP Multi-modal Hashing (CLIPMH)
method. It uses the CLIP model to extract text and image features, which are
then fused to generate hash codes. CLIP improves the expressiveness of each
modality's features and can thereby greatly improve the retrieval performance
of multi-modal hashing methods. Compared with state-of-the-art unsupervised
and supervised multi-modal hashing methods, experiments reveal that the
proposed CLIPMH significantly enhances performance (maximum increase of
8.38%). CLIP also has clear advantages over the text and visual backbone
networks commonly used in prior work.
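For a concrete picture of the pipeline the abstract describes, below is a minimal PyTorch sketch of the CLIPMH idea: CLIP text and image features are fused and mapped to binary hash codes. The fusion module, feature dimension, and code length are illustrative assumptions; the abstract does not specify the authors' exact architecture.
```python
# A minimal sketch of the CLIPMH idea described in the abstract: CLIP text and
# image features are fused and mapped to binary hash codes. The fusion module,
# feature dimensions, and hash length below are illustrative assumptions, not
# the authors' exact architecture.
import torch
import torch.nn as nn


class CLIPMHSketch(nn.Module):
    def __init__(self, clip_dim: int = 512, hash_bits: int = 64):
        super().__init__()
        # Simple concatenation-based fusion followed by a hashing head
        # (assumed design; the abstract only says "fused to generate hash codes").
        self.fusion = nn.Sequential(
            nn.Linear(2 * clip_dim, clip_dim),
            nn.ReLU(),
        )
        self.hash_head = nn.Linear(clip_dim, hash_bits)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        fused = self.fusion(torch.cat([text_feat, image_feat], dim=-1))
        # tanh gives a relaxed code for training; sign() binarizes at retrieval time.
        return torch.tanh(self.hash_head(fused))


if __name__ == "__main__":
    # Stand-ins for features from a frozen CLIP text/image encoder (e.g. ViT-B/32).
    text_feat = torch.randn(4, 512)
    image_feat = torch.randn(4, 512)
    codes = torch.sign(CLIPMHSketch()(text_feat, image_feat))
    print(codes.shape)  # torch.Size([4, 64]), entries in {-1, +1}
```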
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - CLIP Multi-modal Hashing for Multimedia Retrieval [7.2683522480676395]
We propose a novel CLIP Multi-modal Hashing (CLIPMH) method.
Our method employs the CLIP framework to extract both text and vision features and then fuses them to generate hash codes.
Compared with state-of-the-art unsupervised and supervised multi-modal hashing methods, experiments reveal that the proposed CLIPMH can significantly improve performance.
arXiv Detail & Related papers (2024-10-10T10:13:48Z) - Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z) - Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful approach to large-scale multimedia retrieval.
We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) to address these issues.
Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z) - Multimodal Fake News Detection via CLIP-Guided Learning [26.093561485807832]
This paper proposes FND-CLIP, a multimodal Fake News Detection network based on Contrastive Language-Image Pretraining (CLIP).
Given a targeted multimodal news item, we extract deep representations from the image and text using a ResNet-based encoder, a BERT-based encoder, and two pair-wise CLIP encoders.
The multimodal feature is a concatenation of the CLIP-generated features weighted by the standardized cross-modal similarity of the two modalities (see the sketch after this list).
arXiv Detail & Related papers (2022-05-28T02:43:18Z) - PHPQ: Pyramid Hybrid Pooling Quantization for Efficient Fine-Grained Image Retrieval [68.05570413133462]
We propose a Pyramid Hybrid Pooling Quantization (PHPQ) module to capture and preserve fine-grained semantic information from multi-level features.
Experiments on two widely-used public benchmarks, CUB-200-2011 and Stanford Dogs, demonstrate that PHPQ outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-09-11T07:21:02Z) - Online Enhanced Semantic Hashing: Towards Effective and Efficient Retrieval for Streaming Multi-Modal Data [21.157717777481572]
We propose a new model, termed Online enhAnced SemantIc haShing (OASIS).
We design a novel semantic-enhanced representation for data, which helps handle newly arriving classes.
Our method can exceed the state-of-the-art models.
arXiv Detail & Related papers (2021-09-09T13:30:31Z) - MOON: Multi-Hash Codes Joint Learning for Cross-Media Retrieval [30.77157852327981]
Cross-media hashing has attracted increasing attention for its high computational efficiency and low storage cost.
We develop a novel Multiple hash cOdes jOint learNing method (MOON) for cross-media retrieval.
arXiv Detail & Related papers (2021-08-17T14:47:47Z) - Unsupervised Multi-Index Semantic Hashing [23.169142004594434]
We propose an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing.
We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search.
We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH.
arXiv Detail & Related papers (2021-03-26T13:33:48Z) - Unsupervised Deep Cross-modality Spectral Hashing [65.3842441716661]
The framework is a two-step hashing approach which decouples the optimization into binary optimization and hashing function learning.
We propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations.
We leverage a powerful CNN for images and propose a CNN-based deep architecture to learn the text modality.
arXiv Detail & Related papers (2020-08-01T09:20:11Z) - Creating Something from Nothing: Unsupervised Knowledge Distillation for Cross-Modal Hashing [132.22315429623575]
Cross-modal hashing (CMH) can map contents from different modalities, especially in vision and language, into the same space.
There are two main frameworks for CMH, differing from each other in whether semantic supervision is required.
In this paper, we propose a novel approach that enables guiding a supervised method using outputs produced by an unsupervised method.
arXiv Detail & Related papers (2020-04-01T08:32:15Z)
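As a companion to the FND-CLIP entry above, the following is a rough, hypothetical sketch of the similarity-weighted concatenation its summary describes. The standardization step (here a simple rescaling of cosine similarity to [0, 1]), the function name, and the tensor shapes are assumptions for illustration, not the paper's exact formulation.
```python
# A rough, hypothetical sketch of the similarity-weighted CLIP feature fusion
# described in the FND-CLIP summary above. Standardization and shapes are
# assumptions for illustration only.
import torch
import torch.nn.functional as F


def fuse_fnd_clip(resnet_feat, bert_feat, clip_img_feat, clip_txt_feat):
    # Cross-modal similarity between the paired CLIP embeddings.
    sim = F.cosine_similarity(clip_img_feat, clip_txt_feat, dim=-1)
    # "Standardized" similarity, assumed here to mean rescaled to [0, 1].
    w = (sim + 1.0) / 2.0
    # Weight the CLIP features by the similarity and concatenate everything.
    weighted_clip = torch.cat([clip_img_feat, clip_txt_feat], dim=-1) * w.unsqueeze(-1)
    return torch.cat([resnet_feat, bert_feat, weighted_clip], dim=-1)


if __name__ == "__main__":
    b = 2  # toy batch size
    fused = fuse_fnd_clip(torch.randn(b, 2048), torch.randn(b, 768),
                          torch.randn(b, 512), torch.randn(b, 512))
    print(fused.shape)  # torch.Size([2, 3840])
```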