FG-CLIP: Fine-Grained Visual and Textual Alignment
- URL: http://arxiv.org/abs/2505.05071v3
- Date: Wed, 21 May 2025 06:18:41 GMT
- Title: FG-CLIP: Fine-Grained Visual and Textual Alignment
- Authors: Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, Yuhui Yin
- Abstract summary: We propose Fine-Grained CLIP, which enhances fine-grained understanding through three key innovations. We leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. We construct a comprehensive dataset, termed FineHARD, by integrating high-quality region-specific annotations with hard fine-grained negative samples.
- Score: 3.830067625507938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. We construct a comprehensive dataset, termed FineHARD, by integrating high-quality region-specific annotations with hard fine-grained negative samples. Corresponding training methods are meticulously designed for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.
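To make the objectives described in the abstract concrete, below is a minimal, hypothetical PyTorch-style sketch that combines a global image-to-long-caption contrastive term, a region-to-caption contrastive term, and a hard-negative ranking term. All function names, tensor shapes, and loss weights (e.g., `fg_clip_style_loss`, `w_region`, `w_hard`) are illustrative assumptions rather than the authors' released implementation; see the official repository for the actual code.

```python
# Hypothetical sketch of an FG-CLIP-style training objective (not the authors'
# released code). Assumes embeddings have already been produced by image and
# text encoders; shapes below are illustrative.
import torch
import torch.nn.functional as F


def info_nce(img_emb, txt_emb, temperature=0.07):
    """Standard symmetric CLIP (InfoNCE) loss over a batch of paired embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                 # (B, B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def hard_negative_loss(img_emb, pos_txt, neg_txt, temperature=0.07):
    """Rank the positive caption above K hard fine-grained negatives.

    pos_txt: (B, D); neg_txt: (B, K, D) -- negatives are assumed to differ from
    the positive caption only in subtle attributes (color, count, relation, ...).
    """
    img_emb = F.normalize(img_emb, dim=-1)
    pos_txt = F.normalize(pos_txt, dim=-1)
    neg_txt = F.normalize(neg_txt, dim=-1)
    pos_sim = (img_emb * pos_txt).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", img_emb, neg_txt)       # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    targets = torch.zeros(img_emb.size(0), dtype=torch.long,
                          device=img_emb.device)                 # index 0 = positive
    return F.cross_entropy(logits, targets)


def fg_clip_style_loss(global_img, long_txt, region_img, region_txt,
                       hard_neg_txt, w_region=1.0, w_hard=0.5):
    """Illustrative combination: global long-caption alignment + region-level
    alignment + hard fine-grained negatives, with assumed loss weights."""
    loss = info_nce(global_img, long_txt)                        # image <-> long caption
    loss = loss + w_region * info_nce(region_img, region_txt)    # region <-> region caption
    loss = loss + w_hard * hard_negative_loss(region_img, region_txt, hard_neg_txt)
    return loss
```

The hard-negative term reuses the InfoNCE structure but restricts the candidate set to captions that differ from the positive only in subtle details, which is the role the abstract attributes to the 10 million hard fine-grained negative samples.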
Related papers
- HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models [15.877790469608662]
We introduce an LVLM-driven data refinement pipeline to enhance the quality of image-text pair data. We propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags. Our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks.
arXiv Detail & Related papers (2025-07-30T07:21:36Z) - TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z) - DiffCLIP: Few-shot Language-driven Multimodal Classifier [19.145645804307566]
DiffCLIP is a novel framework that extends Contrastive Language-Image Pretraining. It conveys comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP.
arXiv Detail & Related papers (2024-12-10T02:21:39Z) - FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings. Our experiments demonstrate the effectiveness of FLAIR, trained on 30M image-text pairs, in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z) - Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training [30.071860810401933]
This paper advances contrastive language-image pre-training (CLIP) into a novel holistic paradigm. We use image-to-text captioning to generate multi-texts for each image from multiple perspectives, granularities, and hierarchies. Our holistic CLIP significantly outperforms existing CLIP across downstream tasks, including image-text retrieval, open-vocabulary classification, and dense visual tasks.
arXiv Detail & Related papers (2024-11-30T11:27:58Z) - Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on both image-level tasks relying on coarse-grained information and region-level tasks relying on fine-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z) - VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z) - Self-Supervised Image Captioning with CLIP [0.0]
We introduce a self-supervised image captioning method.
After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data.
Despite utilizing less than 2% of the labeled COCO dataset, our method delivers a performance comparable to state-of-the-art models trained on the complete dataset.
arXiv Detail & Related papers (2023-06-26T23:29:16Z) - DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z) - PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [68.84339672878066]
We introduce PyramidCLIP, which constructs an input pyramid with different semantic levels and aligns visual and linguistic elements hierarchically.
Experiments on three downstream tasks, including zero-shot image classification, zero-shot image-text retrieval and image object detection, verify the effectiveness of the proposed PyramidCLIP.
arXiv Detail & Related papers (2022-04-29T13:38:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.