PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining
- URL: http://arxiv.org/abs/2204.14095v1
- Date: Fri, 29 Apr 2022 13:38:42 GMT
- Title: PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining
- Authors: Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Chunhua Shen
- Abstract summary: We introduce PyramidCLIP, which constructs an input pyramid with different semantic levels and aligns visual and linguistic elements hierarchically.
Experiments on three downstream tasks, including zero-shot image classification, zero-shot image-text retrieval and image object detection, verify the effectiveness of the proposed PyramidCLIP.
- Score: 68.84339672878066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale vision-language pre-training has achieved promising results on
downstream tasks. Existing methods rely heavily on the assumption that the
image-text pairs crawled from the Internet are in perfect one-to-one
correspondence. However, in real scenarios this assumption rarely holds: the
text descriptions, obtained by crawling the affiliated metadata of the images,
often suffer from semantic mismatch and mutual compatibility. To address
these issues, we introduce PyramidCLIP, which constructs an input pyramid with
different semantic levels and aligns visual and linguistic elements
hierarchically via intra-level semantics alignment and cross-level relation
alignment. Furthermore, we adjust the objective function by softening the loss
on negative (unpaired) samples, relaxing the strict constraint during
pre-training and mitigating the risk of the model becoming over-confident.
Experiments on three downstream tasks,
including zero-shot image classification, zero-shot image-text retrieval and
image object detection, verify the effectiveness of the proposed PyramidCLIP.
In particular, with the same amount of pre-training data of 15 million
image-text pairs, PyramidCLIP exceeds CLIP in ImageNet zero-shot classification
top-1 accuracy by 19.2%/18.5%/19.6% with ResNet-50/ViT-B32/ViT-B16 image
encoders, respectively. When scaling to larger datasets, PyramidCLIP trained for
only 8 epochs on 128M image-text pairs comes very close to CLIP trained for 32
epochs on 400M image-text pairs.
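The two training-objective ideas in the abstract lend themselves to a compact illustration. Below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: a symmetric contrastive loss whose one-hot targets are softened with label smoothing (one plausible way to weaken the strict negative-pair constraint), reused across a small pyramid of semantic levels for intra-level and cross-level alignment. The level structure, the use of label smoothing, and all names are illustrative assumptions.
```python
# Minimal sketch, assuming a PyTorch setup; not PyramidCLIP's released code.
import torch
import torch.nn.functional as F

def softened_nce(a, b, temperature=0.07, smoothing=0.1):
    """Symmetric InfoNCE between two embedding batches, with softened negatives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # label_smoothing spreads a little probability mass onto the off-diagonal
    # (unpaired) entries, so crawled pairs are not treated as perfectly matched.
    return 0.5 * (F.cross_entropy(logits, targets, label_smoothing=smoothing) +
                  F.cross_entropy(logits.t(), targets, label_smoothing=smoothing))

def pyramid_alignment_loss(image_levels, text_levels):
    """image_levels / text_levels: lists of (B, D) embeddings, one per semantic
    level, ordered coarse to fine (hypothetical structure)."""
    # Intra-level alignment: match visual and linguistic features of the same level.
    intra = sum(softened_nce(v, t) for v, t in zip(image_levels, text_levels))
    # Cross-level alignment: relate adjacent semantic levels across modalities.
    cross = sum(softened_nce(image_levels[i], text_levels[j])
                for i in range(len(image_levels))
                for j in range(len(text_levels)) if abs(i - j) == 1)
    return intra + cross
```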
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on both image-level tasks relying on coarse-grained information and region-level tasks relying on fine-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- Sieve: Multimodal Dataset Pruning Using Image Captioning Models [11.362835828985494]
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that the common practice of filtering such data with CLIP similarity scores suffers from multiple limitations, including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs.
arXiv Detail & Related papers (2023-10-03T14:53:53Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
- Prefix Conditioning Unifies Language and Label Supervision [84.11127588805138]
We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations.
In experiments, we show that this simple technique improves the performance in zero-shot image recognition accuracy and robustness to the image-level distribution shift.
arXiv Detail & Related papers (2022-06-02T16:12:26Z)
- CyCLIP: Cyclic Contrastive Language-Image Pretraining [34.588147979731374]
Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness.
We demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions.
We propose CyCLIP, a framework for contrastive representation learning that explicitly optimizes the learned representations to be geometrically consistent in the image and text space.
arXiv Detail & Related papers (2022-05-28T15:31:17Z)
- Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation [43.03533959429743]
We propose OTTER, which uses online optimal transport to find a soft image-text match as labels for contrastive learning.
Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs.
arXiv Detail & Related papers (2021-12-17T11:27:26Z)
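The OTTER entry above names a concrete mechanism: online optimal transport producing soft image-text matches that replace one-hot contrastive targets. The following is a minimal sketch of that idea under assumed details; the Sinkhorn settings, loss form, and all names are hypothetical and not OTTER's actual implementation.
```python
# Minimal sketch, assuming a PyTorch setup; not the OTTER implementation.
import torch
import torch.nn.functional as F

def sinkhorn_soft_targets(sim, n_iters=3, eps=0.05):
    """Turn a similarity matrix into soft matching targets with a few Sinkhorn
    normalization steps (illustrative hyperparameters)."""
    q = torch.exp(sim / eps)
    for _ in range(n_iters):
        q = q / q.sum(dim=1, keepdim=True)   # normalize rows
        q = q / q.sum(dim=0, keepdim=True)   # normalize columns
    return q / q.sum(dim=1, keepdim=True)    # each row is a target distribution

def soft_matching_loss(image_emb, text_emb, temperature=0.07):
    """Contrastive loss whose targets are soft matches instead of one-hot labels."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    with torch.no_grad():                    # soft targets carry no gradient
        targets = sinkhorn_soft_targets(logits)
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```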